cs.CV - 2023-11-22

Importance of Feature Extraction in the Calculation of Fréchet Distance for Medical Imaging

  • paper_url: http://arxiv.org/abs/2311.13717
  • repo_url: https://github.com/mckellwoodland/fid-med-eval
  • paper_authors: McKell Woodland, Mais Al Taie, Jessica Albuquerque Marques Silva, Mohamed Eltaher, Frank Mohn, Alexander Shieh, Austin Castelo, Suprateek Kundu, Joshua P. Yung, Ankit B. Patel, Kristy K. Brock
  • for: The paper aims to compare state-of-the-art feature extractors for computing Fréchet Distances (FDs) in medical imaging.
  • methods: The authors train a StyleGAN2 network, using data augmentation techniques tailored for limited-data domains, on datasets comprising three medical imaging modalities and four anatomical locations. They compare human evaluation of generative quality (via a visual Turing test) with FDs calculated using ImageNet-trained InceptionV3, ResNet50, SwAV, DINO, and Swin Transformer architectures, as well as an InceptionV3 network trained on a large medical dataset, RadImageNet.
  • results: All ImageNet-based extractors were consistent with each other, but only SwAV was significantly correlated with medical expert judgment. The RadImageNet-based FD showed volatility and lacked correlation with human judgment.
    Abstract Fr\'echet Inception Distance is a widely used metric for evaluating synthetic image quality that utilizes an ImageNet-trained InceptionV3 network as a feature extractor. However, its application in medical imaging lacks a standard feature extractor, leading to biased and inconsistent comparisons. This study aimed to compare state-of-the-art feature extractors for computing Fr\'echet Distances (FDs) in medical imaging. A StyleGAN2 network was trained with data augmentation techniques tailored for limited data domains on datasets comprising three medical imaging modalities and four anatomical locations. Human evaluation of generative quality (via a visual Turing test) was compared to FDs calculated using ImageNet-trained InceptionV3, ResNet50, SwAV, DINO, and Swin Transformer architectures, in addition to an InceptionV3 network trained on a large medical dataset, RadImageNet. All ImageNet-based extractors were consistent with each other, but only SwAV was significantly correlated with medical expert judgment. The RadImageNet-based FD showed volatility and lacked correlation with human judgment. Caution is advised when using medical image-trained extraction networks in the FD calculation. These networks should be rigorously evaluated on the imaging modality under consideration and publicly released. ImageNet-based extractors, while imperfect, are consistent and widely understood. Training extraction networks with SwAV is a promising approach for synthetic medical image evaluation.
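For reference, the Fréchet Distance underlying this comparison measures the distance between two Gaussians fitted to extracted features. A minimal sketch, assuming the (N, D) feature arrays have already been produced by whichever extractor (InceptionV3, SwAV, DINO, ...) is under study; function and variable names are illustrative, not taken from the paper's repository:

```python
# Minimal sketch of the Fréchet Distance between two sets of extracted features.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """FD = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):       # numerical noise can add tiny
        covmean = covmean.real         # imaginary components
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy usage with random arrays standing in for extractor activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))
fake = rng.normal(loc=0.1, size=(256, 64))
print(frechet_distance(real, fake))
```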

DiverseNet: Decision Diversified Semi-supervised Semantic Segmentation Networks for Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2311.13716
  • repo_url: None
  • paper_authors: Wanli Ma, Oktay Karakus, Paul L. Rosin
  • for: Reducing the cost of manual pixel-level labelling for large-scale remote sensing imagery by exploiting useful features from large quantities of unlabelled data.
  • methods: Proposes multi-head and multi-model semi-supervised learning algorithms that simultaneously promote precision and diversity of features during training.
  • results: The proposed multi-head and multi-model methods achieve the highest semantic segmentation performance on four widely used remote sensing imagery datasets, and the DiverseHead architecture is more lightweight than previous methods.
    Abstract Semi-supervised learning is designed to help reduce the cost of the manual labelling process by exploiting the use of useful features from a large quantity of unlabelled data during training. Since pixel-level manual labelling in large-scale remote sensing imagery is expensive, semi-supervised learning becomes an appropriate solution to this. However, most of the existing semi-supervised learning methods still lack efficient perturbation methods to promote diversity of features and the precision of pseudo labels during training. In order to fill this gap, we propose DiverseNet architectures which explore multi-head and multi-model semi-supervised learning algorithms by simultaneously promoting precision and diversity during training. The two proposed methods of DiverseNet, namely the DiverseHead and DiverseModel, achieve the highest semantic segmentation performance in four widely utilised remote sensing imagery data sets compared to state-of-the-art semi-supervised learning methods. Meanwhile, the proposed DiverseHead architecture is relatively lightweight in terms of parameter space compared to the state-of-the-art methods whilst reaching high-performance results for all the tested data sets.
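The abstract describes pseudo-labelling unlabelled imagery with multiple decision heads trained to be both precise and diverse. The PyTorch sketch below shows one generic way such a multi-head pseudo-labelling step could be structured; it is not the DiverseHead implementation, and the toy encoder, confidence threshold, and consensus rule are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSegmenter(nn.Module):
    """A shared encoder followed by several lightweight decision heads."""
    def __init__(self, in_ch=3, feat_ch=32, num_classes=6, num_heads=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Conv2d(feat_ch, num_classes, 1) for _ in range(num_heads)]
        )

    def forward(self, x):
        feat = self.encoder(x)
        return [head(feat) for head in self.heads]   # one logit map per head

def unsupervised_step(model, unlabelled, conf_thresh=0.9):
    """Pseudo-label unlabelled images with the head ensemble and train
    every head against the confident consensus prediction."""
    logits_per_head = model(unlabelled)
    with torch.no_grad():
        mean_prob = torch.stack([F.softmax(l, dim=1) for l in logits_per_head]).mean(0)
        conf, pseudo = mean_prob.max(dim=1)          # consensus pseudo labels
        mask = conf > conf_thresh                    # keep confident pixels only
    loss = 0.0
    for logits in logits_per_head:
        pix_loss = F.cross_entropy(logits, pseudo, reduction="none")
        loss = loss + (pix_loss * mask).sum() / mask.sum().clamp(min=1)
    return loss / len(logits_per_head)

model = MultiHeadSegmenter()
loss = unsupervised_step(model, torch.randn(2, 3, 64, 64))
loss.backward()
```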

A Somewhat Robust Image Watermark against Diffusion-based Editing Models

  • paper_url: http://arxiv.org/abs/2311.13713
  • repo_url: None
  • paper_authors: Mingtian Tan, Tianhao Wang, Somesh Jha
  • for: This paper targets the image synthesis capabilities of diffusion models (DMs) and their application to image editing and creation, and the copyright and malicious-editing issues they raise.
  • methods: Using adversarial example techniques, the authors develop an invisible watermarking technique named Robust Invisible Watermarking (RIW) to embed invisible watermarks into images.
  • results: RIW maintains a 96% watermark extraction accuracy after editing, whereas conventional methods offer 0%.
    Abstract Recently, diffusion models (DMs) have become the state-of-the-art method for image synthesis. Editing models based on DMs, known for their high fidelity and precision, have inadvertently introduced new challenges related to image copyright infringement and malicious editing. Our work is the first to formalize and address this issue. After assessing and attempting to enhance traditional image watermarking techniques, we recognize their limitations in this emerging context. In response, we develop a novel technique, RIW (Robust Invisible Watermarking), to embed invisible watermarks leveraging adversarial example techniques. Our technique ensures a high extraction accuracy of $96\%$ for the invisible watermark after editing, compared to the $0\%$ offered by conventional methods. We provide access to our code at https://github.com/BennyTMT/RIW.
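RIW embeds the watermark by leveraging adversarial example techniques. The sketch below illustrates the general idea of optimising an imperceptible perturbation so that a watermark decoder recovers a bit string; the decoder, loss, and perturbation budget are placeholders and assumptions, not the paper's released method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder decoder standing in for a trained watermark extractor.
decoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32),
)

def embed_watermark(image, bits, eps=4 / 255, steps=100, lr=1e-2):
    """Optimise a small perturbation so the decoder recovers `bits`,
    keeping the change within an L-infinity budget of `eps`."""
    delta = torch.zeros_like(image, requires_grad=True)
    target = bits.float()
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = decoder((image + delta).clamp(0, 1))
        loss = F.binary_cross_entropy_with_logits(logits, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)       # imperceptibility constraint
    return (image + delta.detach()).clamp(0, 1)

image = torch.rand(1, 3, 64, 64)
bits = torch.randint(0, 2, (1, 32))
watermarked = embed_watermark(image, bits)
recovered = (decoder(watermarked) > 0).long()
print((recovered == bits).float().mean())   # bit accuracy on the clean image
```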

A Comprehensive Review of Artificial Intelligence Applications in Major Retinal Conditions

  • paper_url: http://arxiv.org/abs/2311.13710
  • repo_url: None
  • paper_authors: Hina Raja, Taimur Hassan, Bilal Hassan, Muhammad Usman Akram, Hira Raja, Alaa A Abd-alrazaq, Siamak Yousefi, Naoufel Werghi
  • for: Provides a systematic survey of retinal diseases, emphasizing the importance of early detection for effective treatment.
  • methods: Covers both clinical and automated approaches for detecting retinal disease, focusing on studies from the past decade.
  • results: Evaluates algorithms across different imaging modalities for identifying structural abnormalities and diagnosing retinal diseases, and identifies future research directions.
    Abstract This paper provides a systematic survey of retinal diseases that cause visual impairments or blindness, emphasizing the importance of early detection for effective treatment. It covers both clinical and automated approaches for detecting retinal disease, focusing on studies from the past decade. The survey evaluates various algorithms for identifying structural abnormalities and diagnosing retinal diseases, and it identifies future research directions based on a critical analysis of existing literature. This comprehensive study, which reviews both clinical and automated detection methods using different modalities, appears to be unique in its scope. Additionally, the survey serves as a helpful guide for researchers interested in digital retinopathy.

Multi-view Hybrid Graph Convolutional Network for Volume-to-mesh Reconstruction in Cardiovascular MRI

  • paper_url: http://arxiv.org/abs/2311.13706
  • repo_url: None
  • paper_authors: Nicolás Gaggion, Benjamin A. Matheson, Yan Xia, Rodrigo Bonazzola, Nishant Ravikumar, Zeike A. Taylor, Diego H. Milone, Alejandro F. Frangi, Enzo Ferrante
  • for: This paper aims to advance cardiovascular imaging so that cardiac morphology and function can be studied more effectively.
  • methods: It proposes a novel direct image-to-mesh extraction algorithm, HybridVNet, which combines standard convolutional neural networks with graph convolutions and improves mesh generation accuracy through deep supervision and mesh-specific regularisation.
  • results: Experiments show that HybridVNet efficiently generates high-fidelity, simulation-ready cardiac meshes from CMR images, and a multi-view variant processing both long-axis and short-axis CMR further improves mesh-generation performance.
    Abstract Cardiovascular magnetic resonance imaging is emerging as a crucial tool to examine cardiac morphology and function. Essential to this endeavour are anatomical 3D surface and volumetric meshes derived from CMR images, which facilitate computational anatomy studies, biomarker discovery, and in-silico simulations. However, conventional surface mesh generation methods, such as active shape models and multi-atlas segmentation, are highly time-consuming and require complex processing pipelines to generate simulation-ready 3D meshes. In response, we introduce HybridVNet, a novel architecture for direct image-to-mesh extraction seamlessly integrating standard convolutional neural networks with graph convolutions, which we prove can efficiently handle surface and volumetric meshes by encoding them as graph structures. To further enhance accuracy, we propose a multiview HybridVNet architecture which processes both long axis and short axis CMR, showing that it can increase the performance of cardiac MR mesh generation. Our model combines traditional convolutional networks with variational graph generative models, deep supervision and mesh-specific regularisation. Experiments on a comprehensive dataset from the UK Biobank confirm the potential of HybridVNet to significantly advance cardiac imaging and computational cardiology by efficiently generating high-fidelity and simulation ready meshes from CMR images.

Masked Conditional Diffusion Models for Image Analysis with Application to Radiographic Diagnosis of Infant Abuse

  • paper_url: http://arxiv.org/abs/2311.13688
  • repo_url: None
  • paper_authors: Shaoju Wu, Sila Kurugol, Andy Tsai
  • for: To help radiologists detect the classic metaphyseal lesion (CML) of infant abuse on distal tibial radiographs.
  • methods: Uses a masked conditional diffusion model (MaC-DM) for data augmentation, incorporating weighted segmentation masks of the tibia and CML fracture sites as additional conditions for classifier guidance.
  • results: The images and associated segmentation labels generated by MaC-DM improve the performance of ResNet-34 in classifying normal versus CML radiographs and of U-Net in labeling CML regions on distal tibial radiographs.
    Abstract The classic metaphyseal lesion (CML) is a distinct injury that is highly specific for infant abuse. It commonly occurs in the distal tibia. To aid radiologists detect these subtle fractures, we need to develop a model that can flag abnormal distal tibial radiographs (i.e. those with CMLs). Unfortunately, the development of such a model requires a large and diverse training database, which is often not available. To address this limitation, we propose a novel generative model for data augmentation. Unlike previous models that fail to generate data that span the diverse radiographic appearance of the distal tibial CML, our proposed masked conditional diffusion model (MaC-DM) not only generates realistic-appearing and wide-ranging synthetic images of the distal tibial radiographs with and without CMLs, it also generates their associated segmentation labels. To achieve these tasks, MaC-DM combines the weighted segmentation masks of the tibias and the CML fracture sites as additional conditions for classifier guidance. The augmented images from our model improved the performances of ResNet-34 in classifying normal radiographs and those with CMLs. Further, the augmented images and their associated segmentation masks enhanced the performance of the U-Net in labeling areas of the CMLs on distal tibial radiographs.

Single-Shot Plug-and-Play Methods for Inverse Problems

  • paper_url: http://arxiv.org/abs/2311.13682
  • repo_url: None
  • paper_authors: Yanqi Cheng, Lipei Zhang, Zhenda Shen, Shujun Wang, Lequan Yu, Raymond H. Chan, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
  • for: This work aims to solve inverse problems using Plug-and-Play (PnP) priors, with a focus on settings with minimal data.
  • methods: The proposed Single-Shot PnP (SS-PnP) methods first integrate a single-shot proximal denoiser into iterative methods, enabling training with single instances. Second, they propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the vanishing-gradient problem.
  • results: Extensive numerical and visual experiments show that the method yields better approximations.
    Abstract The utilisation of Plug-and-Play (PnP) priors in inverse problems has become increasingly prominent in recent years. This preference is based on the mathematical equivalence between the general proximal operator and the regularised denoiser, facilitating the adaptation of various off-the-shelf denoiser priors to a wide range of inverse problems. However, existing PnP models predominantly rely on pre-trained denoisers using large datasets. In this work, we introduce Single-Shot PnP methods (SS-PnP), shifting the focus to solving inverse problems with minimal data. First, we integrate Single-Shot proximal denoisers into iterative methods, enabling training with single instances. Second, we propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients. We demonstrate, through extensive numerical and visual experiments, that our method leads to better approximations.
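A Plug-and-Play iteration replaces the proximal operator of the regulariser with a denoiser inside an otherwise standard optimisation loop. Below is a minimal sketch of such a proximal-gradient PnP loop, assuming a simple blur forward operator and a stand-in denoiser (the paper instead trains a single-shot proximal denoiser and implicit neural priors):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def forward_op(x):
    """Degradation model A(x): here a simple blur, assumed for illustration."""
    return gaussian_filter(x, sigma=2.0)

def denoiser(x):
    """Stand-in for a learned proximal denoiser; any denoiser can be plugged in."""
    return gaussian_filter(x, sigma=0.8)

def pnp_proximal_gradient(y, steps=50, step_size=1.0):
    """x_{k+1} = Denoise( x_k - step * A^T (A x_k - y) ).
    For the symmetric blur used here, A^T is approximated by A itself."""
    x = y.copy()
    for _ in range(steps):
        grad = forward_op(forward_op(x) - y)      # data-fidelity gradient
        x = denoiser(x - step_size * grad)        # denoiser acts as the prior's prox
    return x

rng = np.random.default_rng(0)
clean = rng.random((64, 64))
y = forward_op(clean) + 0.01 * rng.standard_normal((64, 64))
restored = pnp_proximal_gradient(y)
print(np.mean((restored - clean) ** 2))
```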

Compact 3D Gaussian Representation for Radiance Field

  • paper_url: http://arxiv.org/abs/2311.13681
  • repo_url: None
  • paper_authors: Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, Eunbyung Park
  • for: Efficient 3D scene representation and rendering.
  • methods: Proposes a learnable-mask strategy for 3D Gaussian splatting, a grid-based neural field to represent view-dependent color, and vector quantization to compress the Gaussian attributes.
  • results: Achieves over 10x storage reduction and faster rendering while maintaining scene-representation quality, making it more efficient, compact, and real-time capable than 3DGS.
    Abstract Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However, one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussisan-based representation and adopts the rasterization pipeline to render the images rather than volumetric rendering, achieving very fast rendering speed and promising image quality. However, a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric attributes of Gaussian by vector quantization. In our extensive experiments, we consistently show over 10$\times$ reduced storage and enhanced rendering speed, while maintaining the quality of the scene representation, compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/.
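The learnable-mask idea can be realised with a straight-through estimator: a hard 0/1 mask is applied in the forward pass while gradients flow through a soft sigmoid, and a sparsity term encourages pruning Gaussians the rendering loss does not need. The sketch below is a generic illustration of that mechanism, not the paper's implementation; shapes, thresholds, and the sparsity weight are assumptions:

```python
import torch
import torch.nn as nn

class MaskedGaussians(nn.Module):
    """N Gaussians with a learnable mask that softly removes points during
    training and can be hard-thresholded afterwards to prune storage."""
    def __init__(self, num_points=10000):
        super().__init__()
        self.opacity = nn.Parameter(torch.rand(num_points))
        self.mask_logit = nn.Parameter(torch.zeros(num_points))

    def forward(self, mask_thresh=0.01):
        soft = torch.sigmoid(self.mask_logit)
        hard = (soft > mask_thresh).float()
        # Straight-through: hard 0/1 mask in the forward pass, gradients
        # flow through the soft sigmoid in the backward pass.
        mask = hard + soft - soft.detach()
        return self.opacity * mask, mask

model = MaskedGaussians()
effective_opacity, mask = model()
# A sparsity term encourages masking out unnecessary Gaussians; the real
# objective would also include the rendering loss (not shown).
loss = effective_opacity.mean() + 5e-4 * torch.sigmoid(model.mask_logit).mean()
loss.backward()
print(int(mask.sum()), "Gaussians kept")
```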

BenthIQ: a Transformer-Based Benthic Classification Model for Coral Restoration

  • paper_url: http://arxiv.org/abs/2311.13661
  • repo_url: None
  • paper_authors: Rupa Kurinchi-Vendhan, Drew Gray, Elijah Cole
  • for: To improve the accuracy of benthic classification in shallow reef imagery, supporting the management and restoration of coral reef ecosystems.
  • methods: Proposes a multi-label semantic segmentation network that uses a hierarchical Swin Transformer backbone within a U-shaped encoder-decoder architecture for local-global semantic feature learning.
  • results: In a real-world case study in French Polynesia, the approach outperforms traditional CNN and attention-based models on pixel-wise classification of shallow reef imagery.
    Abstract Coral reefs are vital for marine biodiversity, coastal protection, and supporting human livelihoods globally. However, they are increasingly threatened by mass bleaching events, pollution, and unsustainable practices with the advent of climate change. Monitoring the health of these ecosystems is crucial for effective restoration and management. Current methods for creating benthic composition maps often compromise between spatial coverage and resolution. In this paper, we introduce BenthIQ, a multi-label semantic segmentation network designed for high-precision classification of underwater substrates, including live coral, algae, rock, and sand. Although commonly deployed CNNs are limited in learning long-range semantic information, transformer-based models have recently achieved state-of-the-art performance in vision tasks such as object detection and image classification. We integrate the hierarchical Swin Transformer as the backbone of a U-shaped encoder-decoder architecture for local-global semantic feature learning. Using a real-world case study in French Polynesia, we demonstrate that our approach outperforms traditional CNN and attention-based models on pixel-wise classification of shallow reef imagery.

Panda or not Panda? Understanding Adversarial Attacks with Interactive Visualization

  • paper_url: http://arxiv.org/abs/2311.13656
  • repo_url: None
  • paper_authors: Yuzhe You, Jarvis Tse, Jian Zhao
  • for: To help adversarial machine learning (AML) learners better understand evasion attacks, their impacts, and how to strengthen model robustness.
  • methods: Presents AdvEx, a multi-level interactive visualization system that comprehensively presents the properties and impacts of evasion attacks on different image classifiers for novice AML learners.
  • results: A two-part evaluation with user studies and expert interviews shows that AdvEx is highly effective as a visualization tool for understanding AML mechanisms and also provides an engaging, enjoyable learning experience.
    Abstract Adversarial machine learning (AML) studies attacks that can fool machine learning algorithms into generating incorrect outcomes as well as the defenses against worst-case attacks to strengthen model robustness. Specifically for image classification, it is challenging to understand adversarial attacks due to their use of subtle perturbations that are not human-interpretable, as well as the variability of attack impacts influenced by diverse methodologies, instance differences, and model architectures. Through a design study with AML learners and teachers, we introduce AdvEx, a multi-level interactive visualization system that comprehensively presents the properties and impacts of evasion attacks on different image classifiers for novice AML learners. We quantitatively and qualitatively assessed AdvEx in a two-part evaluation including user studies and expert interviews. Our results show that AdvEx is not only highly effective as a visualization tool for understanding AML mechanisms, but also provides an engaging and enjoyable learning experience, thus demonstrating its overall benefits for AML learners.

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar

  • paper_url: http://arxiv.org/abs/2311.13655
  • repo_url: None
  • paper_authors: Berna Kabadayi, Wojciech Zielonka, Bharat Lal Bhatnagar, Gerard Pons-Moll, Justus Thies
  • for: High-fidelity digital humans and 3D facial avatars, addressing the limitation of prior methods that cannot capture profile and back views.
  • methods: An image-based 3D-aware generative model trained from 2D images with corresponding camera parameters, without requiring precise facial expression tracking.
  • results: Compared with state-of-the-art monocular methods, it provides higher-quality image synthesis while not requiring expression tracking of the training data.
    Abstract Digital humans and, especially, 3D facial avatars have raised a lot of attention in the past years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss out on parts of the side and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from their requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming to have access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we only assume to have a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data.

Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

  • paper_url: http://arxiv.org/abs/2311.13602
  • repo_url: None
  • paper_authors: Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, Kiyoharu Aizawa
  • for: To propose a content-aware automatic layout generation method that improves layout quality for content such as e-commerce product images.
  • methods: Proposes the Retrieval-Augmented Layout Transformer (RALF), which retrieves nearest-neighbor layout examples based on the input image and feeds these results into an autoregressive generator.
  • results: Extensive experiments show that RALF generates high-quality content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
    Abstract Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content, such as an e-commerce product image. In this paper, we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model, which is named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
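The retrieval-augmentation step amounts to fetching the training layouts whose image features are closest to the input image and passing them to the autoregressive generator as extra context. A minimal sketch of that retrieval step, with placeholder feature arrays and layout records (names assumed, not from the paper):

```python
import numpy as np

def retrieve_layouts(query_feat, gallery_feats, gallery_layouts, k=3):
    """Return the k layouts whose image features are closest (cosine) to the
    query image feature; these are fed to the generator as context."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    top = np.argsort(-sims)[:k]
    return [gallery_layouts[i] for i in top]

rng = np.random.default_rng(0)
gallery_feats = rng.normal(size=(1000, 512))             # precomputed image features
gallery_layouts = [f"layout_{i}" for i in range(1000)]   # placeholder layout records
query_feat = rng.normal(size=512)
print(retrieve_layouts(query_feat, gallery_feats, gallery_layouts))
```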

Diffusion models meet image counter-forensics

  • paper_url: http://arxiv.org/abs/2311.13629
  • repo_url: https://github.com/mtailanian/diff-cf
  • paper_authors: Matías Tailanian, Marina Gardella, Álvaro Pardo, Pablo Musé
  • for: This work investigates a diffusion-based counter-forensics technique that erases the traces left by image forgeries in order to deceive forensic detectors.
  • methods: Uses diffusion purification to remove forgery traces from tampered images, thereby evading forensic analysis.
  • results: Diffusion purification hides forgery traces effectively, outperforming existing counter-forensics techniques while preserving the natural look of the purified images.
    Abstract From its acquisition in the camera sensors to its storage, different operations are performed to generate the final image. This pipeline imprints specific traces into the image to form a natural watermark. Tampering with an image disturbs these traces; these disruptions are clues that are used by most methods to detect and locate forgeries. In this article, we assess the capabilities of diffusion models to erase the traces left by forgers and, therefore, deceive forensics methods. Such an approach has been recently introduced for adversarial purification, achieving significant performance. We show that diffusion purification methods are well suited for counter-forensics tasks. Such approaches outperform already existing counter-forensics techniques both in deceiving forensics methods and in preserving the natural look of the purified images. The source code is publicly available at https://github.com/mtailanian/diff-cf.
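Diffusion purification adds a controlled amount of noise to the tampered image and then runs the reverse diffusion process, which tends to wash out the subtle forensic traces. A toy DDPM-style sketch of that noise-then-denoise procedure, with a placeholder noise-prediction network and an assumed noise schedule (not the repository's code):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoise_fn(x_t, t):
    """Placeholder for a pretrained noise-prediction network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def purify(image, t_star=300, rng=np.random.default_rng(0)):
    """Forward-noise the image to step t_star, then run the reverse chain."""
    eps = rng.standard_normal(image.shape)
    x = np.sqrt(alpha_bar[t_star]) * image + np.sqrt(1 - alpha_bar[t_star]) * eps
    for t in range(t_star, 0, -1):
        eps_hat = denoise_fn(x, t)
        coef = betas[t] / np.sqrt(1 - alpha_bar[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 1:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

tampered = np.random.default_rng(1).random((64, 64, 3))
print(purify(tampered).shape)
```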

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

  • paper_url: http://arxiv.org/abs/2311.13600
  • repo_url: https://github.com/mkshing/ziplora-pytorch
  • paper_authors: Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani
  • for: To propose a concept-driven personalization method for fine-tuning generative models that enables generation of any user-provided subject in any user-provided style.
  • methods: Uses low-rank adaptation (LoRA) for concept-driven personalization, training style and subject LoRAs independently and then cheaply and effectively merging them (ZipLoRA) while keeping subject and style disentangled.
  • results: Experiments on a wide range of subject and style combinations show that ZipLoRA generates compelling results with meaningful improvements over baselines in subject and style fidelity, while preserving the ability to recontextualize. Project page: https://ziplora.github.io
    Abstract Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io
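ZipLoRA merges independently trained subject and style LoRAs with learned merger coefficients while discouraging the two scaled updates from interfering. The sketch below loosely illustrates that idea for a single linear layer; the per-column coefficients, the cosine-overlap penalty, and all names are assumptions rather than the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lora_delta(A, B):
    """LoRA weight update Delta W = B @ A for one linear layer."""
    return B @ A

class ZipMerger(nn.Module):
    """Per-column merger coefficients blending a subject and a style delta."""
    def __init__(self, in_dim):
        super().__init__()
        self.m_subject = nn.Parameter(torch.ones(1, in_dim))
        self.m_style = nn.Parameter(torch.ones(1, in_dim))

    def forward(self, dW_subject, dW_style):
        merged = self.m_subject * dW_subject + self.m_style * dW_style
        # Penalise overlap between the scaled columns of the two deltas so
        # that subject and style do not interfere after merging.
        cos = F.cosine_similarity(self.m_subject * dW_subject,
                                  self.m_style * dW_style, dim=0)
        return merged, cos.abs().mean()

# Toy LoRA factors for a 64x64 layer with rank 4 (illustrative only).
rank, dim = 4, 64
dW_sub = lora_delta(torch.randn(rank, dim), torch.randn(dim, rank))
dW_sty = lora_delta(torch.randn(rank, dim), torch.randn(dim, rank))
merger = ZipMerger(dim)
merged, overlap_penalty = merger(dW_sub, dW_sty)
loss = overlap_penalty          # plus the subject/style reconstruction losses
loss.backward()
print(merged.shape, float(overlap_penalty))
```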

T-Rex: Counting by Visual Prompting

  • paper_url: http://arxiv.org/abs/2311.13596
  • repo_url: None
  • paper_authors: Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang
  • for: This work designs T-Rex, an interactive object counting model that first detects and then counts any object.
  • methods: Formulates object counting as an open-set object detection task with visual prompts: users mark points or boxes of interest on a reference image, and T-Rex detects all objects with a similar pattern. Guided by T-Rex's visual feedback, users can interactively refine the counting results by prompting on missing or falsely detected objects.
  • results: T-Rex achieves state-of-the-art performance on several class-agnostic counting benchmarks, shows strong zero-shot counting ability on a newly established benchmark covering diverse scenarios and challenges, and demonstrates excellent reliability and accuracy in practical application scenarios.
    Abstract We introduce T-Rex, an interactive object counting model designed to first detect and then count any objects. We formulate object counting as an open-set object detection task with the integration of visual prompts. Users can specify the objects of interest by marking points or boxes on a reference image, and T-Rex then detects all objects with a similar pattern. Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects. T-Rex has achieved state-of-the-art performance on several class-agnostic counting benchmarks. To further exploit its potential, we established a new counting benchmark encompassing diverse scenarios and challenges. Both quantitative and qualitative results show that T-Rex possesses exceptional zero-shot counting capabilities. We also present various practical application scenarios for T-Rex, illustrating its potential in the realm of visual prompting.

XAGen: 3D Expressive Human Avatars Generation

  • paper_url: http://arxiv.org/abs/2311.13574
  • repo_url: https://github.com/magic-research/xagen
  • paper_authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Jiashi Feng, Mike Zheng Shou
  • for: To propose XAGen, a 3D generative model for human avatars with expressive control over body, face, and hands, improving the realism and controllability of generated human images.
  • methods: Adopts a multi-scale, multi-part 3D representation that captures fine details in small-scale regions, and proposes a multi-part rendering technique that disentangles the synthesis of body, face, and hands.
  • results: Experiments show that XAGen surpasses state-of-the-art methods in terms of realism, diversity, and expressive control abilities.
    Abstract Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressive control over body, face, and hands. To enhance the fidelity of small-scale regions like face and hands, we devise a multi-scale and multi-part 3D representation that models fine details. Based on this representation, we propose a multi-part rendering technique that disentangles the synthesis of body, face, and hands to ease model training and enhance geometric quality. Furthermore, we design multi-part discriminators that evaluate the quality of the generated avatars with respect to their appearance and fine-grained control capabilities. Experiments show that XAGen surpasses state-of-the-art methods in terms of realism, diversity, and expressive control abilities. Code and data will be made available at https://showlab.github.io/xagen.

WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

  • paper_url: http://arxiv.org/abs/2311.13570
  • repo_url: None
  • paper_authors: Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis
  • for: Learning-based 3D-aware image synthesis with high photorealism and 3D-consistent viewpoint changes.
  • methods: Uses a latent diffusion model (LDM) for 3D-aware image synthesis, modeling instances in view space and leveraging cues from monocular depth prediction to learn a faithful 3D representation.
  • results: Produces high-quality, 3D-consistent image samples that outperform recent state-of-the-art GAN-based methods, without requiring posed images or learned pose or camera distributions; the 3D representation is learned directly from in-the-wild image data.
    Abstract Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D results.

ADriver-I: A General World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.13549
  • repo_url: None
  • paper_authors: Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, Tiancai Wang
  • for: To propose an autonomous driving world model based on a multimodal large language model (MLLM) and diffusion techniques, improving driving performance and interpretability.
  • methods: Unifies visual features and control signals as interleaved vision-action pairs; the model autoregressively predicts the control signal of the current frame, and the predicted control signals together with the historical vision-action pairs are used to predict future frames, allowing the process to be repeated indefinitely.
  • results: Extensive experiments on nuScenes and a large-scale private dataset show that ADriver-I achieves impressive performance compared with several constructed baselines.
    Abstract Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated infinite times, ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence.

DiffusionMat: Alpha Matting as Sequential Refinement Learning

  • paper_url: http://arxiv.org/abs/2311.13535
  • repo_url: None
  • paper_authors: Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo
  • for: The purpose of this paper is to propose a new image matting framework that better solves the alpha matting problem.
  • methods: This paper uses a diffusion model to refine the alpha matte from a coarse estimate to a refined one, treating matting as sequential refinement learning.
  • results: Evaluated on several image matting benchmarks, DiffusionMat consistently outperforms existing methods.
    Abstract In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page at~\url{https://cnnlstm.github.io/DiffusionMat

Leveraging CNNs and Ensemble Learning for Automated Disaster Image Classification

  • paper_url: http://arxiv.org/abs/2311.13531
  • repo_url: None
  • paper_authors: Archit Rathod, Veer Pariawala, Mokshit Surana, Kumkum Saxena
  • for: This paper addresses the classification of natural disaster images using convolutional neural networks (CNNs).
  • methods: Multiple CNN architectures are built and trained on a dataset of earthquake, flood, wildfire, and volcano images; hyperparameters are tuned for each model, and a stacked CNN ensemble with XGBoost as the meta-model combines the strengths of the CNN and ResNet models.
  • results: The stacked CNN ensemble achieves 95% accuracy and F1 scores up to 0.96 for individual classes, demonstrating the effectiveness of CNN-based models for automated disaster image classification.
    Abstract Natural disasters act as a serious threat globally, requiring effective and efficient disaster management and recovery. This paper focuses on classifying natural disaster images using Convolutional Neural Networks (CNNs). Multiple CNN architectures were built and trained on a dataset containing images of earthquakes, floods, wildfires, and volcanoes. A stacked CNN ensemble approach proved to be the most effective, achieving 95% accuracy and an F1 score going up to 0.96 for individual classes. Tuning hyperparameters of individual models for optimization was critical to maximize the models' performance. The stacking of CNNs with XGBoost acting as the meta-model utilizes the strengths of the CNN and ResNet models to improve the overall accuracy of the classification. Results obtained from the models illustrated the potency of CNN-based models for automated disaster image classification. This lays the foundation for expanding these techniques to build robust systems for disaster response, damage assessment, and recovery management.
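Stacking combines the base CNNs' class-probability outputs into meta-features on which XGBoost is trained as the meta-model. A minimal sketch of that stacking step, using random arrays as stand-ins for the probabilities that the trained CNN/ResNet base models would actually produce:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Stand-ins for the per-class probability outputs of the trained base models;
# in practice these would come from each CNN's predictions on held-out images.
rng = np.random.default_rng(0)
n_samples, n_classes, n_base_models = 1000, 4, 3
y = rng.integers(0, n_classes, size=n_samples)
base_probs = [
    np.eye(n_classes)[y] * 0.6 + 0.4 * rng.random((n_samples, n_classes))
    for _ in range(n_base_models)
]

# Meta-features: concatenate every base model's probability vector.
X_meta = np.concatenate(base_probs, axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_meta, y, test_size=0.2, random_state=0)

meta_model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
meta_model.fit(X_tr, y_tr)
print("stacked accuracy:", (meta_model.predict(X_te) == y_te).mean())
```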

Hybrid Whale-Mud-Ring Optimization for Precise Color Skin Cancer Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13512
  • repo_url: None
  • paper_authors: Amir Hamza, Badis Lekouaghet, Yassine Himeur
  • for: To improve the precision of skin cancer detection and thereby help preserve patients' health and well-being.
  • methods: Proposes a multilevel thresholding approach based on the Mud Ring Algorithm hybridized with the Whale Optimization Algorithm (WMRA), using the bubble-net attack and mud-ring strategy to escape local optima and obtain optimal thresholds.
  • results: Experimental results show that WMRA outperforms a cluster of recent methods in terms of fitness, PSNR, and MSE.
    Abstract Timely identification and treatment of rapidly progressing skin cancers can significantly contribute to the preservation of patients' health and well-being. Dermoscopy, a dependable and accessible tool, plays a pivotal role in the initial stages of skin cancer detection. Consequently, the effective processing of digital dermoscopy images holds significant importance in elevating the accuracy of skin cancer diagnoses. Multilevel thresholding is a key tool in medical imaging that extracts objects within the image to facilitate its analysis. In this paper, an enhanced version of the Mud Ring Algorithm hybridized with the Whale Optimization Algorithm, named WMRA, is proposed. The proposed approach utilizes bubble-net attack and mud ring strategy to overcome stagnation in local optima and obtain optimal thresholds. The experimental results show that WMRA is powerful against a cluster of recent methods in terms of fitness, Peak Signal to Noise Ratio (PSNR), and Mean Square Error (MSE).
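Multilevel thresholding is typically posed as maximising a fitness such as Otsu's between-class variance over candidate threshold sets, with the metaheuristic (here WMRA) searching the threshold space. A sketch of such a fitness function, assuming an Otsu-style objective (the paper's exact criterion may differ):

```python
import numpy as np

def between_class_variance(hist, thresholds):
    """Otsu-style fitness for a candidate set of thresholds over a
    256-bin image histogram (higher is better)."""
    p = hist / hist.sum()
    levels = np.arange(256)
    mu_total = (p * levels).sum()
    edges = [0] + sorted(int(t) for t in thresholds) + [256]
    fitness = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = p[lo:hi].sum()
        if w <= 0:
            continue
        mu = (p[lo:hi] * levels[lo:hi]).sum() / w
        fitness += w * (mu - mu_total) ** 2
    return fitness

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(128, 128))
hist = np.bincount(image.ravel(), minlength=256).astype(float)

# A metaheuristic (WOA, MRA, WMRA, ...) would evolve candidate threshold
# vectors; here we simply score one random candidate.
candidate = sorted(rng.integers(1, 255, size=3))
print(between_class_variance(hist, candidate))
```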

Deep-learning-based acceleration of MRI for radiotherapy planning of pediatric patients with brain tumors

  • paper_url: http://arxiv.org/abs/2311.13485
  • repo_url: https://github.com/stjude/deepmrirec
  • paper_authors: Shahinur Alam, Jinsoo Uh, Alexander Dresner, Chia-ho Hua, Khaled Khairy
  • for: To accelerate MRI, a non-invasive tool for diagnosis and radiotherapy (RT) planning that provides detailed insight into human anatomy but requires long scan times.
  • methods: Uses a deep-learning-based method (DeepMRIRec) to reconstruct MR images from undersampled data acquired with RT-specific receiver coil arrangements.
  • results: DeepMRIRec reduces scan time by a factor of four and achieves a structural similarity score of 0.960 against fully sampled data, surpassing the evaluated state-of-the-art method (0.896) and demonstrating its potential for accelerating MRI scanning for RT planning.
    Abstract Magnetic Resonance Imaging (MRI) is a non-invasive diagnostic and radiotherapy (RT) planning tool, offering detailed insights into the anatomy of the human body. The extensive scan time is stressful for patients, who must remain motionless in a prolonged imaging procedure that prioritizes reduction of imaging artifacts. This is challenging for pediatric patients who may require measures for managing voluntary motions such as anesthesia. Several computational approaches reduce scan time (fast MRI), by recording fewer measurements and digitally recovering full information via post-acquisition reconstruction. However, most fast MRI approaches were developed for diagnostic imaging, without addressing reconstruction challenges specific to RT planning. In this work, we developed a deep learning-based method (DeepMRIRec) for MRI reconstruction from undersampled data acquired with RT-specific receiver coil arrangements. We evaluated our method against fully sampled data of T1-weighted MR images acquired from 73 children with brain tumors/surgical beds using loop and posterior coils (12 channels), with and without applying virtual compression of coil elements. DeepMRIRec reduced scanning time by a factor of four producing a structural similarity score surpassing the evaluated state-of-the-art method (0.960 vs 0.896), thereby demonstrating its potential for accelerating MRI scanning for RT planning.
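Fast MRI records only a subset of k-space; the zero-filled reconstruction of the undersampled data is what a network like DeepMRIRec would then map toward fully sampled quality. A sketch of retrospective column-wise undersampling, with the sampling pattern and ratios chosen for illustration only:

```python
import numpy as np

def undersample_kspace(image, keep_fraction=0.25, center_fraction=0.08,
                       rng=np.random.default_rng(0)):
    """Keep a random subset of k-space columns (plus the low-frequency
    centre) and return the zero-filled reconstruction."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    n_cols = image.shape[1]
    mask = rng.random(n_cols) < keep_fraction
    centre = n_cols // 2
    half = int(center_fraction * n_cols / 2)
    mask[centre - half:centre + half] = True          # always keep the centre
    undersampled = kspace * mask[None, :]
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))
    return zero_filled, mask

image = np.random.default_rng(1).random((128, 128))   # stand-in for a T1 slice
zero_filled, mask = undersample_kspace(image)
print("acceleration ~", image.shape[1] / mask.sum())
```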

SkeletonGait: Gait Recognition Using Skeleton Maps

  • paper_url: http://arxiv.org/abs/2311.13444
  • repo_url: https://github.com/shiqiyu/opengait
  • paper_authors: Chao Fan, Jingzhe Ma, Dongyang Jin, Chuanfu Shen, Shiqi Yu
  • for: To propose a new skeletal gait representation and skeleton-based methods that improve the performance of deep gait recognition.
  • methods: Introduces the skeleton map representation and a skeleton-based method, SkeletonGait, together with a multi-branch architecture, SkeletonGait++, that exploits complementary features from skeletons and silhouettes.
  • results: SkeletonGait++ achieves significantly higher accuracy across various scenarios, e.g., a rank-1 accuracy of over 85% on the challenging GREW dataset.
    Abstract The choice of the representations is essential for deep gait recognition methods. The binary silhouettes and skeletal coordinates are two dominant representations in recent literature, achieving remarkable advances in many scenarios. However, inherent challenges remain, in which silhouettes are not always guaranteed in unconstrained scenes, and structural cues have not been fully utilized from skeletons. In this paper, we introduce a novel skeletal gait representation named Skeleton Map, together with SkeletonGait, a skeleton-based method to exploit structural information from human skeleton maps. Specifically, the skeleton map represents the coordinates of human joints as a heatmap with Gaussian approximation, exhibiting a silhouette-like image devoid of exact body structure. Beyond achieving state-of-the-art performances over five popular gait datasets, more importantly, SkeletonGait uncovers novel insights about how important structural features are in describing gait and when do they play a role. Furthermore, we propose a multi-branch architecture, named SkeletonGait++, to make use of complementary features from both skeletons and silhouettes. Experiments indicate that SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin in various scenarios. For instance, it achieves an impressive rank-1 accuracy of over $85\%$ on the challenging GREW dataset. All the source code will be available at https://github.com/ShiqiYu/OpenGait.
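The skeleton map encodes joint coordinates as a Gaussian heatmap, yielding a silhouette-like image without exact body structure. A minimal rendering sketch, with the joint format, image size, and sigma assumed for illustration:

```python
import numpy as np

def skeleton_map(joints, height=64, width=44, sigma=2.0):
    """Render 2D joint coordinates (J, 2) as a single-channel heatmap by
    summing a Gaussian bump centred on every joint."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in joints:
        heatmap += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return np.clip(heatmap, 0.0, 1.0)   # silhouette-like, no exact body shape

# Toy 17-joint pose in pixel coordinates (illustrative, not a real dataset format).
rng = np.random.default_rng(0)
joints = np.stack([rng.uniform(5, 39, 17), rng.uniform(5, 59, 17)], axis=1)
print(skeleton_map(joints).shape)
```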

CompenHR: Efficient Full Compensation for High-resolution Projector

  • paper_url: http://arxiv.org/abs/2311.13409
  • repo_url: https://github.com/cyxwang/compenhr
  • paper_authors: Yuxi Wang, Haibin Ling, Bingyao Huang
  • for: This paper proposes a practical full compensation solution for high-resolution projector-camera systems, addressing the long training time and high memory cost of state-of-the-art methods.
  • methods: The proposed method includes an attention-based grid refinement network to improve geometric correction quality, an end-to-end compensation network with a novel sampling scheme and attention blocks to reduce computation while preserving key features, and a benchmark dataset for high-resolution projector full compensation.
  • results: In experiments, the proposed method demonstrates clear advantages over state-of-the-art methods in both efficiency and quality.
    Abstract Full projector compensation is a practical task of projector-camera systems. It aims to find a projector input image, named compensation image, such that when projected it cancels the geometric and photometric distortions due to the physical environment and hardware. State-of-the-art methods use deep learning to address this problem and show promising performance for low-resolution setups. However, directly applying deep learning to high-resolution setups is impractical due to the long training time and high memory cost. To address this issue, this paper proposes a practical full compensation solution. Firstly, we design an attention-based grid refinement network to improve geometric correction quality. Secondly, we integrate a novel sampling scheme into an end-to-end compensation network to alleviate computation and introduce attention blocks to preserve key features. Finally, we construct a benchmark dataset for high-resolution projector full compensation. In experiments, our method demonstrates clear advantages in both efficiency and quality.

Animatable 3D Gaussians for High-fidelity Synthesis of Human Motions

  • paper_url: http://arxiv.org/abs/2311.13404
  • repo_url: None
  • paper_authors: Keyang Ye, Tianjia Shao, Kun Zhou
  • for: Synthesizing high-fidelity free-view human motions in real time.
  • methods: Novel animatable 3D Gaussian model with learnable code and alpha loss for refining appearance, and joint optimization of human joint parameters.
  • results: Superior performance over NeRF-based methods, with the ability to synthesize new human motions in real time (66 fps on average).
    Abstract We present a novel animatable 3D Gaussian model for rendering high-fidelity free-view human motions in real time. Compared to existing NeRF-based methods, the model owns better capability in synthesizing high-frequency details without the jittering problem across video frames. The core of our model is a novel augmented 3D Gaussian representation, which attaches each Gaussian with a learnable code. The learnable code serves as a pose-dependent appearance embedding for refining the erroneous appearance caused by geometric transformation of Gaussians, based on which an appearance refinement model is learned to produce residual Gaussian properties to match the appearance in target pose. To force the Gaussians to learn the foreground human only without background interference, we further design a novel alpha loss to explicitly constrain the Gaussians within the human body. We also propose to jointly optimize the human joint parameters to improve the appearance accuracy. The animatable 3D Gaussian model can be learned with shallow MLPs, so new human motions can be synthesized in real time (66 fps on avarage). Experiments show that our model has superior performance over NeRF-based methods.

Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images

  • paper_url: http://arxiv.org/abs/2311.13398
  • repo_url: None
  • paper_authors: Jaeyoung Chung, Jeongtaek Oh, Kyoung Mu Lee
  • for: Optimizing Gaussian splatting from a limited number of images while avoiding overfitting.
  • methods: Introduces a dense depth map as a geometry guide, obtained from a pre-trained monocular depth estimator and aligned in scale and offset using sparse COLMAP feature points, to regularize the color-based optimization.
  • results: On the NeRF-LLFF dataset with varying numbers of few-shot images, the proposed method shows more robust geometry and fewer floating artifacts than the original method that relies solely on images.
    Abstract In this paper, we present a method to optimize Gaussian splatting with a limited number of images while avoiding overfitting. Representing a 3D scene by combining numerous Gaussian splats has yielded outstanding visual quality. However, it tends to overfit the training views when only a small number of images are available. To address this issue, we introduce a dense depth map as a geometry guide to mitigate overfitting. We obtained the depth map using a pre-trained monocular depth estimation model and aligning the scale and offset using sparse COLMAP feature points. The adjusted depth aids in the color-based optimization of 3D Gaussian splatting, mitigating floating artifacts, and ensuring adherence to geometric constraints. We verify the proposed method on the NeRF-LLFF dataset with varying numbers of few images. Our approach demonstrates robust geometry compared to the original method that relies solely on images.
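The depth guide is produced by estimating monocular depth and fitting a global scale and offset to the sparse COLMAP depths before using it to regularise the Gaussian optimisation. A least-squares sketch of that alignment step (array layouts assumed):

```python
import numpy as np

def align_depth(mono_depth, sparse_uv, sparse_depth):
    """Solve scale s and offset b minimising ||s * d_mono + b - d_sparse||^2
    at the sparse COLMAP feature points, then apply them to the full map."""
    d_mono = mono_depth[sparse_uv[:, 1], sparse_uv[:, 0]]
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return s * mono_depth + b

rng = np.random.default_rng(0)
mono = rng.random((480, 640))                     # relative monocular depth map
uv = np.stack([rng.integers(0, 640, 200), rng.integers(0, 480, 200)], axis=1)
gt = 3.0 * mono[uv[:, 1], uv[:, 0]] + 0.5 + 0.01 * rng.standard_normal(200)
aligned = align_depth(mono, uv, gt)
print(aligned.shape)
```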
    摘要 在这篇论文中,我们提出了一种在图像数量有限的情况下优化3D高斯泼溅(Gaussian splatting)并避免过拟合的方法。通过组合大量高斯泼溅来表示3D场景已经取得了出色的视觉质量,但当可用图像很少时,它往往会过拟合训练视角。为了解决这个问题,我们引入密集深度图作为几何引导来缓解过拟合。我们使用预训练的单目深度估计模型获得深度图,并利用稀疏的COLMAP特征点对齐其尺度和偏移。调整后的深度图辅助3D高斯泼溅的基于颜色的优化,减少漂浮伪影,并保证几何约束得到遵守。我们在NeRF-LLFF数据集上以不同数量的少量图像验证了所提方法。与仅依赖图像的原始方法相比,我们的方法表现出更稳健的几何结构。
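A small, hedged sketch of the scale-and-offset alignment step described above: given monocular depth predictions sampled at sparse COLMAP feature locations and the corresponding COLMAP depths, a least-squares fit recovers a global scale and offset. The function name and the plain least-squares choice are assumptions; the paper may use a more robust fitting scheme.

```python
import numpy as np

def align_depth(mono_depth: np.ndarray, sparse_depth: np.ndarray) -> tuple[float, float]:
    """Fit scale s and offset t so that s * mono_depth + t ~= sparse_depth (least squares).

    mono_depth:   (N,) monocular depth predictions sampled at sparse feature locations.
    sparse_depth: (N,) metric depths of the corresponding COLMAP points.
    """
    A = np.stack([mono_depth, np.ones_like(mono_depth)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return float(s), float(t)

# toy example: recover a known scale/offset from noisy samples
rng = np.random.default_rng(0)
d_mono = rng.uniform(0.1, 1.0, size=200)
d_sparse = 3.2 * d_mono + 0.5 + rng.normal(0, 0.01, size=200)
s, t = align_depth(d_mono, d_sparse)
print(s, t)  # close to 3.2 and 0.5
```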

SegVol: Universal and Interactive Volumetric Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13385
  • repo_url: https://github.com/baai-dcai/segvol
  • paper_authors: Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao
  • for: 这个论文旨在提供一个基础的医疗影像分割模型,以便为临床研究提供准确且结构化的信息。
  • methods: 提出了名为SegVol的基础医疗影像分割模型,通过训练90000个无标注的计算机断层扫描(CT)体数据和6000个有标注CT,支持基于语义和空间提示分割200多个解剖学类别。
  • results: 实验结果显示,SegVol在多个分割benchmark上大幅超越此前最佳模型;特别是在三个具有挑战性的病灶数据集上,我们的方法的Dice分数比nnU-Net高出约20%。
    Abstract Precise image segmentation provides clinical study with meaningful and well-structured information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a universal and interactive volumetric medical image segmentation model, named SegVol. By training on 90k unlabeled Computed Tomography (CT) volumes and 6k labeled CTs, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. Extensive experiments verify that SegVol outperforms the state of the art by a large margin on multiple segmentation benchmarks. Notably, on three challenging lesion datasets, our method achieves around 20% higher Dice score than nnU-Net. The model and data are publicly available at: https://github.com/BAAI-DCAI/SegVol.
    摘要 精准的图像分割能够为临床研究提供有意义且结构化的信息。尽管医疗图像分割已取得显著进展,但仍然缺乏一个能够在简单的用户交互下分割广泛解剖类别的基础分割模型。在这篇论文中,我们提出了一种通用且可交互的三维医疗图像分割模型,名为SegVol。通过在90000个未标注CT体数据和6000个有标注CT上训练,该基础模型支持利用语义与空间提示分割200多个解剖类别。大量实验证明,SegVol在多个分割benchmark上大幅超越现有最佳方法;特别是在三个具有挑战性的病灶数据集上,我们的方法的Dice分数比nnU-Net高出约20%。模型和数据公开于:https://github.com/BAAI-DCAI/SegVol。

LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes

  • paper_url: http://arxiv.org/abs/2311.13384
  • repo_url: https://github.com/luciddreamer-cvlab/luciddreamer-cvlab.github.io
  • paper_authors: Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, Kyoung Mu Lee
  • for: 提高3D场景生成技术的灵活性和可靠性,使其能够生成具有高细节和多视角一致性的3D场景,无需受限于特定领域或数据集。
  • methods: 提出了一种域外场景生成管线,名为LucidDreamer,具有两个步骤:梦境和协调。在梦境阶段,通过使用扩散型生成模型,从输入点云中生成多视角一致的图像。在协调阶段,通过一种协调算法,将新生成的3D场景积成为一个完整的3D场景。
  • results: LucidDreamer可以生成高细节和多视角一致的3D场景,无需受限于特定领域或数据集。与先前的3D场景生成方法相比,LucidDreamer的结果更加细节和可靠。项目页面:https://luciddreamer-cvlab.github.io/
    Abstract With the widespread usage of VR devices and contents, demands for 3D scene generation techniques become more popular. Existing 3D scene generation models, however, limit the target scene to specific domain, primarily due to their training strategies using 3D scan dataset that is far from the real-world. To address such limitation, we propose LucidDreamer, a domain-free scene generation pipeline by fully leveraging the power of existing large-scale diffusion-based generative model. Our LucidDreamer has two alternate steps: Dreaming and Alignment. First, to generate multi-view consistent images from inputs, we set the point cloud as a geometrical guideline for each image generation. Specifically, we project a portion of point cloud to the desired view and provide the projection as a guidance for inpainting using the generative model. The inpainted images are lifted to 3D space with estimated depth maps, composing a new points. Second, to aggregate the new points into the 3D scene, we propose an aligning algorithm which harmoniously integrates the portions of newly generated 3D scenes. The finally obtained 3D scene serves as initial points for optimizing Gaussian splats. LucidDreamer produces Gaussian splats that are highly-detailed compared to the previous 3D scene generation methods, with no constraint on domain of the target scene. Project page: https://luciddreamer-cvlab.github.io/
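To make the Dreaming step concrete, here is a hedged numpy sketch of projecting a colored point cloud into a target view and turning the result into a sparse guidance image plus an inpainting mask for the diffusion model. All names are illustrative, the camera convention is a standard pinhole world-to-camera model, and details such as occlusion-aware splatting or anti-aliasing used in the actual pipeline are omitted.

```python
import numpy as np

def project_points(points_w, K, R, t, height, width):
    """Project world-space points into a target view with a pinhole camera.

    points_w: (N, 3) world coordinates; K: (3, 3) intrinsics;
    R, t: world-to-camera rotation (3, 3) and translation (3,).
    Returns pixel coords, depths, and a validity mask for in-front, in-bounds points.
    """
    p_cam = points_w @ R.T + t                    # world -> camera
    z = p_cam[:, 2]
    p_img = p_cam @ K.T                           # camera -> homogeneous image coords
    uv = p_img[:, :2] / p_img[:, 2:3]
    valid = (z > 1e-6) & (uv[:, 0] >= 0) & (uv[:, 0] < width) \
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return uv, z, valid

def splat_guidance(points_w, colors, K, R, t, height, width):
    """Rasterize projected points into a sparse guidance image and an inpainting mask."""
    uv, z, valid = project_points(points_w, K, R, t, height, width)
    image = np.zeros((height, width, 3), dtype=np.float32)
    known = np.zeros((height, width), dtype=bool)
    order = np.argsort(-z[valid])                 # draw far points first so near points overwrite them
    idx = np.flatnonzero(valid)[order]
    u = uv[idx, 0].astype(int)
    v = uv[idx, 1].astype(int)
    image[v, u] = colors[idx]
    known[v, u] = True
    return image, ~known                          # mask of pixels the diffusion model should inpaint

# toy usage: random points in front of an identity camera
rng = np.random.default_rng(0)
pts = rng.uniform([-1, -1, 2], [1, 1, 4], size=(500, 3))
cols = rng.uniform(0, 1, size=(500, 3))
K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])
guidance, inpaint_mask = splat_guidance(pts, cols, K, np.eye(3), np.zeros(3), 128, 128)
print(guidance.shape, inpaint_mask.mean())
```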

Point Projection Mapping System for Tracking, Registering, Labeling and Validating Optical Tissue Measurements

  • paper_url: http://arxiv.org/abs/2311.13378
  • repo_url: None
  • paper_authors: Lianne Feenstra, Stefan D. van der Stel, Marcos Da Silva Guimaraes, Theo J. M Ruers, Behdad Dashtbozorg
  • for: 这个论文是为了验证新发展的光学组织检测技术,以便在肿瘤手术中进行精准的肿瘤检测。
  • methods: 这个论文使用的方法包括非破坏性跟踪系统Point Projection Mapping,以及一种有效的注册、验证和标注方法,与 histopathology 结果进行比较。
  • results: 这个论文的结果表明,使用这种新的方法可以更加准确地跟踪和验证光学组织检测技术的表现,相比传统的方法更加高效和可靠。
    Abstract Validation of newly developed optical tissue sensing techniques for tumor detection during cancer surgery requires an accurate correlation with histological results. Additionally, such accurate correlation facilitates precise data labeling for developing high-performance machine-learning tissue classification models. In this paper, a newly developed Point Projection Mapping system will be introduced, which allows non-destructive tracking of the measurement locations on tissue specimens. Additionally, a framework for accurate registration, validation, and labeling with histopathology results is proposed and validated on a case study. The proposed framework provides a more robust and accurate method for tracking and validation of optical tissue sensing techniques, which saves time and resources compared to conventional techniques available.
    摘要 新开发的光学组织检测技术在肿瘤手术中的验证,需要与组织病理学结果建立准确的对应。此外,准确的对应还有助于为高性能的机器学习组织分类模型提供精确的数据标注。本文将介绍一种新开发的点投影映射系统,可以对组织标本上的测量位置进行非破坏性跟踪。同时,本文提出了一种与组织病理学结果进行准确配准、验证和标注的框架,并在一个案例研究中得到验证。与现有的传统方法相比,该框架为光学组织检测技术的跟踪与验证提供了更稳健、更准确的方法,并节省了时间和资源。

MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space

  • paper_url: http://arxiv.org/abs/2311.13372
  • repo_url: None
  • paper_authors: Xiuwen Wu, Rongjie Hu, Jie Liang, Yanming Wang, Bensheng Qiu, Xiaoxiao Wang
  • for: 这个论文的目的是提出一种基于深度学习的眼动预测方法,以便从fMRI数据中预测眼动点。
  • methods: 这个方法使用了一个名为MRGazer的框架,该框架包括眼球提取模块和基于残差网络的注视预测模块。与之前的方法相比,该方法省去了fMRI配准步骤,简化了处理流程,并实现了端到端注视预测。
  • results: 该方法在多种眼动任务中表现优于此前基于配准的方法,并能在更短的时间内给出客观结果(每个volume约0.02秒,此前方法约0.3秒)。
    Abstract Eye-tracking research has proven valuable in understanding numerous cognitive functions. Recently, Frey et al. provided an exciting deep learning method for learning eye movements from fMRI data. However, it needed to co-register fMRI into standard space to obtain eyeballs masks, and thus required additional templates and was time consuming. To resolve this issue, in this paper, we propose a framework named MRGazer for predicting eye gaze points from fMRI in individual space. The MRGazer consisted of eyeballs extraction module and a residual network-based eye gaze prediction. Compared to the previous method, the proposed framework skips the fMRI co-registration step, simplifies the processing protocol and achieves end-to-end eye gaze regression. The proposed method achieved superior performance in a variety of eye movement tasks than the co-registration-based method, and delivered objective results within a shorter time (~ 0.02 Seconds for each volume) than prior method (~0.3 Seconds for each volume).
    摘要 眼动追踪研究已被证明对理解多种认知功能具有重要价值。最近,Frey等人提出了一种从fMRI数据中学习眼动的深度学习方法。然而,该方法需要将fMRI配准到标准空间以获得眼球掩码,因此需要额外的模板并且耗时。为了解决这个问题,本文提出了一个名为MRGazer的框架,可在个体空间中直接从fMRI预测注视点。MRGazer由眼球提取模块和基于残差网络的注视预测模块组成。与之前的方法相比,该框架省去了fMRI配准步骤,简化了处理流程,并实现了端到端的注视回归。所提方法在多种眼动任务中取得了优于基于配准方法的性能,并能在更短的时间内(每个volume约0.02秒,此前方法约0.3秒)给出客观结果。

Unified Classification and Rejection: A One-versus-All Framework

  • paper_url: http://arxiv.org/abs/2311.13355
  • repo_url: None
  • paper_authors: Zhen Cheng, Xu-Yao Zhang, Cheng-Lin Liu
  • for: 本研究旨在建立一个统一的框架,用于实现开集类别和 unknown input 拒绝(OOD)识别任务。
  • methods: 本研究使用一个统一的框架,将 $K$ 个已知类别的开集识别转换为一个 $(K+1)$ 类别的分类问题,并使用混合训练策略,结合 OVA 损失和多类别交叉熵损失,以维持闭集分类准确性。
  • results: 实验结果显示,提案的框架在两个受测集上具有竞争性的表现,包括关闭集类别准确性、OOD检测和误分检测。
    Abstract Classifying patterns of known classes and rejecting ambiguous and novel (also called as out-of-distribution (OOD)) inputs are involved in open world pattern recognition. Deep neural network models usually excel in closed-set classification while performing poorly in rejecting OOD. To tackle this problem, numerous methods have been designed to perform open set recognition (OSR) or OOD rejection/detection tasks. Previous methods mostly take post-training score transformation or hybrid models to ensure low scores on OOD inputs while separating known classes. In this paper, we attempt to build a unified framework for building open set classifiers for both classification and OOD rejection. We formulate the open set recognition of $ K $-known-class as a $ (K + 1) $-class classification problem with model trained on known-class samples only. By decomposing the $ K $-class problem into $ K $ one-versus-all (OVA) binary classification tasks and binding some parameters, we show that combining the scores of OVA classifiers can give $ (K + 1) $-class posterior probabilities, which enables classification and OOD rejection in a unified framework. To maintain the closed-set classification accuracy of the OVA trained classifier, we propose a hybrid training strategy combining OVA loss and multi-class cross-entropy loss. We implement the OVA framework and hybrid training strategy on the recently proposed convolutional prototype network. Experiments on popular OSR and OOD detection datasets demonstrate that the proposed framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.
    摘要 开放世界模式识别需要同时对已知类别进行分类,并拒绝模糊或新颖(即分布外,OOD)的输入。深度神经网络通常在闭集分类上表现出色,但在拒绝OOD输入方面表现不佳。为了解决这个问题,已有许多方法被设计用于开集识别(OSR)或OOD拒绝/检测任务。先前的方法大多采用训练后的分数变换或混合模型,以在区分已知类别的同时保证OOD输入获得较低的分数。在这篇论文中,我们尝试建立一个统一的框架,同时完成分类与OOD拒绝。我们将 $K$ 个已知类别的开集识别表述为一个 $(K+1)$ 类分类问题,模型仅在已知类别样本上训练。通过将 $K$ 类问题分解为 $K$ 个一对多(OVA)二分类任务并绑定部分参数,我们证明组合各OVA分类器的分数可以给出 $(K+1)$ 类后验概率,从而在统一框架中实现分类与OOD拒绝。为保持OVA训练分类器的闭集分类精度,我们提出了一种结合OVA损失与多类交叉熵损失的混合训练策略。我们在最近提出的卷积原型网络上实现了OVA框架和混合训练策略。在常用的OSR和OOD检测数据集上的实验表明,所提框架仅使用单个多类分类器,即可在闭集分类、OOD检测和误分类检测上取得具有竞争力的性能。
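The paper states that OVA scores can be combined into (K+1)-class posteriors; the exact combination rule is not reproduced here, so the sketch below shows one plausible instantiation under an independence assumption: class k corresponds to "head k accepts while all other heads reject", and the extra class corresponds to "every head rejects" (OOD). Treat this as an illustration of the unified classification-and-rejection idea rather than the authors' formula.

```python
import numpy as np

def ova_to_k_plus_1_posteriors(logits: np.ndarray) -> np.ndarray:
    """Turn K one-versus-all logits into (K+1)-class probabilities.

    logits: (K,) raw OVA scores for one sample. Class k's posterior is proportional to
    "head k accepts, all others reject"; the extra (K+1)-th slot corresponds to
    "every OVA head rejects", i.e. an out-of-distribution input.
    """
    p = 1.0 / (1.0 + np.exp(-logits))             # per-class acceptance probabilities
    log_p = np.log(p + 1e-12)
    log_q = np.log(1.0 - p + 1e-12)               # per-class rejection probabilities
    scores = np.empty(len(logits) + 1)
    for k in range(len(logits)):
        scores[k] = log_p[k] + log_q.sum() - log_q[k]   # class k accepts, others reject
    scores[-1] = log_q.sum()                            # all heads reject -> OOD slot
    scores -= scores.max()                               # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

print(ova_to_k_plus_1_posteriors(np.array([4.0, -3.0, -2.5])))   # confident known class 0
print(ova_to_k_plus_1_posteriors(np.array([-4.0, -3.0, -2.5])))  # mass shifts to the OOD slot
```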

High-Quality Face Caricature via Style Translation

  • paper_url: http://arxiv.org/abs/2311.13338
  • repo_url: None
  • paper_authors: Lamyanba Laishram, Muhammad Shaheryar, Jong Taek Lee, Soon Ki Jung
    for:* 这个论文的目的是提出一种高质量、无监督的人脸塑型方法,适用于实际应用场景。methods:* 该方法使用计算机视觉技术和GAN模型,通过两步过程来实现人脸塑型和面部特征强调:面部塑型生成和面部塑型投影。results:* 该方法可以很好地强调人脸特征和面部特征,同时保持人脸属性、表情和特征的一致性。* 与其他现有的面部塑型方法相比,该方法具有更高的真实感和更好的可视化性。
    Abstract Caricature is an exaggerated form of artistic portraiture that accentuates unique yet subtle characteristics of human faces. Recently, advancements in deep end-to-end techniques have yielded encouraging outcomes in capturing both style and elevated exaggerations in creating face caricatures. Most of these approaches tend to produce cartoon-like results that could be more practical for real-world applications. In this study, we proposed a high-quality, unpaired face caricature method that is appropriate for use in the real world and uses computer vision techniques and GAN models. We attain the exaggeration of facial features and the stylization of appearance through a two-step process: Face caricature generation and face caricature projection. The face caricature generation step creates new caricature face datasets from real images and trains a generative model using the real and newly created caricature datasets. The Face caricature projection employs an encoder trained with real and caricature faces with the pretrained generator to project real and caricature faces. We perform an incremental facial exaggeration from the real image to the caricature faces using the encoder and generator's latent space. Our projection preserves the facial identity, attributes, and expressions from the input image. Also, it accounts for facial occlusions, such as reading glasses or sunglasses, to enhance the robustness of our model. Furthermore, we conducted a comprehensive comparison of our approach with various state-of-the-art face caricature methods, highlighting our process's distinctiveness and exceptional realism.
    摘要 漫画肖像是一种夸张的艺术肖像形式,强调人脸中独特而微妙的特征。近年来,深度端到端技术的进步在捕捉风格与夸张程度方面取得了令人鼓舞的成果。然而,这些方法大多生成偏卡通化的结果,在实际应用中实用性有限。在这项研究中,我们提出了一种适用于真实场景的高质量、无配对人脸漫画化方法,结合计算机视觉技术与GAN模型。我们通过两步流程实现面部特征的夸张与外观的风格化:人脸漫画生成和人脸漫画投影。人脸漫画生成阶段从真实图像构建新的漫画人脸数据集,并用真实与新建的漫画数据集训练生成模型。人脸漫画投影阶段利用以真实与漫画人脸训练的编码器,配合预训练的生成器投影真实与漫画人脸。我们利用编码器与生成器的潜空间,实现从真实图像到漫画人脸的渐进式夸张。我们的投影保留了输入图像的人脸身份、属性与表情,并考虑了眼镜、墨镜等面部遮挡,从而增强模型的鲁棒性。此外,我们将所提方法与多种最新的人脸漫画化方法进行了全面比较,突显了我们流程的独特性与出色的真实感。

Revisiting Supervision for Continual Representation Learning

  • paper_url: http://arxiv.org/abs/2311.13321
  • repo_url: None
  • paper_authors: Daniel Marczak, Sebastian Cygert, Tomasz Trzciński, Bartłomiej Twardowski
  • for: 这个论文主要是为了研究 continual learning 中的模型学习方法。
  • methods: 这篇论文比较了自监督和有监督两种方式下的持续表征学习。
  • results: 研究发现,配备多层感知机投影头的有监督模型在持续表征学习中可以超越自监督模型的表现。
    Abstract In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, recent studies have highlighted the strengths of self-supervised continual representation learning. The improved transferability of representations built with self-supervised methods is often associated with the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We reckon that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning.
    摘要 在持续学习领域中,模型被设计为逐个学习一系列任务。尽管大多数研究集中在有监督的持续学习上,但最近的研究强调了自监督持续表征学习的优势。用自监督方法学到的表征具有更好的可迁移性,这通常被归因于多层感知机投影头的作用。在这项工作中,我们从这一观察出发,重新审视监督信息在持续表征学习中的作用。我们认为,额外的信息(如人工标注)不应当降低表征的质量。我们的研究结果表明,配备多层感知机投影头的有监督模型,在持续表征学习中可以超越自监督模型。

Deep Learning for Vascular Segmentation and Applications in Phase Contrast Tomography Imaging

  • paper_url: http://arxiv.org/abs/2311.13319
  • repo_url: None
  • paper_authors: Ekin Yagis, Shahab Aslani, Yashvardhan Jain, Yang Zhou, Shahrokh Rahmani, Joseph Brunet, Alexandre Bellier, Christopher Werlein, Maximilian Ackermann, Danny Jonigk, Paul Tafforeau, Peter D Lee, Claire Walsh
  • for: 这篇论文的目的是提供一份关于自动血管分割的广泛文献综述,以提供基础知识并确定一种可靠的基线模型,以应用于新的成像Modalities中的血管分割。
  • methods: 本文使用了多种机器学习技术,包括nnU Net模型,以进行血管分割。
  • results: 研究发现,虽然 segmentation 得分 relativity high(例如 clDice 值在0.82-0.88之间),但还存在一些错误,如大血管的分割不佳,由于HiP CT是一种外部技术,缺少水压导致血管压缩,以及细血管内的连接性下降和边界分割错误。
    Abstract Automated blood vessel segmentation is vital for biomedical imaging, as vessel changes indicate many pathologies. Still, precise segmentation is difficult due to the complexity of vascular structures, anatomical variations across patients, the scarcity of annotated public datasets, and the quality of images. We present a thorough literature review, highlighting the state of machine learning techniques across diverse organs. Our goal is to provide a foundation on the topic and identify a robust baseline model for application to vascular segmentation in a new imaging modality, Hierarchical Phase Contrast Tomography (HiP CT). Introduced in 2020 at the European Synchrotron Radiation Facility, HiP CT enables 3D imaging of complete organs at an unprecedented resolution of ca. 20mm per voxel, with the capability for localized zooms in selected regions down to 1mm per voxel without sectioning. We have created a training dataset with double annotator validated vascular data from three kidneys imaged with HiP CT in the context of the Human Organ Atlas Project. Finally, utilising the nnU Net model, we conduct experiments to assess the models performance on both familiar and unseen samples, employing vessel specific metrics. Our results show that while segmentations yielded reasonably high scores such as clDice values ranging from 0.82 to 0.88, certain errors persisted. Large vessels that collapsed due to the lack of hydrostatic pressure (HiP CT is an ex vivo technique) were segmented poorly. Moreover, decreased connectivity in finer vessels and higher segmentation errors at vessel boundaries were observed. Such errors obstruct the understanding of the structures by interrupting vascular tree connectivity. Through our review and outputs, we aim to set a benchmark for subsequent model evaluations using various modalities, especially with the HiP CT imaging database.
    摘要 自动化血管分割对生物医学成像至关重要,因为血管的变化提示着多种病变。然而,由于血管结构复杂、患者之间解剖差异大、带标注的公开数据集稀缺以及图像质量参差,精确分割仍然困难。我们对相关文献进行了全面综述,梳理了机器学习技术在不同器官上的应用现状。我们的目标是为该主题奠定基础,并确定一个稳健的基线模型,用于一种新的成像方式——层级相位衬度断层扫描(HiP CT)中的血管分割。HiP CT于2020年在欧洲同步辐射装置推出,能够以约20mm/voxel的分辨率对完整器官进行三维成像,并可在选定区域局部放大至1mm/voxel而无需切片。我们在人体器官图谱(Human Organ Atlas)项目中,基于HiP CT成像的三个肾脏,构建了经双人标注验证的血管训练数据集。最后,我们使用nnU-Net模型开展实验,采用针对血管的指标评估模型在熟悉与未见样本上的表现。结果显示,尽管分割取得了较高的分数(如clDice值在0.82到0.88之间),仍存在一些错误:由于HiP CT是离体技术、缺乏流体静压,塌陷的大血管分割较差;此外还观察到细血管连接性下降以及血管边界处更高的分割误差。这些错误会打断血管树的连通性,妨碍对结构的理解。通过本综述及相关成果,我们希望为后续基于多种成像方式(尤其是HiP CT成像数据库)的模型评估建立一个基准。
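Since clDice is the headline metric in this evaluation, the following is a small sketch of how it is typically computed for binary vessel masks (here in 2D for brevity; the HiP CT data is volumetric). It follows the standard centerline-Dice definition built on mask skeletons; the toy example shows how a connectivity break lowers the score.

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Centerline Dice (clDice) for binary 2D vessel masks.

    Compares each mask's skeleton against the other mask's full volume, so the score
    rewards preserved vessel connectivity rather than raw overlap.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    skel_pred = skeletonize(pred)
    skel_gt = skeletonize(gt)
    tprec = (skel_pred & gt).sum() / (skel_pred.sum() + eps)   # topology precision
    tsens = (skel_gt & pred).sum() / (skel_gt.sum() + eps)     # topology sensitivity
    return float(2 * tprec * tsens / (tprec + tsens + eps))

# toy example: a straight vessel predicted with a small break in the middle
gt = np.zeros((32, 32), dtype=bool)
gt[16, 4:28] = True
pred = gt.copy()
pred[16, 15:17] = False        # break the vessel -> the connectivity error lowers clDice
print(cl_dice(pred, gt))
```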

Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.13317
  • repo_url: https://github.com/shercoo/RGDiffSR
  • paper_authors: Yuxuan Zhou, Liangcai Gao, Zhi Tang, Baole Wei
  • for: 提高Scene Text Recognition(STR)的精度和可读性,使用低分辨率(LR)图像中的文本进行提高。
  • methods: 使用Recognition-Guided Diffusion模型,以及Recognition-Guided Denoising Network进行图像提高和降噪处理。
  • results: 在TextZoom dataset上,RGDiffSR比前一代方法更高的提高文本识别精度和图像质量。
    Abstract Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images, consequently elevating recognition accuracy in Scene Text Recognition (STR). Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance to address this issue. Nevertheless, they remain deficient when confronted with severely blurred images, due to their insufficient generation capability when little structural or semantic information can be extracted from original images. Therefore, we introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising Network, to guide the diffusion model generating LR-consistent results through succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the superiority of RGDiffSR over prior state-of-the-art methods in both text recognition accuracy and image fidelity.
    摘要

Retargeting Visual Data with Deformation Fields

  • paper_url: http://arxiv.org/abs/2311.13297
  • repo_url: None
  • paper_authors: Tim Elsner, Julia Berger, Tong Wu, Victor Czech, Lin Gao, Leif Kobbelt
  • for: 本研究旨在推广现有的图像编辑方法,以便在更广泛的视觉数据格式和自由度下进行编辑。
  • methods: 本研究使用 neural network 学习一种可以在低信息含量区域进行塑形的方法,以实现更好的内容意识扩展和修改。
  • results: 实验结果表明,本方法可以在不同的视觉数据上实现更好的内容意识扩展和修改,比之前的方法更高效。
    Abstract Seam carving is an image editing method that enable content-aware resizing, including operations like removing objects. However, the seam-finding strategy based on dynamic programming or graph-cut limits its applications to broader visual data formats and degrees of freedom for editing. Our observation is that describing the editing and retargeting of images more generally by a displacement field yields a generalisation of content-aware deformations. We propose to learn a deformation with a neural network that keeps the output plausible while trying to deform it only in places with low information content. This technique applies to different kinds of visual data, including images, 3D scenes given as neural radiance fields, or even polygon meshes. Experiments conducted on different visual data show that our method achieves better content-aware retargeting compared to previous methods.
    摘要 Seam carving是一种支持内容感知缩放(包括移除物体等操作)的图像编辑方法。然而,其基于动态规划或图割的seam搜索策略,限制了它在更广泛的视觉数据格式和编辑自由度上的应用。我们的观察是:用位移场来更一般地描述图像的编辑与重定向,可以得到内容感知形变的一种推广。我们提出用神经网络学习一种形变,使输出保持合理,同时尽量只在信息含量低的区域进行形变。该技术适用于多种视觉数据,包括图像、以神经辐射场表示的3D场景,甚至多边形网格。在不同视觉数据上的实验表明,与以往方法相比,我们的方法实现了更好的内容感知重定向。
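A minimal sketch of the core operation implied above: warping an image with a dense displacement field. The assumption here is a PyTorch-style normalized grid and bilinear resampling via grid_sample; the network that predicts the field, and its extension to radiance fields and meshes, are not shown.

```python
import torch
import torch.nn.functional as F

def warp_with_displacement(image: torch.Tensor, displacement: torch.Tensor) -> torch.Tensor:
    """Retarget an image by resampling it through a dense displacement field.

    image:        (B, C, H, W)
    displacement: (B, 2, H, W) offsets in normalized [-1, 1] coordinates (dx, dy);
                  a content-aware retargeting network would keep offsets small in
                  salient regions and concentrate deformation where information is low.
    """
    b, _, h, w = image.shape
    # identity sampling grid in normalized coordinates, shape (B, H, W, 2)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base_grid + displacement.permute(0, 2, 3, 1)   # add per-pixel offsets
    return F.grid_sample(image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# toy usage: a zero displacement field reproduces the input image
img = torch.rand(1, 3, 64, 64)
disp = torch.zeros(1, 2, 64, 64)
out = warp_with_displacement(img, disp)
print(torch.allclose(out, img, atol=1e-5))
```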

CMFDFormer: Transformer-based Copy-Move Forgery Detection with Continual Learning

  • paper_url: http://arxiv.org/abs/2311.13263
  • repo_url: None
  • paper_authors: Yaqi Liu, Chao Xia, Song Xiao, Qingxiao Guan, Wenqian Dong, Yifan Zhang, Nenghai Yu
  • for: 这项研究旨在探讨一种基于深度学习的复制移动(copy-move)伪造检测方法,以提高伪造检测的精度和效率。
  • methods: 本研究提出了一个基于Transformer的复制移动伪造检测网络,名为CMFDFormer,并提出了一个PCSD(Pooled Cube and Strip Distillation)持续学习框架,以帮助CMFDFormer处理新任务。
  • results: 实验结果显示,CMFDFormer在公开数据集上表现出色,并且PCSD持续学习框架能够在处理新任务时避免灾难性遗忘。
    Abstract Copy-move forgery detection aims at detecting duplicated regions in a suspected forged image, and deep learning based copy-move forgery detection methods are in the ascendant. These deep learning based methods heavily rely on synthetic training data, and the performance will degrade when facing new tasks. In this paper, we propose a Transformer-style copy-move forgery detection network named as CMFDFormer, and provide a novel PCSD (Pooled Cube and Strip Distillation) continual learning framework to help CMFDFormer handle new tasks. CMFDFormer consists of a MiT (Mix Transformer) backbone network and a PHD (Pluggable Hybrid Decoder) mask prediction network. The MiT backbone network is a Transformer-style network which is adopted on the basis of comprehensive analyses with CNN-style and MLP-style backbones. The PHD network is constructed based on self-correlation computation, hierarchical feature integration, a multi-scale cycle fully-connected block and a mask reconstruction block. The PHD network is applicable to feature extractors of different styles for hierarchical multi-scale information extraction, achieving comparable performance. Last but not least, we propose a PCSD continual learning framework to improve the forgery detectability and avoid catastrophic forgetting when handling new tasks. Our continual learning framework restricts intermediate features from the PHD network, and takes advantage of both cube pooling and strip pooling. Extensive experiments on publicly available datasets demonstrate the good performance of CMFDFormer and the effectiveness of the PCSD continual learning framework.

Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides

  • paper_url: http://arxiv.org/abs/2311.13261
  • repo_url: https://github.com/aican-research/breast-epithelium-segmentation
  • paper_authors: Maren Høibø, André Pedersen, Vibeke Grotnes Dale, Sissel Marie Berget, Borgny Ytterhus, Cecilia Lindskog, Elisabeth Wik, Lars A. Akslen, Ingerid Reinertsen, Erik Smistad, Marit Valla
  • for: 这个研究旨在开发一种人工智能模型,用于自动分类乳腺癌组织中的细胞。
  • methods: 研究人员使用卷积神经网络和数据增强技术训练模型,并通过将苏木精-伊红(HE)切片重新染色为细胞角蛋白(cytokeratin)AE1/AE3,结合病理学家的标注来生成上皮细胞的真实标注掩码。
  • results: 研究人员通过对 839 名乳腺癌患者的组织片和两名患者的整个染色片进行训练和评估,实现了对细胞类型的自动分类。模型在质量评估中取得了平均 dice 分数为 0.70、0.79 和 0.75,以及相关的四个类别的质量分数。
    Abstract Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort was used as a second test set. In quantitative evaluation, a mean Dice score of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, were achieved. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation
    摘要 数字病理学使得利用人工智能(AI)自动分析组织病理切片成为可能。自动评估可以提高诊断效率,并帮助发现形态学特征与临床结局之间的关联。要开发此类预测模型,第一步是识别浸润性上皮细胞,并将其与良性上皮细胞及原位病变区分开来。在本研究中,我们旨在开发一个用于乳腺癌切片中上皮细胞分割的AI模型。我们通过将苏木精-伊红(HE)切片重新染色为细胞角蛋白(CK)AE1/AE3,并结合病理学家的标注,生成了上皮细胞的真实标注掩码。HE/CK图像对被用于训练卷积神经网络,并通过数据增强提高模型的鲁棒性。来自839名患者的组织微阵列(TMA)以及两名患者的全切片图像被用于模型的训练与评估,这些切片来自四个乳腺癌患者队列;另有来自第五个队列的21名患者的TMA被用作第二个测试集。在定量评估中,浸润性上皮细胞、良性上皮细胞和原位病变的平均Dice分数分别为0.70、0.79和0.75。在病理学家的定性评分(0-5分)中,全部上皮与浸润性上皮的结果最好,分别为4.7分和4.4分;良性上皮和原位病变的评分分别为3.7分和2.0分。所提模型能较好地分割HE染色乳腺癌切片中的上皮细胞,但要准确区分各类别仍需进一步工作。免疫组化结合病理学家的标注,使构建准确的真实标注成为可能。该模型已在FastPathology中免费提供,代码见 https://github.com/AICAN-Research/breast-epithelium-segmentation。

Density Distribution-based Learning Framework for Addressing Online Continual Learning Challenges

  • paper_url: http://arxiv.org/abs/2311.13623
  • repo_url: None
  • paper_authors: Shilin Zhang, Jiahui Wang
  • for: 本研究旨在解决在线Continual Learning(CL)中的挑战,提出了基于概率分布的学习框架。CL特别是类增量学习,可以在单 passes的训练数据流中不断学习并适应新的测试分布,更加符合实际应用场景的需求。但是,现有的CL方法 oft suffer from catastrophic forgetting和更高的计算成本,限制其实际应用。我们的提议的框架可以超越这些限制,实现更高的均值准确率和时空效率, bridge the performance gap between CL和经典机器学习。
  • methods: 我们为每个CL任务采用独立的生成式核密度估计(GKDE)模型。在测试阶段,各GKDE通过其报告的最大概率密度值来决定由哪一个模型对输入的测试实例进行预测。基于GKDE的学习目标可以保证同一标签的样本聚集在一起,而不相似的实例被推得更远。
  • results: 我们的方法在多个CL数据集上进行了广泛的实验,并证明了我们的提议的框架的效果。与流行的CL方法相比,我们的方法可以达到更高的均值准确率,同时保持竞争的时空效率,使我们的框架适用于实际应用。
    Abstract In this paper, we address the challenges of online Continual Learning (CL) by introducing a density distribution-based learning framework. CL, especially the Class Incremental Learning, enables adaptation to new test distributions while continuously learning from a single-pass training data stream, which is more in line with the practical application requirements of real-world scenarios. However, existing CL methods often suffer from catastrophic forgetting and higher computing costs due to complex algorithm designs, limiting their practical use. Our proposed framework overcomes these limitations by achieving superior average accuracy and time-space efficiency, bridging the performance gap between CL and classical machine learning. Specifically, we adopt an independent Generative Kernel Density Estimation (GKDE) model for each CL task. During the testing stage, the GKDEs utilize a self-reported max probability density value to determine which one is responsible for predicting incoming test instances. A GKDE-based learning objective can ensure that samples with the same label are grouped together, while dissimilar instances are pushed farther apart. Extensive experiments conducted on multiple CL datasets validate the effectiveness of our proposed framework. Our method outperforms popular CL approaches by a significant margin, while maintaining competitive time-space efficiency, making our framework suitable for real-world applications. Code will be available at https://github.com/xxxx/xxxx.
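A hedged sketch of the density-based decision rule: one Gaussian KDE per class, fitted only on that class's (feature) samples, with prediction by the maximum self-reported density. This toy version uses scipy's gaussian_kde on raw 2D features; the paper additionally learns deep features with a GKDE-based objective, which is not reproduced here, and all names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

class GKDEContinualClassifier:
    """Keep one Gaussian KDE per class; predict by the highest reported density.

    Each KDE only ever sees its own class's samples, so adding a new class never
    touches previously fitted models, which is what sidesteps catastrophic
    forgetting in this style of approach.
    """

    def __init__(self):
        self.kdes = {}                      # label -> fitted gaussian_kde

    def add_class(self, label, features: np.ndarray):
        # features: (num_samples, feature_dim); gaussian_kde expects (dim, num_samples)
        self.kdes[label] = gaussian_kde(features.T)

    def predict(self, features: np.ndarray) -> np.ndarray:
        # Evaluate every class's density and take the argmax per test sample.
        labels = list(self.kdes)
        densities = np.stack([self.kdes[l](features.T) for l in labels], axis=0)
        return np.array(labels)[densities.argmax(axis=0)]

rng = np.random.default_rng(0)
clf = GKDEContinualClassifier()
clf.add_class("task0_classA", rng.normal(0.0, 0.3, size=(200, 2)))
clf.add_class("task1_classB", rng.normal(3.0, 0.3, size=(200, 2)))   # learned later, independently
print(clf.predict(np.array([[0.1, -0.2], [2.9, 3.1]])))              # -> classA, classB
```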

Towards Hetero-Client Federated Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2311.13250
  • repo_url: None
  • paper_authors: Yuxiang Lu, Suizhi Huang, Yuwen Yang, Shalayiding Sirejiding, Yue Ding, Hongtao Lu
    for:HC-FMTL 是一个新的问题设定,它旨在扩展现实世界中的应用,允许不同任务的训练和模型的多标的整合。methods:我们提出了 FedHCA$^2$ 框架,它使用了模型关系modeling来处理不同客户端的数据和任务不同性。我们还提出了 Hyper Conflict-Averse Aggregation 和 Hyper Cross Attention Aggregation 两种方法来解决模型不一致和数据类型的问题。results:我们的实验结果显示,FedHCA$^2$ 在多个HC-FMTL scenario中表现出色,较前代方法好。我们的代码将会公开。
    Abstract Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks, assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability, we introduce a novel problem setting, Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse task setups. The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in accurate model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges, we propose the FedHCA$^2$ framework, which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization, we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally, inspired by task interaction in MTL, the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios compared to representative methods. Our code will be made publicly available.
    摘要 Federated Learning (FL) 允许分布式客户端共同训练使用本地数据的私人方式。 Federated Multi-Task Learning (FMTL) 基于 FL 来处理多个任务,假设每个客户端都使用相同的模型结构。 为了放弃这个假设,我们引入一个新的问题设定:异质客户端 Federated Multi-Task Learning (HC-FMTL),以适应实际应用中的多个任务和数据不同。 HC-FMTL 的主要挑战是模型不一致问题,这使得传统的汇集方法无效。此外,HC-FMTL 还存在数据和任务不同性的困难,这使得准确的模型汇集变得更加困难。为解决这些挑战,我们提出了 FedHCA$^2$ 框架,允许 federated 训练个性化模型。通过模型之间的关系建模,FedHCA$^2$ 可以减少模型不一致问题。此外,我们还提出了 Hyper Conflict-Averse Aggregation 方案,用于在编码器更新中减少冲突。此外,我们还提出了 Hyper Cross Attention Aggregation 方案,使用层次跨注意力来增强解码器之间的互动,同时减少模型不一致问题。最后,我们采用可学习的 Hyper Aggregation Weights,为每个客户端自定义个性化参数更新。我们的实验表明,FedHCA$^2$ 在多种 HC-FMTL 场景中表现出色,至于代表方法的比较。我们的代码将会公开发布。

TDiffDe: A Truncated Diffusion Model for Remote Sensing Hyperspectral Image Denoising

  • paper_url: http://arxiv.org/abs/2311.13622
  • repo_url: None
  • paper_authors: Jiang He, Yajie Li, Jie L, Qiangqiang Yuan
  • for: 对受多种噪声污染的高光谱图像进行去噪(Hyperspectral images corrupted by various noise)
  • methods: 使用截断扩散模型(Truncated diffusion model)来恢复有用信息
  • results: 从包含图像信息的输入出发而非纯噪声,逐步恢复有用信息,避免破坏有效信息(Avoid destroying valid information)
    Abstract Hyperspectral images play a crucial role in precision agriculture, environmental monitoring or ecological analysis. However, due to sensor equipment and the imaging environment, the observed hyperspectral images are often inevitably corrupted by various noise. In this study, we proposed a truncated diffusion model, called TDiffDe, to recover the useful information in hyperspectral images gradually. Rather than starting from a pure noise, the input data contains image information in hyperspectral image denoising. Thus, we cut the trained diffusion model from small steps to avoid the destroy of valid information.
    摘要 高光谱图像在精准农业、环境监测和生态分析中发挥着关键作用。然而,受传感器设备和成像环境的影响,观测到的高光谱图像往往不可避免地受到各种噪声的污染。在本研究中,我们提出了一种截断扩散模型TDiffDe,用于逐步恢复高光谱图像中的有用信息。与从纯噪声出发不同,其输入数据本身就包含图像信息。因此,我们将训练好的扩散模型截断为较小的步数,以避免破坏有效信息。
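The truncation idea can be sketched as a standard DDPM-style reverse process that starts from an intermediate timestep, treating the noisy observation as the state at that step instead of sampling pure Gaussian noise at step T. The schedule, the step count, and the stand-in denoiser below are placeholders, not the paper's trained model or exact sampler.

```python
import torch

@torch.no_grad()
def truncated_reverse_diffusion(x_noisy, denoiser, betas, t_start):
    """Run a DDPM-style reverse process, but starting from an intermediate step.

    Instead of sampling x_T from pure Gaussian noise, the degraded observation
    (e.g. a noisy hyperspectral patch) is treated as the state at step `t_start`,
    so the valid signal it already contains is kept. `denoiser(x, t)` is assumed
    to predict the noise component, as in DDPM.
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = x_noisy
    for t in reversed(range(t_start)):
        eps = denoiser(x, t)
        coef = betas[t] / torch.sqrt(1.0 - alphas_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# toy usage with an untrained stand-in denoiser (real use would plug in the trained network)
betas = torch.linspace(1e-4, 2e-2, 1000)
dummy_denoiser = lambda x, t: torch.zeros_like(x)
noisy_bands = torch.rand(1, 31, 64, 64)          # e.g. a 31-band hyperspectral patch
restored = truncated_reverse_diffusion(noisy_bands, dummy_denoiser, betas, t_start=200)
print(restored.shape)
```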

Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation for Balanced Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2311.13621
  • repo_url: https://github.com/cpsu00/er-kd
  • paper_authors: Chi-Ping Su, Ching-Hsun Tseng, Shin-Jye Lee
  • for: 本研究旨在解决知识蒸馏过程中师生之间的知识差距问题,通过按样本重新加权蒸馏损失,使学生模型更准确地捕捉教师模型的隐性(暗)知识。
  • methods: 本研究提出了一种基于熵重新调整的知识传递方法,即熵重新调整知识传递(ER-KD),该方法通过熵量来重新调整知识传递的损失函数,使学习模型更加注重老师模型的隐性知识。
  • results: 本研究的实验结果表明,ER-KD可以与现有的知识传递方法兼容,同时提高知识传递的性能,而且可以在不同的数据集上实现更好的性能。
    Abstract Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model to a compact "student" model, guiding the student with the "dark knowledge" $\unicode{x2014}$ the implicit insights present in the teacher's soft predictions. Although existing KDs have shown the potential of transferring knowledge, the gap between the two parties still exists. With a series of investigations, we argue the gap is the result of the student's overconfidence in prediction, signaling an imbalanced focus on pronounced features while overlooking the subtle yet crucial dark knowledge. To overcome this, we introduce the Entropy-Reweighted Knowledge Distillation (ER-KD), a novel approach that leverages the entropy in the teacher's predictions to reweight the KD loss on a sample-wise basis. ER-KD precisely refocuses the student on challenging instances rich in the teacher's nuanced insights while reducing the emphasis on simpler cases, enabling a more balanced knowledge transfer. Consequently, ER-KD not only demonstrates compatibility with various state-of-the-art KD methods but also further enhances their performance at negligible cost. This approach offers a streamlined and effective strategy to refine the knowledge transfer process in KD, setting a new paradigm in the meticulous handling of dark knowledge. Our code is available at https://github.com/cpsu00/ER-KD.
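A compact sketch of the entropy-reweighting idea: each sample's distillation term is scaled by the entropy of the teacher's softened prediction, so ambiguous samples rich in dark knowledge carry more weight than easy, near-one-hot samples. The temperature value and the exact weighting (e.g., any normalization of the entropy) are assumptions; the repository linked above is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def entropy_reweighted_kd_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Sample-wise entropy-reweighted knowledge-distillation loss (a sketch of the ER-KD idea)."""
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    # per-sample KL divergence between softened teacher and student predictions
    kl = (t_prob * (torch.log(t_prob + 1e-12) - s_logprob)).sum(dim=-1)
    # per-sample entropy of the teacher's softened prediction, used as the weight
    entropy = -(t_prob * torch.log(t_prob + 1e-12)).sum(dim=-1)
    return (entropy * kl).mean() * (temperature ** 2)

# toy usage: one confident and one ambiguous teacher prediction
teacher = torch.tensor([[8.0, 0.1, 0.1], [1.0, 0.9, 0.8]])
student = torch.randn(2, 3, requires_grad=True)
loss = entropy_reweighted_kd_loss(student, teacher)
loss.backward()
print(loss.item())
```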

Towards Detecting, Recognizing, and Parsing the Address Information from Bangla Signboard: A Deep Learning-based Approach

  • paper_url: http://arxiv.org/abs/2311.13222
  • repo_url: None
  • paper_authors: Hasan Murad, Mohammed Eunus Ali
    for: 这个研究旨在提高孟加拉语(Bangla)招牌地址信息的提取效果,并提供一个综合系统来检测、识别、修正和解析地址信息。methods: 该研究使用了基于深度学习的模型,包括基于CTC的模型和Encoder-Decoder模型,以及一种新的地址文本修正模型。results: 研究显示,基于CTC的模型可以获得较高的识别率,而Encoder-Decoder式的修正模型可以进一步提升识别效果。此外,研究还开发了一个基于Transformer的地址文本解析器,有助于提高孟加拉语招牌地址信息的提取效果。
    Abstract Retrieving textual information from natural scene images is an active research area in the field of computer vision with numerous practical applications. Detecting text regions and extracting text from signboards is a challenging problem due to special characteristics like reflecting lights, uneven illumination, or shadows found in real-life natural scene images. With the advent of deep learning-based methods, different sophisticated techniques have been proposed for text detection and text recognition from the natural scene. Though a significant amount of effort has been devoted to extracting natural scene text for resourceful languages like English, little has been done for low-resource languages like Bangla. In this research work, we have proposed an end-to-end system with deep learning-based models for efficiently detecting, recognizing, correcting, and parsing address information from Bangla signboards. We have created manually annotated datasets and synthetic datasets to train signboard detection, address text detection, address text recognition, address text correction, and address text parser models. We have conducted a comparative study among different CTC-based and Encoder-Decoder model architectures for Bangla address text recognition. Moreover, we have designed a novel address text correction model using a sequence-to-sequence transformer-based network to improve the performance of Bangla address text recognition model by post-correction. Finally, we have developed a Bangla address text parser using the state-of-the-art transformer-based pre-trained language model.
    摘要 “从自然场景图像中提取文本信息是计算机视觉领域的活跃研究领域,具有许多实际应用。检测文本区域并从牌匾中提取文本是一个具有特殊特征的问题,如反射光、不均匀照明或阴影,发现在真实的自然场景图像中。随着深度学习基于方法的出现,不同的复杂技术已经被提议用于文本检测和文本识别从自然场景图像中。虽然对于英语等资源语言进行了大量的努力,但对于low-resource语言如孟加拉语来说,却很少进行了研究。在这项研究中,我们提出了一个端到端的系统,使用深度学习基于模型来高效地检测、识别、更正和分析孟加拉牌匾上的地址信息。我们创建了手动标注的数据集和 sintetic 数据集,用于训练牌匾检测模型、地址文本检测模型、地址文本识别模型、地址文本更正模型和地址文本解析模型。我们进行了不同的 CTC-based 和 Encoder-Decoder 模型建立的比较研究,以及设计了一种基于 transformer 网络的新的地址文本更正模型,以提高孟加拉地址文本识别模型的性能。最后,我们使用现有的 transformer 预训练语言模型来开发一个孟加拉地址文本解析模型。”

Test-time Adaptive Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2311.13209
  • repo_url: https://github.com/Feliciaxyao/FSTTA
  • paper_authors: Junyu Gao, Xuan Yao, Changsheng Xu
  • for: 提高 Vision-and-Language Navigation (VLN) 模型的泛化能力,使其在多种环境下能够更好地适应变化。
  • methods: 提出了 Fast-Slow Test-Time Adaptation (FSTTA) 方法,通过对 gradients 和参数进行分解-积累分析,实现了在快更新阶段快速适应,以及在慢更新阶段保持模型稳定。
  • results: 对四个流行的标准 benchmark 进行了广泛的实验,并取得了非常出色的性能提升。
    Abstract Vision-and-Language Navigation (VLN) has witnessed significant advancements in recent years, largely attributed to meticulously curated datasets and proficiently trained models. Nevertheless, when tested in diverse environments, the trained models inevitably encounter significant shifts in data distribution, highlighting that relying solely on pre-trained and fixed navigation models is insufficient. To enhance models' generalization ability, test-time adaptation (TTA) demonstrates significant potential in the computer vision field by leveraging unlabeled test samples for model updates. However, simply applying existing TTA methods to the VLN task cannot well handle the adaptability-stability dilemma of VLN models, i.e., frequent updates can result in drastic changes in model parameters, while occasional updates can make the models ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for VLN by performing decomposition-accumulation analysis for both gradients and parameters in a unified framework. Specifically, in the fast update phase, gradients generated during the recent multi-step navigation process are decomposed into components with varying levels of consistency. Then, these components are adaptively accumulated to pinpoint a concordant direction for fast model adaptation. In the slow update phase, historically recorded parameters are gathered, and a similar decomposition-accumulation analysis is conducted to revert the model to a stable state. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks.
    摘要 Computer vision field 中的 Test-Time Adaptation (TTA) 技术在最近几年内得到了广泛的发展,它可以在测试时使用无标签数据来更新模型。然而,在不同的环境中测试的模型往往会遇到数据分布的显著变化,这表明仅仅靠先天学习的模型是不够的。为了增强模型的泛化能力,我们提出了 Fast-Slow Test-Time Adaptation (FSTTA) 方法,该方法可以在 VLN 任务中处理泛化-稳定之间的矛盾。具体来说,在快更新阶段,在最近的多步导航过程中生成的梯度被分解成具有不同一致性水平的组分。然后,这些组分被可适应地积累以确定一个 concordant 方向的快速模型更新。在慢更新阶段,历史记录的参数被聚集,并进行了相似的分解-积累分析,以恢复模型到一个稳定状态。我们的方法在四个流行的标准测试上表现出了很好的性能提升。

The Challenges of Image Generation Models in Generating Multi-Component Images

  • paper_url: http://arxiv.org/abs/2311.13620
  • repo_url: None
  • paper_authors: Tham Yik Foong, Shashank Kotyan, Po Yuan Mao, Danilo Vasconcellos Vargas
  • for: 本研究旨在探讨现代文本到图像生成器的进一步发展,尤其是对于使用复杂提示的图像生成质量的影响。
  • methods: 本研究使用了多种生成模型,包括Stable Diffusion V2,并在这些模型上进行了微调。
  • results: 研究发现,现有的文本到图像生成器在处理多个组件的提示时存在重大的限制,导致图像质量下降和上下文感知减退。
    Abstract Recent advances in text-to-image generators have led to substantial capabilities in image generation. However, the complexity of prompts acts as a bottleneck in the quality of images generated. A particular under-explored facet is the ability of generative models to create high-quality images comprising multiple components given as a prior. In this paper, we propose and validate a metric called Components Inclusion Score (CIS) to evaluate the extent to which a model can correctly generate multiple components. Our results reveal that the evaluated models struggle to incorporate all the visual elements from prompts with multiple components (8.53% drop in CIS per component for all evaluated models). We also identify a significant decline in the quality of the images and context awareness within an image as the number of components increased (15.91% decrease in inception Score and 9.62% increase in Frechet Inception Distance). To remedy this issue, we fine-tuned Stable Diffusion V2 on a custom-created test dataset with multiple components, outperforming its vanilla counterpart. To conclude, these findings reveal a critical limitation in existing text-to-image generators, shedding light on the challenge of generating multiple components within a single image using a complex prompt.
    摘要

Steal My Artworks for Fine-tuning? A Watermarking Framework for Detecting Art Theft Mimicry in Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2311.13619
  • repo_url: None
  • paper_authors: Ge Luo, Junqiang Huang, Manman Zhang, Zhenxing Qian, Sheng Li, Xinpeng Zhang
  • for: 保护艺术家的版权和鼓励原创作品
  • methods: 将细微的水印嵌入数字艺术作品中,在保护版权的同时保留艺术家的视觉表达;若有人将带水印的作品用作训练数据来模仿艺术家风格,这些水印即可作为可检测的标志,通过分析其在生成图像中的分布来暴露非法微调与盗用作品的行为
  • results: 研究表明,在多种微调场景和水印攻击方法下,通过分析人工生成图像中水印的分布,能够可靠地检测并揭露未经授权的模仿行为,从而保护艺术家的版权并鼓励原创作品
    Abstract The advancement in text-to-image models has led to astonishing artistic performances. However, several studios and websites illegally fine-tune these models using artists' artworks to mimic their styles for profit, which violates the copyrights of artists and diminishes their motivation to produce original works. Currently, there is a notable lack of research focusing on this issue. In this paper, we propose a novel watermarking framework that detects mimicry in text-to-image models through fine-tuning. This framework embeds subtle watermarks into digital artworks to protect their copyrights while still preserving the artist's visual expression. If someone takes watermarked artworks as training data to mimic an artist's style, these watermarks can serve as detectable indicators. By analyzing the distribution of these watermarks in a series of generated images, acts of fine-tuning mimicry using stolen victim data will be exposed. In various fine-tune scenarios and against watermark attack methods, our research confirms that analyzing the distribution of watermarks in artificially generated images reliably detects unauthorized mimicry.
    摘要 文本到图像模型的进步已导致惊人的艺术表演。然而,许多工作室和网站违法细化这些模型,使用艺术家的作品来模仿他们的风格以获利,这违反了艺术家的版权和减少了他们创作原创作品的动力。目前,这个问题尚未得到了足够的研究。在这篇论文中,我们提出了一种新的水印框架,通过细化来探测文本到图像模型中的模仿。这个框架将隐藏的水印 embed 到数字艺术作品中,以保护艺术家的版权,同时仍保持艺术家的视觉表达。如果有人使用水印图作为训练数据来模仿艺术家的风格,这些水印就可以作为检测器。通过分析水印在生成图像序列中的分布,我们可以检测到使用盗取受害者数据进行细化模仿。在不同的细化场景和水印攻击方法下,我们的研究表明,通过分析水印在人工生成图像中的分布可靠地检测非法模仿。

Self-guided Few-shot Semantic Segmentation for Remote Sensing Imagery Based on Large Vision Models

  • paper_url: http://arxiv.org/abs/2311.13200
  • repo_url: None
  • paper_authors: Xiyu Qi, Yifan Wu, Yongqiang Mao, Wenhui Zhang, Yidan Zhang
  • for: 这篇论文提出了一种自动化的小样本语义分割方法,用于遥感图像的小样本语义分割。
  • methods: 该方法基于SAM模型,并采用一种新的自动提示学习方法,利用先验引导的掩码生成粗略的像素级提示。
  • results: 实验结果表明,该方法在DLRSD数据集上表现出色,优于其他可用的小样本分割方法。
    Abstract The Segment Anything Model (SAM) exhibits remarkable versatility and zero-shot learning abilities, owing largely to its extensive training data (SA-1B). Recognizing SAM's dependency on manual guidance given its category-agnostic nature, we identified unexplored potential within few-shot semantic segmentation tasks for remote sensing imagery. This research introduces a structured framework designed for the automation of few-shot semantic segmentation. It utilizes the SAM model and facilitates a more efficient generation of semantically discernible segmentation outcomes. Central to our methodology is a novel automatic prompt learning approach, leveraging prior guided masks to produce coarse pixel-wise prompts for SAM. Extensive experiments on the DLRSD datasets underline the superiority of our approach, outperforming other available few-shot methodologies.
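One way to read "coarse pixel-wise prompts" is to sample point prompts from a prior-guided probability mask and hand them to the promptable segmenter. The sketch below only shows that conversion step with illustrative names; it does not call the SAM API itself, and the paper's actual prompt-construction strategy may differ.

```python
import numpy as np

def prompts_from_coarse_mask(prob_mask: np.ndarray, num_points: int = 5, threshold: float = 0.5):
    """Turn a coarse pixel-wise probability mask into point prompts for a promptable segmenter.

    Positive prompts are sampled from confident foreground pixels and negative prompts
    from confident background, giving (x, y) coordinates plus 1/0 labels in the format
    promptable models such as SAM typically expect.
    """
    rng = np.random.default_rng(0)
    fg = np.argwhere(prob_mask >= threshold)
    bg = np.argwhere(prob_mask < threshold)

    def sample(pixels, count):
        if len(pixels) == 0:
            return np.empty((0, 2), dtype=int)
        idx = rng.choice(len(pixels), size=min(count, len(pixels)), replace=False)
        return pixels[idx][:, ::-1]        # (row, col) -> (x, y)

    pos = sample(fg, num_points)
    neg = sample(bg, num_points)
    coords = np.concatenate([pos, neg], axis=0)
    labels = np.concatenate([np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)])
    return coords, labels

coarse = np.zeros((128, 128))
coarse[40:80, 50:90] = 0.9                 # stand-in for a prior-guided coarse prediction
coords, labels = prompts_from_coarse_mask(coarse)
print(coords.shape, labels)
```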

DRIFu: Differentiable Rendering and Implicit Function-based Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2311.13199
  • repo_url: https://github.com/kuangzijian/drifu-for-animals
  • paper_authors: Zijian Kuang, Lihang Ying, Shi Jin, Li Cheng
  • for: This paper aims to develop a novel 3D digitization technique for live animals, specifically tailored for avian forms.
  • methods: The proposed method, called DRIFu, leverages a curated set of synthetic 3D animal models, innovative alignment tools, and a shared shape space to enable precise predictions of animal shape and texture.
  • results: The DRIFu model has the potential to revolutionize our understanding and representation of avian forms, enabling realistic posing, animation, and alignment with real-world data.
    Abstract The Differentiable Rendering and Implicit Function-based model (DRIFu) draws its roots from the Pixel-aligned Implicit Function (PIFU), a pioneering 3D digitization technique initially designed for clothed human bodies. PIFU excels in capturing nuanced body shape variations within a low-dimensional space and has been extensively trained on human 3D scans. However, the application of PIFU to live animals poses significant challenges, primarily due to the inherent difficulty in obtaining the cooperation of animals for 3D scanning. In response to this challenge, we introduce the DRIFu model, specifically tailored for animal digitization. To train DRIFu, we employ a curated set of synthetic 3D animal models, encompassing diverse shapes, sizes, and even accounting for variations such as baby birds. Our innovative alignment tools play a pivotal role in mapping these diverse synthetic animal models onto a unified template, facilitating precise predictions of animal shape and texture. Crucially, our template alignment strategy establishes a shared shape space, allowing for the seamless sampling of new animal shapes, posing them realistically, animating them, and aligning them with real-world data. This groundbreaking approach revolutionizes our capacity to comprehensively understand and represent avian forms. For further details and access to the project, the project website can be found at https://github.com/kuangzijian/drifu-for-animals

DoubleAUG: Single-domain Generalized Object Detector in Urban via Color Perturbation and Dual-style Memory

  • paper_url: http://arxiv.org/abs/2311.13198
  • repo_url: None
  • paper_authors: Lei Qi, Peng Dong, Tan Xiong, Hui Xue, Xin Geng
  • for: solve the single-domain generalizable object detection task in urban scenarios
  • methods: Double AUGmentation (DoubleAUG) method that includes image- and feature-level augmentation schemes, Color Perturbation (CP) method, and Dual-Style Memory (DSM)
  • results: outperforms state-of-the-art methods, is effective in enhancing the model's generalization capability, and can be integrated into existing methods to further improve model performance.
    Abstract Object detection in urban scenarios is crucial for autonomous driving in intelligent traffic systems. However, unlike conventional object detection tasks, urban-scene images vary greatly in style. For example, images taken on sunny days differ significantly from those taken on rainy days. Therefore, models trained on sunny day images may not generalize well to rainy day images. In this paper, we aim to solve the single-domain generalizable object detection task in urban scenarios, meaning that a model trained on images from one weather condition should be able to perform well on images from any other weather conditions. To address this challenge, we propose a novel Double AUGmentation (DoubleAUG) method that includes image- and feature-level augmentation schemes. In the image-level augmentation, we consider the variation in color information across different weather conditions and propose a Color Perturbation (CP) method that randomly exchanges the RGB channels to generate various images. In the feature-level augmentation, we propose to utilize a Dual-Style Memory (DSM) to explore the diverse style information on the entire dataset, further enhancing the model's generalization capability. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods. Furthermore, ablation studies confirm the effectiveness of each module in our proposed method. Moreover, our method is plug-and-play and can be integrated into existing methods to further improve model performance.
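
The abstract specifies only that the Color Perturbation step randomly exchanges RGB channels; a minimal sketch of such an image-level augmentation might look like the following (the function name and the per-image permutation are assumptions, not the authors' code):

```python
import torch

def color_perturbation(images: torch.Tensor) -> torch.Tensor:
    """Randomly exchange RGB channels, independently for each image in the batch.

    images: float tensor of shape (B, 3, H, W).
    """
    out = torch.empty_like(images)
    for i in range(images.size(0)):
        perm = torch.randperm(3, device=images.device)  # random channel order
        out[i] = images[i, perm]
    return out
```

Because the perturbation only permutes channels, image content and geometry are untouched, which is why it can be applied to detection data without re-labelling boxes.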

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

  • paper_url: http://arxiv.org/abs/2311.13617
  • repo_url: None
  • paper_authors: Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie
  • For: This paper proposes a multi-stage single-image-to-3D generation method that can robustly generate reasonable 3D objects across different data domains.
  • Methods: The method trains the NeRF with a better 3D prior. Specifically, the authors train an object-level LoRA for the target object using the original image and the rendering output of NeRF, and then train the LoRA and NeRF with a progressive training strategy.
  • Results: Experiments show that the proposed method learns an object-specific 3D prior beyond the ability of pre-trained diffusion priors and achieves state-of-the-art performance on the single image-to-3D generation task.
    Abstract We present Boosting3D, a multi-stage single image-to-3D generation method that can robustly generate reasonable 3D objects in different data domains. The point of this work is to solve the view consistency problem in single image-guided 3D generation by modeling a reasonable geometric structure. For this purpose, we propose to utilize better 3D prior to training the NeRF. More specifically, we train an object-level LoRA for the target object using original image and the rendering output of NeRF. And then we train the LoRA and NeRF using a progressive training strategy. The LoRA and NeRF will boost each other while training. After the progressive training, the LoRA learns the 3D information of the generated object and eventually turns to an object-level 3D prior. In the final stage, we extract the mesh from the trained NeRF and use the trained LoRA to optimize the structure and appearance of the mesh. The experiments demonstrate the effectiveness of the proposed method. Boosting3D learns object-specific 3D prior which is beyond the ability of pre-trained diffusion priors and achieves state-of-the-art performance in the single image-to-3d generation task.

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

  • paper_url: http://arxiv.org/abs/2311.13616
  • repo_url: None
  • paper_authors: Zefan Qu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Cairong Zhao
  • for: Improving online video quality; online applications such as video conferencing and cloud gaming require both low latency and high-quality video.
  • methods: Proposes STLVQE, whose Module-Agnostic Feature Extractor and Spatial-Temporal Look-up Tables greatly reduce redundant computation while the propagation, alignment, and enhancement modules of the network are redesigned.
  • results: Extensive experiments on the MFQE 2.0 dataset show that STLVQE achieves a satisfactory performance-speed trade-off.
    Abstract Low latency rates are crucial for online video-based applications, such as video conferencing and cloud gaming, which make improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and the requirement for temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. Our STLVQE designs a new VQE framework which contains a Module-Agnostic Feature Extractor that greatly reduces the redundant computations and redesign the propagation, alignment, and enhancement module of the network. A Spatial-Temporal Look-up Tables (STL) is proposed, which extracts spatial-temporal information in videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that our STLVQE achieves a satisfactory performance-speed trade-off.

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs

  • paper_url: http://arxiv.org/abs/2311.13194
  • repo_url: None
  • paper_authors: Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li
  • for: Improving the performance of document understanding models in text-rich scenarios.
  • methods: Fine-tunes Multimodal Large Language Models (MLLMs) and integrates text-location data into the instructions, improving the model's ability to ground and understand text within text-rich images.
  • results: Experiments show that the method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
    Abstract In the field of document understanding, significant advances have been made in the fine-tuning of Multimodal Large Language Models (MLLMs) with instruction-following data. Nevertheless, the potential of text-grounding capability within text-rich scenarios remains underexplored. In this paper, we present a text-grounding document understanding model, termed TGDoc, which addresses this deficiency by enhancing MLLMs with the ability to discern the spatial positioning of text within images. Empirical evidence suggests that text-grounding improves the model's interpretation of textual content, thereby elevating its proficiency in comprehending text-rich images. Specifically, we compile a dataset containing 99K PowerPoint presentations sourced from the internet. We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model. Moreover, we curate a collection of text-rich images and prompt the text-only GPT-4 to generate 12K high-quality conversations, featuring textual locations within text-rich scenarios. By integrating text location data into the instructions, TGDoc is adept at discerning text locations during the visual question process. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
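
The abstract describes integrating text-location data into instruction-tuning samples but does not give the exact prompt template; the sketch below shows one plausible way such a grounded sample could be assembled (the `<box>` tag format and 0-1000 coordinate normalisation are assumptions, not the paper's format):

```python
def build_grounded_instruction(image_id: str, words: list[dict]) -> dict:
    """Format a text-spotting instruction sample whose answer carries text locations.

    words: list of {"text": str, "bbox": (x1, y1, x2, y2)} with coordinates
    normalized to [0, 1000), a common convention for grounded instruction data.
    """
    answer = "; ".join(
        f'"{w["text"]}" at <box>{w["bbox"][0]},{w["bbox"][1]},{w["bbox"][2]},{w["bbox"][3]}</box>'
        for w in words
    )
    return {
        "image": image_id,
        "instruction": "Spot all text in the image and give each word's location.",
        "answer": answer,
    }

# Hypothetical example sample.
sample = build_grounded_instruction(
    "slide_0001.png",
    [{"text": "Revenue", "bbox": (112, 80, 298, 126)}],
)
```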

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.13615
  • repo_url: https://github.com/T1sweet/HEViTPose
  • paper_authors: Chengpeng Wu, Guangxing Tan, Chunyu Li
  • for: Improving the efficiency and accuracy of human pose estimation, especially in complicated scenes.
  • methods: Proposes a High-Efficiency Vision Transformer (HEViTPose) with a Cascaded Group Spatial Reduction Multi-Head Attention module (CGSR-MHA) that reduces computational cost through feature grouping and spatial reduction while preserving feature diversity via multiple low-dimensional attention heads. A Patch Embedded Overlap Width (PEOW) concept is defined to relate the amount of patch overlap to local continuity; optimizing PEOW improves performance, parameters, and GFLOPs.
  • results: On the MPII and COCO benchmarks, the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight. HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set, while reducing parameters ($\downarrow$62.1%, $\downarrow$80.4%) and GFLOPs ($\downarrow$43.4%, $\downarrow$63.8%) compared with HRNet-W32 and Swin-S.
    Abstract Human pose estimation in complicated situations has always been a challenging task. Many Transformer-based pose networks have been proposed recently, achieving encouraging progress in improving performance. However, the remarkable performance of pose networks is always accompanied by heavy computation costs and large network scale. In order to deal with this problem, this paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose). In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost through feature grouping and spatial degradation mechanisms, while preserving feature diversity through multiple low-dimensional attention heads. Moreover, a concept of Patch Embedded Overlap Width (PEOW) is defined to help understand the relationship between the amount of overlap and local continuity. By optimising PEOW, our model gains improvements in performance, parameters and GFLOPs. Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight. Specifically, HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set. Compared with HRNet-W32 and Swin-S, our HEViTPose-B significantly reducing Params ($\downarrow$62.1%,$\downarrow$80.4%,) and GFLOPs ($\downarrow$43.4%,$\downarrow$63.8%,). Code and models are available at \url{here}.
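
The CGSR-MHA module combines cascaded head grouping with spatial reduction; as a rough illustration of the spatial-reduction half of that idea only (not the authors' module, whose grouping and cascading are omitted), a PVT-style attention block might look like this:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention whose keys/values come from a spatially reduced
    feature map. Generic sketch of the spatial-reduction idea; `dim` must be
    divisible by `num_heads`."""

    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Strided convolution shrinks the key/value token grid by sr_ratio.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)    # reduced tokens
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                      # (B, heads, N', d)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```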

NeISF: Neural Incident Stokes Field for Geometry and Material Estimation

  • paper_url: http://arxiv.org/abs/2311.13187
  • repo_url: None
  • paper_authors: Chenhao Li, Taishi Ono, Takeshi Uemori, Hajime Mihara, Alexander Gatto, Hajime Nagahara, Yusuke Moriuchi
  • for: Addressing multi-view inverse rendering, i.e., estimating scene parameters (shape, material, illumination) from a sequence of images captured under different viewpoints.
  • methods: Proposes Neural Incident Stokes Fields (NeISF), which use polarization cues to reduce ambiguity. Because polarization accumulates over multiple light bounces, the incident Stokes field efficiently models the accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer.
  • results: Experiments show that the method outperforms existing works on both synthetic and real scenes.
    Abstract Multi-view inverse rendering is the problem of estimating the scene parameters such as shapes, materials, or illuminations from a sequence of images captured under different viewpoints. Many approaches, however, assume single light bounce and thus fail to recover challenging scenarios like inter-reflections. On the other hand, simply extending those methods to consider multi-bounced light requires more assumptions to alleviate the ambiguity. To address this problem, we propose Neural Incident Stokes Fields (NeISF), a multi-view inverse rendering framework that reduces ambiguities using polarization cues. The primary motivation for using polarization cues is that it is the accumulation of multi-bounced light, providing rich information about geometry and material. Based on this knowledge, the proposed incident Stokes field efficiently models the accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer. Lastly, experimental results show that our method outperforms the existing works in synthetic and real scenarios.
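
The incident Stokes field reasons about polarization that accumulates over light bounces; the underlying bookkeeping uses Stokes vectors transformed by Mueller matrices. The sketch below is textbook polarimetry rather than the paper's renderer (the example Stokes vector and polarizer angles are arbitrary):

```python
import numpy as np

def linear_polarizer(theta: float) -> np.ndarray:
    """Mueller matrix of an ideal linear polarizer at angle theta (radians).
    Standard polarimetry, not taken from the paper itself."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return 0.5 * np.array([
        [1, c,     s,     0],
        [c, c * c, s * c, 0],
        [s, s * c, s * s, 0],
        [0, 0,     0,     0],
    ])

# A Stokes vector [S0, S1, S2, S3]: total intensity plus the linear and circular
# polarization components that a polarimetric camera constrains.
stokes_in = np.array([1.0, 0.3, 0.1, 0.0])

# Intensities observed behind polarizers at 0/45/90/135 degrees, as a
# division-of-focal-plane polarization sensor would provide.
for deg in (0, 45, 90, 135):
    s_out = linear_polarizer(np.deg2rad(deg)) @ stokes_in
    print(f"{deg:3d} deg: I = {s_out[0]:.3f}")
```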

Applications of Spiking Neural Networks in Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2311.13186
  • repo_url: https://github.com/qvpr/vprsnn
  • paper_authors: Somayeh Hussaini, Michael Milford, Tobias Fischer
  • for: Exploring the potential of spiking neural networks (SNNs) for visual place recognition (VPR) in robotic tasks, particularly when implemented on neuromorphic hardware.
  • methods: Three advances are proposed: Modular SNNs, where each SNN represents a set of non-overlapping, geographically distinct places and scales to large environments; Ensembles of Modular SNNs, where multiple networks represent the same place to improve accuracy; and sequence matching, which uses consecutive images to refine place recognition.
  • results: Modular SNNs and their ensembles improve the accuracy and robustness of VPR and respond well to sequence matching. The SNNs are compact, with only 1500 neurons and 474k synapses, which makes them well suited to ensembling.
    Abstract In robotics, Spiking Neural Networks (SNNs) are increasingly recognized for their largely-unrealized potential energy efficiency and low latency particularly when implemented on neuromorphic hardware. Our paper highlights three advancements for SNNs in Visual Place Recognition (VPR). First, we propose Modular SNNs, where each SNN represents a set of non-overlapping geographically distinct places, enabling scalable networks for large environments. Secondly, we present Ensembles of Modular SNNs, where multiple networks represent the same place, significantly enhancing accuracy compared to single-network models. Our SNNs are compact and small, comprising only 1500 neurons and 474k synapses, which makes them ideally suited for ensembling due to this small size. Lastly, we investigate the role of sequence matching in SNN-based VPR, a technique where consecutive images are used to refine place recognition. We analyze the responsiveness of SNNs to ensembling and sequence matching compared to other VPR techniques. Our contributions highlight the viability of SNNs for VPR, offering scalable and robust solutions, paving the way for their application in various energy-sensitive robotic tasks.

Differentiable Radio Frequency Ray Tracing for Millimeter-Wave Sensing

  • paper_url: http://arxiv.org/abs/2311.13182
  • repo_url: None
  • paper_authors: Xingyu Chen, Xinyu Zhang, Qiyue Xia, Xinmin Fang, Chris Xiaoxuan Lu, Zhengxiong Li
  • for: Achieving fine-grained 3D reconstruction from millimeter-wave (mmWave) sensing.
  • methods: Uses a differentiable ray-tracing engine to simulate radar point clouds from virtual 3D models, and a gradient-based optimizer that refines the model parameters to minimize the discrepancy between real and simulated point clouds.
  • results: Experiments show that DiffSBR reconstructs 3D objects with high fidelity, including novel objects not previously seen by the radar.
    Abstract Millimeter wave (mmWave) sensing is an emerging technology with applications in 3D object characterization and environment mapping. However, realizing precise 3D reconstruction from sparse mmWave signals remains challenging. Existing methods rely on data-driven learning, constrained by dataset availability and difficulty in generalization. We propose DiffSBR, a differentiable framework for mmWave-based 3D reconstruction. DiffSBR incorporates a differentiable ray tracing engine to simulate radar point clouds from virtual 3D models. A gradient-based optimizer refines the model parameters to minimize the discrepancy between simulated and real point clouds. Experiments using various radar hardware validate DiffSBR's capability for fine-grained 3D reconstruction, even for novel objects unseen by the radar previously. By integrating physics-based simulation with gradient optimization, DiffSBR transcends the limitations of data-driven approaches and pioneers a new paradigm for mmWave sensing.
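
DiffSBR refines model parameters by gradient descent on the discrepancy between simulated and real radar point clouds; a generic sketch of that outer loop is shown below, with `simulate` standing in for the paper's differentiable ray-tracing engine and Chamfer distance as an assumed discrepancy measure:

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point clouds of shape (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def fit_model(simulate, params: torch.Tensor, real_pts: torch.Tensor,
              steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Generic gradient loop: `simulate(params)` must return a simulated point cloud
    differentiably, as the paper's renderer does; all hyper-parameters are illustrative."""
    params = params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = chamfer(simulate(params), real_pts)
        loss.backward()
        opt.step()
    return params.detach()
```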

Volumetric Reconstruction Resolves Off-Resonance Artifacts in Static and Dynamic PROPELLER MRI

  • paper_url: http://arxiv.org/abs/2311.13177
  • repo_url: https://github.com/sarafridov/volumetric-propeller
  • paper_authors: Annesha Ghosh, Gordon Wetzstein, Mert Pilanci, Sara Fridovich-Keil
  • for: Resolving off-resonance artifacts in MRI to improve diagnostic image quality.
  • methods: Lifts the 2D reconstruction problem to 3D by introducing an additional "spectral" dimension that models off-resonance, inspired by recent progress in modeling radiance fields.
  • results: Reconstructs both static and dynamic MR images and separates fat and water, which is of independent clinical interest.
    Abstract Off-resonance artifacts in magnetic resonance imaging (MRI) are visual distortions that occur when the actual resonant frequencies of spins within the imaging volume differ from the expected frequencies used to encode spatial information. These discrepancies can be caused by a variety of factors, including magnetic field inhomogeneities, chemical shifts, or susceptibility differences within the tissues. Such artifacts can manifest as blurring, ghosting, or misregistration of the reconstructed image, and they often compromise its diagnostic quality. We propose to resolve these artifacts by lifting the 2D MRI reconstruction problem to 3D, introducing an additional "spectral" dimension to model this off-resonance. Our approach is inspired by recent progress in modeling radiance fields, and is capable of reconstructing both static and dynamic MR images as well as separating fat and water, which is of independent clinical interest. We demonstrate our approach in the context of PROPELLER (Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction) MRI acquisitions, which are popular for their robustness to motion artifacts. Our method operates in a few minutes on a single GPU, and to our knowledge is the first to correct for chemical shift in gradient echo PROPELLER MRI reconstruction without additional measurements or pretraining data.

Learning to Complement with Multiple Humans (LECOMH): Integrating Multi-rater and Noisy-Label Learning into Human-AI Collaboration

  • paper_url: http://arxiv.org/abs/2311.13172
  • repo_url: None
  • paper_authors: Zheng Zhang, Kevin Wells, Gustavo Carneiro
  • For: Developing robust classifiers that can address the challenges posed by different types of data imperfections and complex decision processes in real-world applications.
  • Methods: Integrates noisy-label learning, multi-rater learning, and human-AI collaboration with new benchmarks and the innovative Learning to Complement with Multiple Humans (LECOMH) approach.
  • Results: LECOMH consistently outperforms leading human-AI collaboration methods on the proposed benchmarks, with accuracy improving as collaboration costs increase; it is also the only method that enhances human labeller performance across all benchmarks.
    Abstract The advent of learning with noisy labels (LNL), multi-rater learning, and human-AI collaboration has revolutionised the development of robust classifiers, enabling them to address the challenges posed by different types of data imperfections and complex decision processes commonly encountered in real-world applications. While each of these methodologies has individually made significant strides in addressing their unique challenges, the development of techniques that can simultaneously tackle these three problems remains underexplored. This paper addresses this research gap by integrating noisy-label learning, multi-rater learning, and human-AI collaboration with new benchmarks and the innovative Learning to Complement with Multiple Humans (LECOMH) approach. LECOMH optimises the level of human collaboration during testing, aiming to optimise classification accuracy while minimising collaboration costs that vary from 0 to M, where M is the maximum number of human collaborators. We quantitatively compare LECOMH with leading human-AI collaboration methods using our proposed benchmarks. LECOMH consistently outperforms the competition, with accuracy improving as collaboration costs increase. Notably, LECOMH is the only method enhancing human labeller performance across all benchmarks.

3D Face Style Transfer with a Hybrid Solution of NeRF and Mesh Rasterization

  • paper_url: http://arxiv.org/abs/2311.13168
  • repo_url: None
  • paper_authors: Jianwei Feng, Prateek Singhal
  • for: Achieving 3D face style transfer, i.e., generating stylized novel views of a 3D human face with multi-view consistency given a style reference image.
  • methods: Represents the 3D face with a neural radiance field (NeRF) and combines it with 2D style transfer. Since training a NeRF directly on stylized images causes 3D inconsistency and blurriness, and joint training with 2D style transfer objectives converges poorly and is costly in time and memory, a hybrid framework of NeRF and mesh rasterization is proposed to combine the high-fidelity geometry reconstruction of NeRF with the fast rendering speed of meshes.
  • results: The method generates high-quality face style transfer with strong 3D consistency while enabling flexible style control.
    Abstract Style transfer for human face has been widely researched in recent years. Majority of the existing approaches work in 2D image domain and have 3D inconsistency issue when applied on different viewpoints of the same face. In this paper, we tackle the problem of 3D face style transfer which aims at generating stylized novel views of a 3D human face with multi-view consistency. We propose to use a neural radiance field (NeRF) to represent 3D human face and combine it with 2D style transfer to stylize the 3D face. We find that directly training a NeRF on stylized images from 2D style transfer brings in 3D inconsistency issue and causes blurriness. On the other hand, training a NeRF jointly with 2D style transfer objectives shows poor convergence due to the identity and head pose gap between style image and content image. It also poses challenge in training time and memory due to the need of volume rendering for full image to apply style transfer loss functions. We therefore propose a hybrid framework of NeRF and mesh rasterization to combine the benefits of high fidelity geometry reconstruction of NeRF and fast rendering speed of mesh. Our framework consists of three stages: 1. Training a NeRF model on input face images to learn the 3D geometry; 2. Extracting a mesh from the trained NeRF model and optimizing it with style transfer objectives via differentiable rasterization; 3. Training a new color network in NeRF conditioned on a style embedding to enable arbitrary style transfer to the 3D face. Experiment results show that our approach generates high quality face style transfer with great 3D consistency, while also enabling a flexible style control.

Test-Time Augmentation for 3D Point Cloud Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2311.13152
  • repo_url: None
  • paper_authors: Tuan-Anh Vu, Srinjay Sarkar, Zhiyuan Zhang, Binh-Son Hua, Sai-Kit Yeung
  • For: The paper is written for improving the performance of 3D deep learning tasks, specifically addressing the issue of sparse point cloud representation.
  • Methods: The paper explores test-time augmentation (TTA) for 3D point clouds, leveraging implicit field reconstruction and point cloud upsampling techniques to augment point cloud data.
  • Results: The paper shows that both strategies are effective in improving accuracy, with point cloud upsampling leading to more significant performance improvement on downstream tasks such as object classification and segmentation on several datasets.
    Abstract Data augmentation is a powerful technique to enhance the performance of a deep learning task but has received less attention in 3D deep learning. It is well known that when 3D shapes are sparsely represented with low point density, the performance of the downstream tasks drops significantly. This work explores test-time augmentation (TTA) for 3D point clouds. We are inspired by the recent revolution of learning implicit representation and point cloud upsampling, which can produce high-quality 3D surface reconstruction and proximity-to-surface, respectively. Our idea is to leverage the implicit field reconstruction or point cloud upsampling techniques as a systematic way to augment point cloud data. Mainly, we test both strategies by sampling points from the reconstructed results and using the sampled point cloud as test-time augmented data. We show that both strategies are effective in improving accuracy. We observed that point cloud upsampling for test-time augmentation can lead to more significant performance improvement on downstream tasks such as object classification and segmentation on the ModelNet40, ShapeNet, ScanObjectNN, and SemanticKITTI datasets, especially for sparse point clouds.
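
Test-time augmentation here amounts to running the classifier on several augmented versions of a cloud and averaging the predictions; a minimal sketch follows, where `augment` is a stand-in for sampling from the implicit reconstruction or the upsampled cloud, and the random re-sampler below is only a placeholder:

```python
import torch

def test_time_augment(model, points: torch.Tensor, augment, n_views: int = 8) -> torch.Tensor:
    """Average class probabilities over several augmented versions of one cloud.

    points: (N, 3) input cloud. `augment` is any callable returning an augmented
    (N', 3) cloud; a (1, N', 3) batch layout is assumed for the classifier.
    """
    model.eval()
    with torch.no_grad():
        probs = [model(augment(points).unsqueeze(0)).softmax(dim=-1)
                 for _ in range(n_views)]
    return torch.stack(probs).mean(dim=0)

def random_resample(points: torch.Tensor, n: int = 1024) -> torch.Tensor:
    """Simplest stand-in augmentation: random re-sampling of the input cloud."""
    idx = torch.randint(0, points.size(0), (n,))
    return points[idx]
```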

Single Image Compressed Sensing MRI via a Self-Supervised Deep Denoising Approach

  • paper_url: http://arxiv.org/abs/2311.13144
  • repo_url: None
  • paper_authors: Marlon Bran Lorenzana, Feng Liu, Shekhar S. Chandra
  • for: Proposes a single-image, self-supervised (SS) compressed sensing (CS) MRI framework that avoids the difficulty of ensuring generalisability over, and access to, multiple datasets in real-world applications.
  • methods: Instead of training non-linear reconstruction models on large amounts of data as in deep learning (DL) approaches, the single-image SS CS-MRI framework applies a joint deep and sparse regularisation of CS artefacts.
  • results: Evaluated with Cartesian 1D masks on brain and knee datasets, PSNR improves by 2-4 dB on average.
    Abstract Popular methods in compressed sensing (CS) are dependent on deep learning (DL), where large amounts of data are used to train non-linear reconstruction models. However, ensuring generalisability over and access to multiple datasets is challenging to realise for real-world applications. To address these concerns, this paper proposes a single image, self-supervised (SS) CS-MRI framework that enables a joint deep and sparse regularisation of CS artefacts. The approach effectively dampens structured CS artefacts, which can be difficult to remove assuming sparse reconstruction, or relying solely on the inductive biases of CNN to produce noise-free images. Image quality is thereby improved compared to either approach alone. Metrics are evaluated using Cartesian 1D masks on a brain and knee dataset, with PSNR improving by 2-4dB on average.
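
The framework couples a deep regulariser with a sparse one; the sketch below illustrates only the sparse half of such an objective, i.e. k-space data consistency plus an L1 gradient penalty (the deep/denoiser term and the paper's exact formulation are not specified in the abstract, and the weighting `lam` is an assumption):

```python
import torch

def tv_l1(x: torch.Tensor) -> torch.Tensor:
    """L1 norm of finite differences: a simple sparsity prior on image gradients."""
    dx = (x[:, 1:] - x[:, :-1]).abs().sum()
    dy = (x[1:, :] - x[:-1, :]).abs().sum()
    return dx + dy

def cs_objective(x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor,
                 lam: float = 1e-3) -> torch.Tensor:
    """Data consistency in undersampled k-space plus a sparse regulariser.

    x: real-valued image estimate (H, W); y: measured k-space (H, W), complex;
    mask: binary Cartesian sampling mask (H, W).
    """
    kspace = torch.fft.fft2(x.to(torch.complex64))
    data_term = ((mask * kspace - mask * y).abs() ** 2).sum()
    return data_term + lam * tv_l1(x)
```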

Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.13141
  • repo_url: https://github.com/archerfmy/sd-t2i-360panoimage
  • paper_authors: Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie
  • for: A technical report on the 360-degree panoramic image generation task based on diffusion models.
  • methods: Proposes a circular blending strategy applied at both the denoising and VAE decoding stages to maintain geometric continuity across the panorama's left and right edges.
  • results: Presents two models, for the Text-to-360-panoramas and Single-Image-to-360-panoramas tasks.
    Abstract This is a technical report on the 360-degree panoramic image generation task based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic images capture the entire $360^\circ\times 180^\circ$ field of view. So the rightmost and the leftmost sides of the 360 panoramic image should be continued, which is the main challenge in this field. However, the current diffusion pipeline is not appropriate for generating such a seamless 360-degree panoramic image. To this end, we propose a circular blending strategy on both the denoising and VAE decoding stages to maintain the geometry continuity. Based on this, we present two models for \textbf{Text-to-360-panoramas} and \textbf{Single-Image-to-360-panoramas} tasks. The code has been released as an open-source project at \href{https://github.com/ArcherFMY/SD-T2I-360PanoImage}{https://github.com/ArcherFMY/SD-T2I-360PanoImage} and \href{https://www.modelscope.cn/models/damo/cv_diffusion_text-to-360panorama-image_generation/summary}{ModelScope}
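
The abstract does not spell out the circular blending weights; one simple way to realise the idea during denoising or VAE decoding is to also process a half-width-rolled copy of the latent and average, so the left/right seam of one copy falls in the interior of the other (this rolled-average scheme is an illustrative assumption, not the released implementation):

```python
import torch

def circularly_blended_step(step_fn, latent: torch.Tensor) -> torch.Tensor:
    """Run one denoising/decoding step so the panorama's horizontal seam is
    treated like any interior region.

    step_fn: any callable mapping a latent (B, C, H, W) to a latent of the same shape.
    """
    w = latent.shape[-1]
    shifted = torch.roll(latent, shifts=w // 2, dims=-1)      # move the seam to the middle
    out_a = step_fn(latent)
    out_b = torch.roll(step_fn(shifted), shifts=-(w // 2), dims=-1)
    return 0.5 * (out_a + out_b)                               # blend the two estimates
```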

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning

  • paper_url: http://arxiv.org/abs/2311.13613
  • repo_url: https://github.com/zhangxin-xd/Dataset-Pruning-TDDS
  • paper_authors: Xin Zhang, Jiawei Du, Yunsong Li, Weiying Xie, Joey Tianyi Zhou
  • for: Proposes a new dataset pruning method that constructs a coreset capable of achieving performance comparable to the original, full dataset.
  • methods: Uses a dual-depth strategy (Temporal Dual-Depth Scoring, TDDS) that broadly incorporates training dynamics across the training process and combines them with the identification of representative samples, so that models trained on the pruned set retain strong generalization.
  • results: Compared with previous state-of-the-art (SOTA) methods, the approach shows clear advantages on the CIFAR and ImageNet datasets; on CIFAR-100 it reaches 54.51% accuracy with only 10% of the training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
    Abstract Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.

Lightweight High-Speed Photography Built on Coded Exposure and Implicit Neural Representation of Videos

  • paper_url: http://arxiv.org/abs/2311.13134
  • repo_url: https://github.com/zhihongz/bdinr
  • paper_authors: Zhihong Zhang, Runzhao Yang, Jinli Suo, Yuxiao Cheng, Qionghai Dai
  • for: Capturing high-speed scenes at high resolution normally demands high bandwidth, which leads to bulky, heavy systems and limits applications on low-capacity platforms.
  • methods: Adopts a coded exposure setup that encodes a frame sequence into a single blurry snapshot and then retrieves the latent sharp video from the blurry image.
  • results: More efficient and flexible than existing methods, enabling high-speed scene capture on low-capacity platforms.
    Abstract The compact cameras recording high-speed scenes with high resolution are highly demanded, but the required high bandwidth often leads to bulky, heavy systems, which limits their applications on low-capacity platforms. Adopting a coded exposure setup to encode a frame sequence into a blurry snapshot and retrieve the latent sharp video afterward can serve as a lightweight solution. However, restoring motion from blur is quite challenging due to the high ill-posedness of motion blur decomposition, intrinsic ambiguity in motion direction, and diverse motions in natural videos. In this work, by leveraging classical coded exposure imaging technique and emerging implicit neural representation for videos, we tactfully embed the motion direction cues into the blurry image during the imaging process and develop a novel self-recursive neural network to sequentially retrieve the latent video sequence from the blurry image utilizing the embedded motion direction cues. To validate the effectiveness and efficiency of the proposed framework, we conduct extensive experiments on benchmark datasets and real-captured blurry images. The results demonstrate that our proposed framework significantly outperforms existing methods in quality and flexibility. The code for our work is available at https://github.com/zhihongz/BDINR
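
The forward model behind coded exposure is a shutter-coded sum of the sharp frame sequence; a small sketch of that snapshot formation is given below (the normalisation and the random 50% shutter code are illustrative choices, not the paper's code design):

```python
import numpy as np

def coded_exposure_snapshot(frames: np.ndarray, code: np.ndarray) -> np.ndarray:
    """Simulate the coded-exposure forward model: a binary shutter code decides
    which frames of the sequence contribute to the single blurry snapshot.

    frames: (T, H, W) sharp frame sequence; code: (T,) 0/1 shutter pattern.
    """
    code = code.astype(frames.dtype)
    snapshot = np.tensordot(code, frames, axes=([0], [0]))  # sum_t code[t] * frame[t]
    return snapshot / max(code.sum(), 1.0)

# Example: 16 frames, shutter open on a pseudo-random half of them.
rng = np.random.default_rng(0)
frames = rng.random((16, 64, 64), dtype=np.float32)
code = (rng.random(16) < 0.5).astype(np.float32)
blurry = coded_exposure_snapshot(frames, code)
```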

P2RBox: A Single Point is All You Need for Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2311.13128
  • repo_url: None
  • paper_authors: Guangming Cao, Xuehui Yu, Wenwen Yu, Xumeng Han, Xue Yang, Guorong Li, Jianbin Jiao, Zhenjun Han
  • for: Using point annotations for oriented object detection to improve annotation efficiency while retaining detection accuracy.
  • methods: Generates mask proposals from point annotations, evaluates them with a multi-instance-learning-based Inspector Module, filters high-quality masks with a Constrainer Module, and converts the selected masks into rotated box annotations for training.
  • results: The method can be combined with various fully supervised oriented object detectors and reaches 62.26% on the DOTA-v1.0 test set; this is the first attempt at training an oriented object detector with point supervision.
    Abstract Oriented object detection, a specialized subfield in computer vision, finds applications across diverse scenarios, excelling particularly when dealing with objects of arbitrary orientations. Conversely, point annotation, which treats objects as single points, offers a cost-effective alternative to rotated and horizontal bounding boxes but sacrifices performance due to the loss of size and orientation information. In this study, we introduce the P2RBox network, which leverages point annotations and a mask generator to create mask proposals, followed by filtration through our Inspector Module and Constrainer Module. This process selects high-quality masks, which are subsequently converted into rotated box annotations for training a fully supervised detector. Specifically, we've thoughtfully crafted an Inspector Module rooted in multi-instance learning principles to evaluate the semantic score of masks. We've also proposed a more robust mask quality assessment in conjunction with the Constrainer Module. Furthermore, we've introduced a Symmetry Axis Estimation (SAE) Module inspired by the spectral theorem for symmetric matrices to transform the top-performing mask proposal into rotated bounding boxes. P2RBox performs well with three fully supervised rotated object detectors: RetinaNet, Rotated FCOS, and Oriented R-CNN. By combining with Oriented R-CNN, P2RBox achieves 62.26% on DOTA-v1.0 test dataset. As far as we know, this is the first attempt at training an oriented object detector with point supervision.
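
The SAE module invokes the spectral theorem for symmetric matrices to turn the top mask proposal into a rotated box; in the same spirit, a generic PCA-style conversion from a binary mask to a rotated box looks like the following (this is not the paper's exact estimator):

```python
import numpy as np

def mask_to_rotated_box(mask: np.ndarray):
    """Turn a binary mask into a rotated box via the principal axes of its pixels.

    The eigenvectors of the symmetric covariance matrix of the foreground
    coordinates give the box orientation; projections onto them give the extents.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)    # (N, 2), (x, y)
    center = pts.mean(axis=0)
    cov = np.cov((pts - center).T)                         # 2x2 symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(cov)                 # spectral decomposition
    axis = eigvecs[:, np.argmax(eigvals)]                  # dominant symmetry axis
    angle = float(np.arctan2(axis[1], axis[0]))
    rot = np.stack([axis, np.array([-axis[1], axis[0]])])  # rows: major, minor axis
    proj = (pts - center) @ rot.T
    w, h = proj.max(axis=0) - proj.min(axis=0)
    return center[0], center[1], float(w), float(h), angle  # (cx, cy, w, h, theta)
```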

DAE-Net: Deforming Auto-Encoder for fine-grained shape co-segmentation

  • paper_url: http://arxiv.org/abs/2311.13125
  • repo_url: https://github.com/czq142857/dae-net
  • paper_authors: Zhiqin Chen, Qimin Chen, Hang Zhou, Hao Zhang
  • for: Developing an unsupervised 3D shape co-segmentation method that learns a set of deformable part templates from a shape collection.
  • methods: The network is a branched autoencoder whose CNN encoder takes a voxel shape as input and produces per-part transformation matrices, latent codes, and part existence scores, while the decoder outputs point occupancies that define the reconstruction loss.
  • results: The network achieves unsupervised 3D shape co-segmentation with fine-grained, compact, and meaningful parts that are consistent across diverse shapes; extensive experiments on the ShapeNet Part dataset, DFAUST, and an animal subset of Objaverse show significant improvements over prior methods.
    Abstract We present an unsupervised 3D shape co-segmentation method which learns a set of deformable part templates from a shape collection. To accommodate structural variations in the collection, our network composes each shape by a selected subset of template parts which are affine-transformed. To maximize the expressive power of the part templates, we introduce a per-part deformation network to enable the modeling of diverse parts with substantial geometry variations, while imposing constraints on the deformation capacity to ensure fidelity to the originally represented parts. We also propose a training scheme to effectively overcome local minima. Architecturally, our network is a branched autoencoder, with a CNN encoder taking a voxel shape as input and producing per-part transformation matrices, latent codes, and part existence scores, and the decoder outputting point occupancies to define the reconstruction loss. Our network, coined DAE-Net for Deforming Auto-Encoder, can achieve unsupervised 3D shape co-segmentation that yields fine-grained, compact, and meaningful parts that are consistent across diverse shapes. We conduct extensive experiments on the ShapeNet Part dataset, DFAUST, and an animal subset of Objaverse to show superior performance over prior methods.

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

  • paper_url: http://arxiv.org/abs/2311.13120
  • repo_url: None
  • paper_authors: Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, Yuan Xie
  • for: The paper is written for recognizing scene text in the wild, which frequently encounters challenges such as domain variations, font diversity, and shape deformations.
  • methods: The paper proposes a novel training strategy called in-context training to generate context-rich scene text sequences, which are used to train a scene text recognizer.
  • results: The proposed method, called E$^2$STR, achieves effective in-context learning capabilities in scene text recognition and outperforms state-of-the-art approaches on public benchmarks, even with a regular-sized model.
    Abstract Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.

Automated Measurement of Pericoronary Adipose Tissue Attenuation and Volume in CT Angiography

  • paper_url: http://arxiv.org/abs/2311.13100
  • repo_url: None
  • paper_authors: Andrew M. Nguyen, Tejas Sudharshan Mathai, Liangchen Liu, Jianfei Liu, Ronald M. Summers
  • for: Developing a fully automated method for measuring pericoronary adipose tissue (PCAT) to better assess the risk of coronary artery disease (CAD).
  • methods: Trains a 3D full-resolution nnUNet to segment the two coronary arteries (RCA and LCA) and then automatically measures PCAT attenuation and volume in the surrounding arterial regions.
  • results: The automated method measures PCAT attenuation and volume accurately at both coronary arteries simultaneously, suggesting that automated PCAT measurement holds promise as a biomarker for inflammation and cardiac disease.
    Abstract Pericoronary adipose tissue (PCAT) is the deposition of fat in the vicinity of the coronary arteries. It is an indicator of coronary inflammation and associated with coronary artery disease. Non-invasive coronary CT angiography (CCTA) is presently used to obtain measures of the thickness, volume, and attenuation of fat deposition. However, prior works solely focus on measuring PCAT using semi-automated approaches at the right coronary artery (RCA) over the left coronary artery (LCA). In this pilot work, we developed a fully automated approach for the measurement of PCAT mean attenuation and volume in the region around both coronary arteries. First, we used a large subset of patients from the public ImageCAS dataset (n = 735) to train a 3D full resolution nnUNet to segment LCA and RCA. Then, we automatically measured PCAT in the surrounding arterial regions. We evaluated our method on a held-out test set of patients (n = 183) from the same dataset. A mean Dice score of 83% and PCAT attenuation of -73.81 $\pm$ 12.69 HU was calculated for the RCA, while a mean Dice score of 81% and PCAT attenuation of -77.51 $\pm$ 7.94 HU was computed for the LCA. To the best of our knowledge, we are the first to develop a fully automated method to measure PCAT attenuation and volume at both the RCA and LCA. Our work underscores how automated PCAT measurement holds promise as a biomarker for identification of inflammation and cardiac disease.
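
Given the nnUNet artery segmentations, PCAT statistics reduce to measuring fat attenuation and volume in a thin shell around each vessel; the sketch below shows one plausible post-processing step (the 3 mm shell radius and the -190 to -30 HU adipose window are assumptions, not the paper's protocol):

```python
import numpy as np
from scipy import ndimage

def pcat_stats(ct_hu: np.ndarray, artery_mask: np.ndarray,
               spacing_mm: tuple, radius_mm: float = 3.0,
               fat_range=(-190, -30)):
    """Measure mean attenuation (HU) and volume (mm^3) of fat around a segmented artery.

    ct_hu: CT volume in Hounsfield units; artery_mask: binary vessel segmentation;
    spacing_mm: voxel spacing per axis. Illustrative post-processing only.
    """
    radius_vox = [max(1, int(round(radius_mm / s))) for s in spacing_mm]
    dilated = ndimage.binary_dilation(artery_mask, iterations=max(radius_vox))
    ring = dilated & ~artery_mask.astype(bool)                 # shell around the vessel
    fat = ring & (ct_hu >= fat_range[0]) & (ct_hu <= fat_range[1])
    voxel_vol_mm3 = float(np.prod(spacing_mm))
    mean_hu = float(ct_hu[fat].mean()) if fat.any() else float("nan")
    volume_mm3 = float(fat.sum()) * voxel_vol_mm3
    return mean_hu, volume_mm3
```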

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise

  • paper_url: http://arxiv.org/abs/2311.13091
  • repo_url: https://github.com/liuyixin-louis/stable-unlearnable-example
  • paper_authors: Yixin Liu, Kaidi Xu, Xun Chen, Lichao Sun
  • for: Strengthening privacy protection so that publicly released image datasets cannot be exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes.
  • methods: Builds on unlearnable examples, a poisoning technique that adds imperceptible noise to degrade a model's generalization, and improves robustness to adversarial training by introducing stable error-minimizing noise (SEM), which trains the defensive noise against random perturbations instead of time-consuming adversarial perturbations.
  • results: SEM achieves better efficiency and stability, setting a new state-of-the-art performance on CIFAR-10, CIFAR-100, and the ImageNet subset in terms of both effectiveness and efficiency.
    Abstract The open source of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these open-source image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of public data, a poisoning-based technique, the unlearnable example, is proposed to significantly degrade the generalization performance of models by adding a kind of imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model. However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the effect of enhancement in the surrogate model or the defensive noise. Observing that simply removing the adversarial noise on the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we found a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost the robust unlearnable example, we introduce stable error-minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through extensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset in terms of both effectiveness and efficiency. The code is available at https://github.com/liuyixin-louis/Stable-Unlearnable-Example.
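
Stable error-minimizing noise optimises the defensive perturbation against random, rather than adversarial, perturbations of the input; a heavily simplified sketch of that inner loop is below (step sizes, budgets, and the signed-gradient update are illustrative choices, not the authors' recipe):

```python
import torch
import torch.nn.functional as F

def train_defensive_noise(model, x: torch.Tensor, y: torch.Tensor,
                          eps: float = 8 / 255, sigma: float = 2 / 255,
                          steps: int = 20, lr: float = 1 / 255) -> torch.Tensor:
    """Optimise a per-sample noise delta that minimises the surrogate model's loss
    on x + delta + u, where u is a random perturbation rather than an adversarial one."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        u = torch.empty_like(x).uniform_(-sigma, sigma)      # random perturbation
        loss = F.cross_entropy(model(torch.clamp(x + delta + u, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= lr * grad.sign()                        # error-minimizing step
            delta.clamp_(-eps, eps)                          # keep the noise imperceptible
    return delta.detach()
```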

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

  • paper_url: http://arxiv.org/abs/2311.13073
  • repo_url: https://github.com/ai-forever/kandinskyvideo
  • paper_authors: Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
  • for: Proposes a two-stage latent diffusion text-to-video generation architecture, built on a text-to-image diffusion model, for producing high-quality videos.
  • methods: The first stage synthesizes keyframes that set the storyline of the video, and the second stage generates interpolation frames to make the motion of the scene and objects smooth; several temporal conditioning approaches for keyframe generation are compared, and different configurations of the MoVQ-based video decoding scheme are evaluated.
  • results: Using separate temporal blocks rather than temporal layers is better in terms of video-generation quality metrics and human preference; the interpolation model's design substantially reduces computational cost compared with other masked frame interpolation approaches; overall the pipeline achieves top-2 scores among existing solutions and top-1 among open-source ones (CLIPSIM = 0.2976, FVD = 433.054).
    Abstract Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/

FuseNet: Self-Supervised Dual-Path Network for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.13069
  • repo_url: https://github.com/xmindflow/fusenet
  • paper_authors: Amirhossein Kazerouni, Sanaz Karimijafarbigloo, Reza Azad, Yury Velichko, Ulas Bagci, Dorit Merhof
  • For: Semantic segmentation normally relies on labor-intensive annotated datasets; this method removes the need for manual annotation.
  • Methods: Proposes a dual-stream, self-supervised framework that uses augmented images in place of manual annotation, exploiting shared semantic dependencies between original and augmented images to form a clustering space; a cross-modal fusion technique extends the principles of CLIP by replacing textual data with augmented images, and an edge refinement loss improves edge alignment and spatial consistency between neighbouring pixels.
  • Results: Experiments show that the method achieves high-accuracy semantic segmentation and strengthens the model's discriminative ability.
    Abstract Semantic segmentation, a crucial task in computer vision, often relies on labor-intensive and costly annotated datasets for training. In response to this challenge, we introduce FuseNet, a dual-stream framework for self-supervised semantic segmentation that eliminates the need for manual annotation. FuseNet leverages the shared semantic dependencies between the original and augmented images to create a clustering space, effectively assigning pixels to semantically related clusters, and ultimately generating the segmentation map. Additionally, FuseNet incorporates a cross-modal fusion technique that extends the principles of CLIP by replacing textual data with augmented images. This approach enables the model to learn complex visual representations, enhancing robustness against variations similar to CLIP's text invariance. To further improve edge alignment and spatial consistency between neighboring pixels, we introduce an edge refinement loss. This loss function considers edge information to enhance spatial coherence, facilitating the grouping of nearby pixels with similar visual features. Extensive experiments on skin lesion and lung segmentation datasets demonstrate the effectiveness of our method. \href{https://github.com/xmindflow/FuseNet}{Codebase.}