cs.CV - 2023-10-27

Using convolutional neural networks for stereological characterization of 3D hetero-aggregates based on synthetic STEM data

  • paper_url: http://arxiv.org/abs/2310.18523
  • repo_url: None
  • paper_authors: Lukas Fuchs, Tom Kirstein, Christoph Mahr, Orkun Furat, Valentin Baric, Andreas Rosenauer, Lutz Maedler, Volker Schmidt
  • for: This paper aims to characterize the 3D structure of hetero-aggregates in order to derive process-structure or structure-property relationships.
  • methods: It combines machine learning with spatial stochastic modeling, where a parametric stochastic 3D model and a physics-based STEM simulator generate synthetic training data, avoiding costly synthesis experiments and 3D imaging.
  • results: Convolutional neural networks trained on this synthetic database predict the parameters of the 3D model, and hence 3D structures, from 2D STEM images; an error analysis evaluates the prediction accuracy with respect to structural descriptors such as the hetero-coordination number.
    Abstract The structural characterization of hetero-aggregates in 3D is of great interest, e.g., for deriving process-structure or structure-property relationships. However, since 3D imaging techniques are often difficult to perform as well as time and cost intensive, a characterization of hetero-aggregates based on 2D image data is desirable, but often non-trivial. To overcome the issues of characterizing 3D structures from 2D measurements, a method is presented that relies on machine learning combined with methods of spatial stochastic modeling, where the latter are utilized for the generation of synthetic training data. This kind of training data has the advantage that time-consuming experiments for the synthesis of differently structured materials followed by their 3D imaging can be avoided. More precisely, a parametric stochastic 3D model is presented, from which a wide spectrum of virtual hetero-aggregates can be generated. Additionally, the virtual structures are passed to a physics-based simulation tool in order to generate virtual scanning transmission electron microscopy (STEM) images. The preset parameters of the 3D model together with the simulated STEM images serve as a database for the training of convolutional neural networks, which can be used to determine the parameters of the underlying 3D model and, consequently, to predict 3D structures of hetero-aggregates from 2D STEM images. Furthermore, an error analysis is performed to evaluate the prediction power of the trained neural networks with respect to structural descriptors, e.g. the hetero-coordination number.
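A minimal sketch of the central idea, a CNN regressing the preset parameters of the stochastic 3D model from a simulated 2D STEM-like projection, is given below. The architecture, parameter count, and image size are hypothetical placeholders rather than the network used in the paper.

```python
import torch
import torch.nn as nn

class ParamRegressor(nn.Module):
    """Toy CNN that maps a single-channel 2D projection image to a
    fixed-size vector of stochastic-model parameters (hypothetical sizes)."""
    def __init__(self, num_params: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_params)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# One training step on (simulated STEM image, preset model parameters) pairs.
model = ParamRegressor(num_params=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.rand(8, 1, 128, 128)   # stand-in for simulated STEM images
targets = torch.rand(8, 4)            # stand-in for preset 3D-model parameters
optimizer.zero_grad()
loss = loss_fn(model(images), targets)
loss.backward()
optimizer.step()
```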

Learning to recognize occluded and small objects with partial inputs

  • paper_url: http://arxiv.org/abs/2310.18517
  • repo_url: https://github.com/hasibzunair/msl-recognition
  • paper_authors: Hasib Zunair, A. Ben Hamza
  • for: Recognizing multiple objects in an image, which becomes even harder when the objects are small or occluded.
  • methods: The authors propose Masked Supervised Learning (MSL), a single-stage, model-agnostic learning paradigm that learns context-based representations through a masked branch and models label co-occurrence via label consistency.
  • results: Experiments show that MSL is simple, broadly applicable, and competitive with previous state-of-the-art methods on standard multi-label image recognition benchmarks; MSL is also robust to random masking and effective at recognizing non-masked objects. Code and pretrained models are available on GitHub.
    Abstract Recognizing multiple objects in an image is challenging due to occlusions, and becomes even more so when the objects are small. While promising, existing multi-label image recognition models do not explicitly learn context-based representations, and hence struggle to correctly recognize small and occluded objects. Intuitively, recognizing occluded objects requires knowledge of partial input, and hence context. Motivated by this intuition, we propose Masked Supervised Learning (MSL), a single-stage, model-agnostic learning paradigm for multi-label image recognition. The key idea is to learn context-based representations using a masked branch and to model label co-occurrence using label consistency. Experimental results demonstrate the simplicity, applicability and more importantly the competitive performance of MSL against previous state-of-the-art methods on standard multi-label image recognition benchmarks. In addition, we show that MSL is robust to random masking and demonstrate its effectiveness in recognizing non-masked objects. Code and pretrained models are available on GitHub.
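A rough sketch of the two ingredients named in the abstract (a masked branch for context-based representations, and a consistency term tying the masked and unmasked predictions together) might look as follows. The masking scheme, loss weights, and classifier are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_patch_mask(images, patch=32, drop_prob=0.3):
    """Zero out random square patches so the model must rely on context."""
    b, c, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > drop_prob).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return images * keep

def msl_style_loss(model, images, labels, consistency_weight=1.0):
    """Multi-label BCE on full and masked views plus a consistency term
    encouraging the masked branch to match the full-image predictions."""
    logits_full = model(images)
    logits_masked = model(random_patch_mask(images))
    bce = F.binary_cross_entropy_with_logits(logits_full, labels) \
        + F.binary_cross_entropy_with_logits(logits_masked, labels)
    consistency = F.mse_loss(torch.sigmoid(logits_masked),
                             torch.sigmoid(logits_full).detach())
    return bce + consistency_weight * consistency

# Usage with any multi-label classifier producing one logit per class.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 80))
images = torch.rand(4, 3, 224, 224)
labels = torch.randint(0, 2, (4, 80)).float()
loss = msl_style_loss(model, images, labels)
```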

GPT-4 Vision on Medical Image Classification – A Case Study on COVID-19 Dataset

  • paper_url: http://arxiv.org/abs/2310.18498
  • repo_url: None
  • paper_authors: Ruibo Chen, Tianyi Xiong, Yihan Wu, Guodong Liu, Zhengmian Hu, Lichang Chen, Yanshuo Chen, Chenxi Liu, Heng Huang
  • for: This technical report explores the application of GPT-4 Vision (GPT-4V) to COVID-19 image classification, aiming to enhance the diagnostic process.
  • methods: GPT-4V is applied to COVID-19 images using in-context learning.
  • results: In-context learning with GPT-4V improves classification accuracy, suggesting its potential value for COVID-19 image classification.
    Abstract This technical report delves into the application of GPT-4 Vision (GPT-4V) in the nuanced realm of COVID-19 image classification, leveraging the transformative potential of in-context learning to enhance diagnostic processes.

Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses

  • paper_url: http://arxiv.org/abs/2310.18494
  • repo_url: https://github.com/didsr/msynth-release
  • paper_authors: Elena Sizikova, Niloufar Saharkhiz, Diksha Sharma, Miguel Lago, Berkman Sahiner, Jana G. Delfino, Aldo Badano
  • for: Generating evidence on the safety and efficacy of AI-enabled medical devices requires evaluating AI models on diverse patient populations, some of which may not be readily available.
  • methods: The authors propose an in silico evaluation approach in which stochastic digital models of human anatomy, with and without pathology, are imaged by a digital replica of the acquisition system to generate realistic synthetic image datasets.
  • results: They release M-SYNTH, a dataset of cohorts with four breast fibroglandular density distributions imaged at different exposure levels using Monte Carlo x-ray simulations. AI model performance decreases with increasing breast density and increases with higher mass density; as exposure levels decrease, performance drops, with the best performance achieved at exposure levels below the nominal recommended dose for the breast type.
    Abstract To generate evidence regarding the safety and efficacy of artificial intelligence (AI) enabled medical devices, AI models need to be evaluated on a diverse population of patient cases, some of which may not be readily available. We propose an evaluation approach for testing medical imaging AI models that relies on in silico imaging pipelines in which stochastic digital models of human anatomy (in object space) with and without pathology are imaged using a digital replica imaging acquisition system to generate realistic synthetic image datasets. Here, we release M-SYNTH, a dataset of cohorts with four breast fibroglandular density distributions imaged at different exposure levels using Monte Carlo x-ray simulations with the publicly available Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) toolkit. We utilize the synthetic dataset to analyze AI model performance and find that model performance decreases with increasing breast density and increases with higher mass density, as expected. As exposure levels decrease, AI model performance drops with the highest performance achieved at exposure levels lower than the nominal recommended dose for the breast type.

Exploring Shape Embedding for Cloth-Changing Person Re-Identification via 2D-3D Correspondences

  • paper_url: http://arxiv.org/abs/2310.18438
  • repo_url: None
  • paper_authors: Yubin Wang, Huimin Yu, Yuming Yan, Shuyi Song, Biyang Liu, Yichong Lu
  • for: This paper tackles the Cloth-Changing Person Re-Identification (CC-ReID) problem, i.e., re-identifying people who appear in different clothing across images.
  • methods: It proposes a new shape embedding paradigm, Continuous Surface Correspondence Learning (CSCL), which establishes continuous correspondences between the 2D image plane and a canonical 3D body surface via pixel-to-vertex classification, aligning a person image to the surface of a 3D human model and yielding pixel-wise surface embeddings.
  • results: Experiments show that CSCL remarkably enhances the model's global understanding of human body shape and achieves excellent results on both cloth-changing and cloth-consistent ReID benchmarks.
    Abstract Cloth-Changing Person Re-Identification (CC-ReID) is a common and realistic problem since fashion constantly changes over time and people's aesthetic preferences are not set in stone. While most existing cloth-changing ReID methods focus on learning cloth-agnostic identity representations from coarse semantic cues (e.g. silhouettes and part segmentation maps), they neglect the continuous shape distributions at the pixel level. In this paper, we propose Continuous Surface Correspondence Learning (CSCL), a new shape embedding paradigm for cloth-changing ReID. CSCL establishes continuous correspondences between a 2D image plane and a canonical 3D body surface via pixel-to-vertex classification, which naturally aligns a person image to the surface of a 3D human model and simultaneously obtains pixel-wise surface embeddings. We further extract fine-grained shape features from the learned surface embeddings and then integrate them with global RGB features via a carefully designed cross-modality fusion module. The shape embedding paradigm based on 2D-3D correspondences remarkably enhances the model's global understanding of human body shape. To promote the study of ReID under clothing change, we construct 3D Dense Persons (DP3D), which is the first large-scale cloth-changing ReID dataset that provides densely annotated 2D-3D correspondences and a precise 3D mesh for each person image, while containing diverse cloth-changing cases over all four seasons. Experiments on both cloth-changing and cloth-consistent ReID benchmarks validate the effectiveness of our method.

Always Clear Days: Degradation Type and Severity Aware All-In-One Adverse Weather Removal

  • paper_url: http://arxiv.org/abs/2310.18293
  • repo_url: None
  • paper_authors: Yu-Wei Chen, Soo-Chang Pei
  • for: This work proposes a unified model for restoring images degraded by multiple adverse weather conditions, addressing two challenges: discovering and handling the multi-domain target distribution formed by multiple weather conditions, and designing efficient and effective operations for different degradation types.
  • methods: Inspired by the inter- and intra-domain adaptation literature, the proposed UtilityIR model introduces a Marginal Quality Ranking Loss (MQRL) and uses a Contrastive Loss (CL) to guide weather severity and type extraction, together with techniques such as Multi-Head Cross Attention (MHCA) and Local-Global Adaptive Instance Normalization (LG-AdaIN) to efficiently restore spatially varying weather degradation.
  • results: The model significantly outperforms state-of-the-art methods on different weather restoration tasks, both subjectively and objectively, while using fewer parameters; it can also restore images with unseen combinations of degradations and modulate the restoration level.
    Abstract All-in-one adverse weather removal is an emerging topic in image restoration, which aims to restore multiple weather degradations in a unified model, and the challenges are twofold. First, discovering and handling the property of multi-domain in the target distribution formed by multiple weather conditions. Second, designing efficient and effective operations for different degradation types. To address this problem, most prior works focus on the multi-domain caused by weather type. Inspired by inter- and intra-domain adaptation literature, we observed that not only weather type but also weather severity introduces multi-domain within each weather type domain, which is ignored by previous methods and further limits their performance. To this end, we proposed a degradation type and severity aware model, called UtilityIR, for blind all-in-one bad weather image restoration. To extract weather information from a single image, we proposed a novel Marginal Quality Ranking Loss (MQRL) and utilized Contrastive Loss (CL) to guide weather severity and type extraction, and leverage a bag of novel techniques such as Multi-Head Cross Attention (MHCA) and Local-Global Adaptive Instance Normalization (LG-AdaIN) to efficiently restore spatially varying weather degradation. The proposed method can significantly outperform the SOTA methods subjectively and objectively on different weather restoration tasks with a large margin, and enjoys fewer model parameters. The proposed method can even restore images with unseen combinations of multiple degradations, and modulate the restoration level. Implementation code will be available at https://github.com/fordevoted/UtilityIR.
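LG-AdaIN is described as a local-global variant of adaptive instance normalization. The plain AdaIN building block it presumably extends can be sketched as follows; the local-global split and the weather-conditioned statistics are specific to the paper and are not reproduced here.

```python
import torch

def adain(content_feat, target_mean, target_std, eps=1e-5):
    """Adaptive instance normalization: re-normalize the per-channel statistics
    of `content_feat` (B, C, H, W) to a target mean/std, which could be
    predicted from a degradation-type/severity embedding."""
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - mean) / std
    return normalized * target_std + target_mean

feat = torch.rand(2, 64, 32, 32)
# Hypothetical per-channel statistics predicted from a weather embedding.
target_mean = torch.zeros(2, 64, 1, 1)
target_std = torch.ones(2, 64, 1, 1)
out = adain(feat, target_mean, target_std)
```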

Heterogeneous Federated Learning with Group-Aware Prompt Tuning

  • paper_url: http://arxiv.org/abs/2310.18285
  • repo_url: None
  • paper_authors: Wenlong Deng, Christos Thrampoulidis, Xiaoxiao Li
  • for: This paper explores the use of Transformer models in federated learning (FL), focusing on heterogeneous scenarios in which individual clients hold diverse local datasets.
  • methods: Pre-trained Transformers are adapted with an efficient prompt-tuning strategy that learns both shared and group prompts, acquiring universal and group-specific knowledge simultaneously; a prompt selection module assigns personalized group prompts to each input, aligning the global model with each client's data distribution.
  • results: A single global model automatically adapts to the local data distributions of different clients without local fine-tuning, bridging the gap between global and personalized local models and surpassing alternatives that cannot adapt to previously unseen clients. Effectiveness is validated through extensive experiments and ablation studies.
    Abstract Transformers have achieved remarkable success in various machine-learning tasks, prompting their widespread adoption. In this paper, we explore their application in the context of federated learning (FL), with a particular focus on heterogeneous scenarios where individual clients possess diverse local datasets. To meet the computational and communication demands of FL, we leverage pre-trained Transformers and use an efficient prompt-tuning strategy. Our strategy introduces the concept of learning both shared and group prompts, enabling the acquisition of universal knowledge and group-specific knowledge simultaneously. Additionally, a prompt selection module assigns personalized group prompts to each input, aligning the global model with the data distribution of each client. This approach allows us to train a single global model that can automatically adapt to various local client data distributions without requiring local fine-tuning. In this way, our proposed method effectively bridges the gap between global and personalized local models in Federated Learning and surpasses alternative approaches that lack the capability to adapt to previously unseen clients. The effectiveness of our approach is rigorously validated through extensive experimentation and ablation studies.

FOUND: Foot Optimization with Uncertain Normals for Surface Deformation Using Synthetic Data

  • paper_url: http://arxiv.org/abs/2310.18279
  • repo_url: None
  • paper_authors: Oliver Boyne, Gwangbin Bae, James Charles, Roberto Cipolla
  • for: This paper targets surface reconstruction from few-view images, focusing on the human foot.
  • methods: It combines an uncertainty-aware surface normal predictor, trained on SynFoot (a synthetic dataset of 50,000 photorealistic foot images with ground-truth normals and keypoints), with an optimization scheme that fits a generative foot model to a series of images.
  • results: The normal predictor significantly outperforms off-the-shelf alternatives on real images, and the optimization scheme outperforms state-of-the-art photogrammetry pipelines, especially in the few-view setting.
    Abstract Surface reconstruction from multi-view images is a challenging task, with solutions often requiring a large number of sampled images with high overlap. We seek to develop a method for few-view reconstruction, for the case of the human foot. To solve this task, we must extract rich geometric cues from RGB images, before carefully fusing them into a final 3D object. Our FOUND approach tackles this, with 4 main contributions: (i) SynFoot, a synthetic dataset of 50,000 photorealistic foot images, paired with ground truth surface normals and keypoints; (ii) an uncertainty-aware surface normal predictor trained on our synthetic dataset; (iii) an optimization scheme for fitting a generative foot model to a series of images; and (iv) a benchmark dataset of calibrated images and high resolution ground truth geometry. We show that our normal predictor outperforms all off-the-shelf equivalents significantly on real images, and our optimization scheme outperforms state-of-the-art photogrammetry pipelines, especially for a few-view setting. We release our synthetic dataset and baseline 3D scans to the research community.

LipSim: A Provably Robust Perceptual Similarity Metric

  • paper_url: http://arxiv.org/abs/2310.18274
  • repo_url: https://github.com/saraghazanfari/lipsim
  • paper_authors: Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
  • for: This paper is written for researchers and practitioners interested in developing and applying perceptual similarity metrics, particularly those concerned with the vulnerability of these metrics to adversarial attacks.
  • methods: The paper uses an ensemble of ViT-based feature extractors and proposes a framework for training a robust perceptual similarity metric called LipSim, which leverages 1-Lipschitz neural networks as the backbone and provides provable guarantees.
  • results: The paper demonstrates the vulnerability of state-of-the-art perceptual similarity metrics to adversarial attacks and presents a comprehensive set of experiments showing the performance of LipSim in terms of natural and certified scores, as well as on the image retrieval application.
    Abstract Recent years have seen growing interest in developing and applying perceptual similarity metrics. Research has shown the superiority of perceptual metrics over pixel-wise metrics in aligning with human perception and serving as a proxy for the human visual system. On the other hand, as perceptual metrics rely on neural networks, there is a growing concern regarding their resilience, given the established vulnerability of neural networks to adversarial attacks. It is indeed logical to infer that perceptual metrics may inherit both the strengths and shortcomings of neural networks. In this work, we demonstrate the vulnerability of state-of-the-art perceptual similarity metrics based on an ensemble of ViT-based feature extractors to adversarial attacks. We then propose a framework to train a robust perceptual similarity metric called LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging 1-Lipschitz neural networks as the backbone, LipSim provides guarded areas around each data point and certificates for all perturbations within an $\ell_2$ ball. Finally, a comprehensive set of experiments shows the performance of LipSim in terms of natural and certified scores and on the image retrieval application. The code is available at https://github.com/SaraGhazanfari/LipSim.
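The provable guarantee rests on the 1-Lipschitz backbone: an L2 perturbation of norm at most eps moves the embedding by at most eps, so a nearest-reference decision cannot flip whenever the gap between the two smallest distances exceeds 2*eps. A minimal check of that argument, written independently of the released code:

```python
import numpy as np

def certified_top1(query_embedding, reference_embeddings, eps):
    """For a 1-Lipschitz feature map f, ||f(x + d) - f(x)|| <= ||d|| <= eps.
    By the triangle inequality, the nearest reference cannot change under any
    such perturbation if the gap between the two smallest distances exceeds
    2 * eps. Returns (index of nearest reference, whether it is certified)."""
    dists = np.linalg.norm(reference_embeddings - query_embedding, axis=1)
    order = np.argsort(dists)
    top1, top2 = dists[order[0]], dists[order[1]]
    return order[0], (top2 - top1) > 2 * eps

rng = np.random.default_rng(0)
refs = rng.normal(size=(10, 128))               # embeddings of reference images
query = refs[3] + 0.01 * rng.normal(size=128)   # embedding of a query image
idx, is_certified = certified_top1(query, refs, eps=0.05)
```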

PlantPlotGAN: A Physics-Informed Generative Adversarial Network for Plant Disease Prediction

  • paper_url: http://arxiv.org/abs/2310.18268
  • repo_url: None
  • paper_authors: Felipe A. Lopes, Vasit Sagan, Flavio Esposito
  • for: Monitoring plantations is crucial for crop management and healthy harvests, especially for detecting plant diseases; UAV-collected multispectral imagery supports this monitoring but yields limited disease-relevant data.
  • methods: The authors propose PlantPlotGAN, a physics-informed generative adversarial network that creates synthetic multispectral plot images with realistic vegetation indices, which serve as a proxy for disease detection.
  • results: The synthetic imagery outperforms state-of-the-art methods on the Fréchet inception distance, and prediction models trained on synthetic plus real imagery achieve higher accuracy for earlier plant disease detection than models trained on real imagery alone.
    Abstract Monitoring plantations is crucial for crop management and producing healthy harvests. Unmanned Aerial Vehicles (UAVs) have been used to collect multispectral images that aid in this monitoring. However, given the number of hectares to be monitored and the limitations of flight, plant disease signals become visually clear only in the later stages of plant growth and only if the disease has spread throughout a significant portion of the plantation. This limited amount of relevant data hampers the prediction models, as the algorithms struggle to generalize patterns with unbalanced or unrealistic augmented datasets effectively. To address this issue, we propose PlantPlotGAN, a physics-informed generative model capable of creating synthetic multispectral plot images with realistic vegetation indices. These indices served as a proxy for disease detection and were used to evaluate if our model could help increase the accuracy of prediction models. The results demonstrate that the synthetic imagery generated from PlantPlotGAN outperforms state-of-the-art methods regarding the Fr\'echet inception distance. Moreover, prediction models achieve higher accuracy metrics when trained with synthetic and original imagery for earlier plant disease detection compared to the training processes based solely on real imagery.
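The vegetation indices used as a proxy for disease detection are standard band ratios; NDVI, for instance, is computed from the near-infrared and red bands of a multispectral plot. A generic computation follows (the band ordering is an assumption and varies per sensor):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index, values in [-1, 1]."""
    return (nir - red) / (nir + red + eps)

# Toy multispectral plot: channel 0 = red, channel 3 = near-infrared (assumed order).
plot = np.random.rand(4, 64, 64).astype(np.float32)
index_map = ndvi(nir=plot[3], red=plot[0])
```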

A Self-Supervised Approach to Land Cover Segmentation

  • paper_url: http://arxiv.org/abs/2310.18251
  • repo_url: None
  • paper_authors: Charles Moore, Dakota Hester
  • for: This paper presents a self-supervised method for land cover segmentation of very high resolution (VHR) remote sensing imagery that does not require high-quality ground-truth labels.
  • methods: The method uses a frozen pre-trained ViT backbone transferred from DINO within a STEGO architecture, fine-tuned on a custom dataset of VHR satellite imagery.
  • results: After only 10 epochs of fine-tuning, an accuracy of roughly 52% was observed across 5 samples, indicating the feasibility of self-supervised models for automated labelling of VHR land use/land cover maps.
    Abstract Land use/land cover change (LULC) maps are integral resources in earth science and agricultural research. Due to the nature of such maps, the creation of LULC maps is often constrained by the time and human resources necessary to accurately annotate satellite imagery and remote sensing data. While computer vision models that perform semantic segmentation to create detailed labels from such data are not uncommon, little research has been done on self-supervised and unsupervised approaches to labelling LULC maps without the use of ground-truth masks. Here, we demonstrate a self-supervised method of land cover segmentation that has no need for high-quality ground truth labels. The proposed deep learning employs a frozen pre-trained ViT backbone transferred from DINO in a STEGO architecture and is fine-tuned using a custom dataset consisting of very high resolution (VHR) satellite imagery. After only 10 epochs of fine-tuning, an accuracy of roughly 52% was observed across 5 samples, signifying the feasibility of self-supervised models for the automated labelling of VHR LULC maps.

Generative AI Model for Artistic Style Transfer Using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.18237
  • repo_url: None
  • paper_authors: Jonayet Miah, Duc M Cao, Md Abu Sayed, Md. Sabbirul Haque
  • for: This paper gives an overview of a convolutional neural network (CNN) based technique for artistic style transfer, fusing the content of one image with the artistic style of another to create unique visual compositions.
  • methods: Deep image representations learned by CNNs are used to separate and manipulate image content and style; content and style losses are computed and optimized to synthesize high-quality images that combine both harmoniously.
  • results: Experimental results demonstrate the effectiveness and versatility of the approach across different styles and content.
    Abstract Artistic style transfer, a captivating application of generative artificial intelligence, involves fusing the content of one image with the artistic style of another to create unique visual compositions. This paper presents a comprehensive overview of a novel technique for style transfer using Convolutional Neural Networks (CNNs). By leveraging deep image representations learned by CNNs, we demonstrate how to separate and manipulate image content and style, enabling the synthesis of high-quality images that combine content and style in a harmonious manner. We describe the methodology, including content and style representations, loss computation, and optimization, and showcase experimental results highlighting the effectiveness and versatility of the approach across different styles and content.
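The Gatys-style formulation summarized above optimizes a generated image so that its deep CNN features match the content image while the Gram matrices of its features match the style image. A condensed sketch of those two losses, with illustrative layer choices and weights:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-by-channel feature correlations of a feature map (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_transfer_loss(gen_feats, content_feats, style_feats,
                        content_weight=1.0, style_weight=1e3):
    """gen/content/style_feats: lists of feature maps from matched CNN layers."""
    content_loss = F.mse_loss(gen_feats[-1], content_feats[-1])
    style_loss = sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
                     for g, s in zip(gen_feats, style_feats))
    return content_weight * content_loss + style_weight * style_loss

# Typically the generated image itself is the optimization variable, e.g.:
# generated = content_image.clone().requires_grad_(True)
# optimizer = torch.optim.LBFGS([generated])
```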

How Re-sampling Helps for Long-Tail Learning?

  • paper_url: http://arxiv.org/abs/2310.18236
  • repo_url: https://github.com/shijxcs/csa
  • paper_authors: Jiang-Xin Shi, Tong Wei, Yuke Xiang, Yu-Feng Li
  • for: Investigate whether re-sampling is effective in modern long-tail learning tasks.
  • methods: Experiments on two homogeneous datasets (one containing semantically irrelevant context, one not), plus a proposed context shift augmentation module that generates diverse training images for the tail classes by maintaining a context bank extracted from head-class images.
  • results: The proposed module boosts generalization and outperforms other approaches, including class-balanced re-sampling, decoupled classifier re-training, and data augmentation methods.
    Abstract Long-tail learning has received significant attention in recent years due to the challenge it poses with extremely imbalanced datasets. In these datasets, only a few classes (known as the head classes) have an adequate number of training samples, while the rest of the classes (known as the tail classes) are infrequent in the training data. Re-sampling is a classical and widely used approach for addressing class imbalance issues. Unfortunately, recent studies claim that re-sampling brings negligible performance improvements in modern long-tail learning tasks. This paper aims to investigate this phenomenon systematically. Our research shows that re-sampling can considerably improve generalization when the training images do not contain semantically irrelevant contexts. In other scenarios, however, it can learn unexpected spurious correlations between irrelevant contexts and target labels. We design experiments on two homogeneous datasets, one containing irrelevant context and the other not, to confirm our findings. To prevent the learning of spurious correlations, we propose a new context shift augmentation module that generates diverse training images for the tail class by maintaining a context bank extracted from the head-class images. Experiments demonstrate that our proposed module can boost the generalization and outperform other approaches, including class-balanced re-sampling, decoupled classifier re-training, and data augmentation methods. The source code is available at https://www.lamda.nju.edu.cn/code_CSA.ashx.
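Class-balanced re-sampling, the baseline under analysis, can be expressed with PyTorch's WeightedRandomSampler: each example is drawn with probability inversely proportional to its class frequency. The paper's context shift augmentation module is not reproduced here.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

labels = np.array([0] * 900 + [1] * 90 + [2] * 10)   # long-tailed toy labels
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]           # rarer class => higher weight

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
dataset = TensorDataset(torch.randn(len(labels), 8), torch.as_tensor(labels))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```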

Edge AI-Based Vein Detector for Efficient Venipuncture in the Antecubital Fossa

  • paper_url: http://arxiv.org/abs/2310.18234
  • repo_url: None
  • paper_authors: Edwin Salcedo, Patricia Peñaloza
  • for: This paper aims to improve vein localization for venipuncture in the antecubital fossa, particularly in patients whose veins are hard to see.
  • methods: It combines Near Infrared (NIR) imaging with deep learning (DL) for forearm vein segmentation.
  • results: The paper introduces a new NIR-based forearm vein segmentation dataset (2,016 labelled images from 1,008 subjects with low visible veins), a modified U-Net that locates veins specifically in the antecubital fossa, and a compressed model deployed in a bespoke portable vein finder; after testing four embedded microcomputers and four quantization modalities, Dynamic Range Quantization on a Raspberry Pi 4B gave the best balance of execution time and precision (5.14 FPS, 0.957 IoU).
    Abstract Assessing the condition and visibility of veins is a crucial step before obtaining intravenous access in the antecubital fossa, which is a common procedure to draw blood or administer intravenous therapies (IV therapies). Even though medical practitioners are highly skilled at intravenous cannulation, they usually struggle to perform the procedure in patients with low visible veins due to fluid retention, age, overweight, dark skin tone, or diabetes. Recently, several investigations proposed combining Near Infrared (NIR) imaging and deep learning (DL) techniques for forearm vein segmentation. Although they have demonstrated compelling results, their use has been rather limited owing to the portability and precision requirements to perform venipuncture. In this paper, we aim to contribute to bridging this gap using three strategies. First, we introduce a new NIR-based forearm vein segmentation dataset of 2,016 labelled images collected from 1,008 subjects with low visible veins. Second, we propose a modified U-Net architecture that locates veins specifically in the antecubital fossa region of the examined patient. Finally, a compressed version of the proposed architecture was deployed inside a bespoke, portable vein finder device after testing four common embedded microcomputers and four common quantization modalities. Experimental results showed that the model compressed with Dynamic Range Quantization and deployed on a Raspberry Pi 4B card produced the best execution time and precision balance, with 5.14 FPS and 0.957 of latency and Intersection over Union (IoU), respectively. These results show promising performance inside a resource-restricted low-cost device.
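Dynamic Range Quantization, the compression modality that gave the best speed/precision balance on the Raspberry Pi 4B, corresponds to the default TensorFlow Lite optimization path. Below is a generic conversion sketch with a stand-in model, not the paper's modified U-Net:

```python
import tensorflow as tf

# Stand-in for the trained Keras segmentation network.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                           input_shape=(224, 224, 1)),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic range quantization
tflite_model = converter.convert()

with open("vein_segmenter_drq.tflite", "wb") as f:
    f.write(tflite_model)
```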

TBDLNet: a network for classifying multidrug-resistant and drug-sensitive tuberculosis

  • paper_url: http://arxiv.org/abs/2310.18222
  • repo_url: None
  • paper_authors: Ziquan Zhu, Jing Tao, Shuihua Wang, Xin Zhang, Yudong Zhang
  • for: This study applies a novel deep-learning model, TBDLNet, to automatically recognize CT images and classify multidrug-resistant versus drug-sensitive tuberculosis.
  • methods: A pre-trained ResNet50 extracts features, and three randomized neural networks are used to alleviate overfitting; their ensemble, combined by majority voting, boosts robustness.
  • results: Under five-fold cross-validation the model achieves 0.9822 accuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826 F1-score. TBDLNet is suitable for distinguishing multidrug-resistant from drug-sensitive tuberculosis, enabling earlier detection of multidrug-resistant pulmonary tuberculosis so that treatment plans can be adjusted in time.
    Abstract This paper proposes applying a novel deep-learning model, TBDLNet, to recognize CT images to classify multidrug-resistant and drug-sensitive tuberculosis automatically. The pre-trained ResNet50 is selected to extract features. Three randomized neural networks are used to alleviate the overfitting problem. The ensemble of three RNNs is applied to boost the robustness via majority voting. The proposed model is evaluated by five-fold cross-validation. Five indexes are selected in this paper, which are accuracy, sensitivity, precision, F1-score, and specificity. The TBDLNet achieves 0.9822 accuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826 F1-score, respectively. The TBDLNet is suitable for classifying multidrug-resistant tuberculosis and drug-sensitive tuberculosis. It can detect multidrug-resistant pulmonary tuberculosis as early as possible, which helps to adjust the treatment plan in time and improve the treatment effect.
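The five reported indexes all follow from the binary confusion matrix; the helper below makes their definitions explicit (treating multidrug-resistant as the positive class is an assumption, and the example counts are made up).

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity (recall), specificity, precision, and F1-score
    computed from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision, f1=f1)

# Example with made-up counts (not the paper's data):
print(binary_metrics(tp=95, fp=2, tn=98, fn=5))
```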

Artifact-Robust Graph-Based Learning in Digital Pathology

  • paper_url: http://arxiv.org/abs/2310.18192
  • repo_url: None
  • paper_authors: Saba Heidari Gheshlaghi, Milan Aryal, Nasim Yahyasoltani, Masoud Ganji
  • for: This paper aims to develop a novel robust learning approach that accounts for perturbations in whole slide images (WSIs) for prostate cancer diagnosis.
  • methods: The proposed approach uses graph convolutional networks (GCNs) to extract features from the graph representing a WSI, followed by a denoiser and a transformer for classification.
  • results: The proposed model shows significant improvement in cancer diagnosis compared to non-robust algorithms, with accuracy and kappa scores improved by the denoiser and the use of GCNs.
    Abstract Whole slide images (WSIs) are digitized images of tissues placed in glass slides using advanced scanners. The digital processing of WSIs is challenging as they are gigapixel images and stored in multi-resolution format. A common challenge with WSIs is that perturbations/artifacts are inevitable during storing the glass slides and digitizing them. These perturbations include motion, which often arises from slide movement during placement, and changes in hue and brightness due to variations in staining chemicals and the quality of digitizing scanners. In this work, a novel robust learning approach to account for these artifacts is presented. Due to the size and resolution of WSIs and to account for neighborhood information, graph-based methods are called for. We use a graph convolutional network (GCN) to extract features from the graph representing the WSI. Through a denoiser and pooling layer, the effects of perturbations in WSIs are controlled and the output is followed by a transformer for the classification of different grades of prostate cancer. To compare the efficacy of the proposed approach, the model without denoiser is trained and tested with WSIs without any perturbation and then different perturbations are introduced in WSIs and passed through the network with the denoiser. The accuracy and kappa scores of the proposed model with prostate cancer dataset compared with non-robust algorithms show significant improvement in cancer diagnosis.
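The graph side of the pipeline can be illustrated with the standard GCN propagation rule applied to patch-level node features of a WSI graph; the denoiser, pooling, and transformer stages of the actual model are omitted, and the feature dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One layer of symmetric-normalized graph convolution (Kipf & Welling)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # adj: dense (N, N) adjacency over WSI patches; add self-loops, normalize.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ node_feats))

nodes = torch.rand(50, 256)                # e.g. 50 patch embeddings per slide
adj = (torch.rand(50, 50) > 0.9).float()   # toy adjacency
adj = ((adj + adj.t()) > 0).float()        # symmetrize
out = SimpleGCNLayer(256, 64)(nodes, adj)
```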

Semi-Supervised Panoptic Narrative Grounding

  • paper_url: http://arxiv.org/abs/2310.18142
  • repo_url: https://github.com/nini0919/sspng
  • paper_authors: Danni Yang, Jiayi Ji, Xiaoshuai Sun, Haowei Wang, Yinan Li, Yiwei Ma, Rongrong Ji
  • for: This work aims to advance Panoptic Narrative Grounding (PNG) by enabling training with limited labeled data.
  • methods: The authors propose a Semi-Supervised PNG (SS-PNG) learning scheme that uses a small set of labeled image-text pairs together with a larger unlabeled set. Since each pixel in PNG can belong to multiple open-ended nouns, existing multi-class semi-supervised segmentation frameworks do not apply directly, so a dedicated SS-PNG Network (SS-PNG-NW) is developed, with strategies such as Burn-In and data augmentation, plus a Quality-Based Loss Adjustment (QLA) to handle imbalanced pseudo-label quality.
  • results: Extensive experiments on PNG datasets show that SS-PNG-NW+ matches fully-supervised models across all data ratios and outperforms them by 0.8% and 1.1% with only 30% and 50% of the supervision data, respectively, demonstrating its practicality under limited annotations.
    Abstract Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG) remains hindered by costly annotations. In this paper, we introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) learning scheme, capitalizing on a smaller set of labeled image-text pairs and a larger set of unlabeled pairs to achieve competitive performance. Unlike visual segmentation tasks, PNG involves one pixel belonging to multiple open-ended nouns. As a result, existing multi-class based semi-supervised segmentation frameworks cannot be directly applied to this task. To address this challenge, we first develop a novel SS-PNG Network (SS-PNG-NW) tailored to the SS-PNG setting. We thoroughly investigate strategies such as Burn-In and data augmentation to determine the optimal generic configuration for the SS-PNG-NW. Additionally, to tackle the issue of imbalanced pseudo-label quality, we propose a Quality-Based Loss Adjustment (QLA) approach to adjust the semi-supervised objective, resulting in an enhanced SS-PNG-NW+. Employing our proposed QLA, we improve BCE Loss and Dice loss at pixel and mask levels, respectively. We conduct extensive experiments on PNG datasets, with our SS-PNG-NW+ demonstrating promising results comparable to fully-supervised models across all data ratios. Remarkably, our SS-PNG-NW+ outperforms fully-supervised models with only 30% and 50% supervision data, exceeding their performance by 0.8% and 1.1% respectively. This highlights the effectiveness of our proposed SS-PNG-NW+ in overcoming the challenges posed by limited annotations and enhancing the applicability of PNG tasks. The source code is available at https://github.com/nini0919/SSPNG.
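The pixel- and mask-level objectives mentioned in the abstract (BCE and Dice) can be weighted per pseudo-label to down-weight low-quality supervision. A simplified version of that idea, with the QLA scheme replaced by a generic per-sample quality weight, is sketched below.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred_probs, target, eps=1e-6):
    """1 - Dice coefficient between predicted probabilities and a binary mask."""
    inter = (pred_probs * target).sum(dim=(1, 2))
    union = pred_probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def weighted_seg_loss(logits, pseudo_masks, quality_weights):
    """Quality-weighted BCE (pixel level) + Dice (mask level) on pseudo-labels."""
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(
        logits, pseudo_masks, reduction="none").mean(dim=(1, 2))
    dice = soft_dice_loss(probs, pseudo_masks)
    return (quality_weights * (bce + dice)).mean()

logits = torch.randn(4, 64, 64)                  # predictions for 4 noun queries
pseudo_masks = (torch.rand(4, 64, 64) > 0.5).float()
quality = torch.tensor([1.0, 0.8, 0.3, 0.9])     # hypothetical pseudo-label quality
loss = weighted_seg_loss(logits, pseudo_masks, quality)
```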

Unsupervised Representation Learning for Diverse Deformable Shape Collections

  • paper_url: http://arxiv.org/abs/2310.18141
  • repo_url: None
  • paper_authors: Sara Hahner, Souhaib Attaiki, Jochen Garcke, Maks Ovsjanikov
  • for: This work develops a learning-based method for encoding and manipulating 3D surface meshes, creating an interpretable embedding space for deformable shape collections.
  • methods: Point-to-point maps between shapes are extracted in an unsupervised manner via the functional map paradigm, and a spectral pooling technique establishes a universal latent space that is independent of mesh connectivity and shape category.
  • results: The method achieves excellent reconstructions and produces more realistic and smoother interpolations than baseline approaches.
    Abstract We introduce a novel learning-based method for encoding and manipulating 3D surface meshes. Our method is specifically designed to create an interpretable embedding space for deformable shape collections. Unlike previous 3D mesh autoencoders that require meshes to be in a 1-to-1 correspondence, our approach is trained on diverse meshes in an unsupervised manner. Central to our method is a spectral pooling technique that establishes a universal latent space, breaking free from traditional constraints of mesh connectivity and shape categories. The entire process consists of two stages. In the first stage, we employ the functional map paradigm to extract point-to-point (p2p) maps between a collection of shapes in an unsupervised manner. These p2p maps are then utilized to construct a common latent space, which ensures straightforward interpretation and independence from mesh connectivity and shape category. Through extensive experiments, we demonstrate that our method achieves excellent reconstructions and produces more realistic and smoother interpolations than baseline approaches.

End-to-end Video Gaze Estimation via Capturing Head-face-eye Spatial-temporal Interaction Context

  • paper_url: http://arxiv.org/abs/2310.18131
  • repo_url: https://github.com/zgchen33/mcgaze
  • paper_authors: Yiran Guan, Zhuoguang Chen, Wenzheng Zeng, Zhiguo Cao, Yang Xiao
  • for: The paper proposes Multi-Clue Gaze (MCGaze), a new method for video gaze estimation that captures the spatial-temporal interaction context among head, face, and eye, an aspect that has not been well explored before.
  • methods: MCGaze solves clue localization for head, face, and eye jointly in a one-step manner with joint optimization; spatial-temporal context is exchanged among the clues, so the fused gaze features capture global clues from the head and face and local clues from the eyes simultaneously.
  • results: Experiments on the challenging Gaze360 dataset verify the superiority of the proposal, while the one-step design keeps running efficiency high.
    Abstract In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), to facilitate video gaze estimation via capturing spatial-temporal interaction context among head, face, and eye in an end-to-end learning way, which has not been well concerned yet. The main advantage of MCGaze is that the tasks of clue localization of head, face, and eye can be solved jointly for gaze estimation in a one-step way, with joint optimization to seek optimal performance. During this, spatial-temporal context exchange happens among the clues on the head, face, and eye. Accordingly, the final gazes obtained by fusing features from various queries can be aware of global clues from heads and faces, and local clues from eyes simultaneously, which essentially leverages performance. Meanwhile, the one-step running way also ensures high running efficiency. Experiments on the challenging Gaze360 dataset verify the superiority of our proposition. The source code will be released at https://github.com/zgchen33/MCGaze.

Direct Unsupervised Denoising

  • paper_url: http://arxiv.org/abs/2310.18116
  • repo_url: https://github.com/krulllab/DirectDenoiser
  • paper_authors: Benjamin Salmon, Alexander Krull
  • for: This paper proposes a new approach to unsupervised denoising that avoids the sampling cost of existing methods.
  • methods: Building on VAE-based unsupervised denoisers that require only unpaired noisy data, a deterministic network is trained alongside the VAE to directly predict a central tendency (the MMSE estimate) instead of averaging many posterior samples.
  • results: The method surpasses the sampling-based unsupervised approach at a fraction of the computational cost, since it does not need to draw large numbers of samples at inference time.
    Abstract Traditional supervised denoisers are trained using pairs of noisy input and clean target images. They learn to predict a central tendency of the posterior distribution over possible clean images. When, e.g., trained with the popular quadratic loss function, the network's output will correspond to the minimum mean square error (MMSE) estimate. Unsupervised denoisers based on Variational AutoEncoders (VAEs) have succeeded in achieving state-of-the-art results while requiring only unpaired noisy data as training input. In contrast to the traditional supervised approach, unsupervised denoisers do not directly produce a single prediction, such as the MMSE estimate, but allow us to draw samples from the posterior distribution of clean solutions corresponding to the noisy input. To approximate the MMSE estimate during inference, unsupervised methods have to create and draw a large number of samples - a computationally expensive process - rendering the approach inapplicable in many situations. Here, we present an alternative approach that trains a deterministic network alongside the VAE to directly predict a central tendency. Our method achieves results that surpass the results achieved by the unsupervised method at a fraction of the computational cost.
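The core idea, training a deterministic network to predict the central tendency that would otherwise require averaging many VAE posterior samples, can be caricatured as follows; the VAE itself, its training, and the exact target definition are abstracted behind a stand-in sampler.

```python
import torch
import torch.nn as nn

def vae_posterior_samples(noisy, num_samples=8):
    """Stand-in for drawing clean-image samples from a trained unsupervised
    denoising VAE given a noisy input (here: the noisy input plus jitter)."""
    return noisy.unsqueeze(0) + 0.05 * torch.randn(num_samples, *noisy.shape)

direct_denoiser = nn.Sequential(        # deterministic network to be trained
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(direct_denoiser.parameters(), lr=1e-3)

noisy = torch.rand(4, 1, 64, 64)
with torch.no_grad():
    target = vae_posterior_samples(noisy).mean(dim=0)   # approximate MMSE target
optimizer.zero_grad()
loss = nn.functional.mse_loss(direct_denoiser(noisy), target)
loss.backward()
optimizer.step()
```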

Classifier-head Informed Feature Masking and Prototype-based Logit Smoothing for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.18104
  • repo_url: None
  • paper_authors: Zhuohao Sun, Yiqiao Qiu, Zhijun Tan, Weishi Zheng, Ruixuan Wang
  • for: This work proposes an effective post-hoc out-of-distribution (OOD) detection method to counter the overconfident predictions neural networks make on OOD data in real-world deployments.
  • methods: A feature masking strategy keeps, at the penultimate layer, only the features important for the predicted in-distribution (ID) class (determined from the classifier-head weights) and masks the rest; a logit smoothing strategy uses the cosine similarity between the test feature and the predicted class prototype as an adaptive temperature on the logits to alleviate overconfidence on OOD data.
  • results: Experiments on multiple standard OOD detection benchmarks show improved OOD detection, compatibility with existing methods, and new state-of-the-art performance; the source code will be released publicly.
    Abstract Out-of-distribution (OOD) detection is essential when deploying neural networks in the real world. One main challenge is that neural networks often make overconfident predictions on OOD data. In this study, we propose an effective post-hoc OOD detection method based on a new feature masking strategy and a novel logit smoothing strategy. Feature masking determines the important features at the penultimate layer for each in-distribution (ID) class based on the weights of the ID class in the classifier head and masks the rest features. Logit smoothing computes the cosine similarity between the feature vector of the test sample and the prototype of the predicted ID class at the penultimate layer and uses the similarity as an adaptive temperature factor on the logit to alleviate the network's overconfidence prediction for OOD data. With these strategies, we can reduce feature activation of OOD data and enlarge the gap in OOD score between ID and OOD data. Extensive experiments on multiple standard OOD detection benchmarks demonstrate the effectiveness of our method and its compatibility with existing methods, with new state-of-the-art performance achieved from our method. The source code will be released publicly.
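A condensed reading of the two post-hoc steps (keeping only the penultimate features most weighted by the predicted class in the classifier head, and scaling the logits by the cosine similarity to that class's prototype) is sketched below. The masking ratio, the direction of the scaling, and the final OOD score are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_smoothed_logits(feat, W, b, prototypes, keep_ratio=0.5):
    """feat: (D,) penultimate feature of one test sample.
    W, b: classifier head weight (C, D) and bias (C,).
    prototypes: (C, D) per-class mean penultimate features from ID training data."""
    pred = (feat @ W.t() + b).argmax()                     # predicted ID class
    k = int(keep_ratio * feat.numel())
    keep = torch.topk(W[pred].abs(), k).indices            # classifier-head informed mask
    mask = torch.zeros_like(feat)
    mask[keep] = 1.0
    masked_feat = feat * mask
    sim = F.cosine_similarity(masked_feat, prototypes[pred], dim=0).clamp(min=0.0)
    # Assumed form: low prototype similarity (likely OOD) damps the logits,
    # reducing overconfident predictions; the paper's exact scaling may differ.
    return sim * (masked_feat @ W.t() + b)

feat = torch.randn(512)
W, b = torch.randn(10, 512), torch.randn(10)
prototypes = torch.randn(10, 512)
# Energy-style OOD score from the smoothed logits (lower => more OOD-like).
score = torch.logsumexp(masked_smoothed_logits(feat, W, b, prototypes), dim=0)
```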

A Chebyshev Confidence Guided Source-Free Domain Adaptation Framework for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.18087
  • repo_url: None
  • paper_authors: Jiesi Hu, Yanwu Yang, Xutao Guo, Jinghua Wang, Ting Ma
  • for: This paper addresses the accuracy deterioration of pseudo-labels (PLs) in source-free domain adaptation (SFDA), which matters in medical imaging scenarios where privacy concerns rule out access to source data.
  • methods: The proposed framework has three main components: (1) Chebyshev confidence guided SFDA, (2) confidence-guided denoising (direct denoising and prototypical denoising), and (3) a novel teacher-student joint training scheme (TJTS) with a confidence weighting module.
  • results: Extensive experiments in diverse domain scenarios demonstrate the effectiveness of the framework, which precisely estimates PL reliability, generates high-quality PLs, and outperforms state-of-the-art SFDA methods.
    Abstract Source-free domain adaptation (SFDA) aims to adapt models trained on a labeled source domain to an unlabeled target domain without the access to source data. In medical imaging scenarios, the practical significance of SFDA methods has been emphasized due to privacy concerns. Recent State-of-the-art SFDA methods primarily rely on self-training based on pseudo-labels (PLs). Unfortunately, PLs suffer from accuracy deterioration caused by domain shift, and thus limit the effectiveness of the adaptation process. To address this issue, we propose a Chebyshev confidence guided SFDA framework to accurately assess the reliability of PLs and generate self-improving PLs for self-training. The Chebyshev confidence is estimated by calculating probability lower bound of the PL confidence, given the prediction and the corresponding uncertainty. Leveraging the Chebyshev confidence, we introduce two confidence-guided denoising methods: direct denoising and prototypical denoising. Additionally, we propose a novel teacher-student joint training scheme (TJTS) that incorporates a confidence weighting module to improve PLs iteratively. The TJTS, in collaboration with the denoising methods, effectively prevents the propagation of noise and enhances the accuracy of PLs. Extensive experiments in diverse domain scenarios validate the effectiveness of our proposed framework and establish its superiority over state-of-the-art SFDA methods. Our paper contributes to the field of SFDA by providing a novel approach for precisely estimating the reliability of pseudo-labels and a framework for obtaining high-quality PLs, resulting in improved adaptation performance.
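The Chebyshev confidence is a probability lower bound on pseudo-label confidence computed from the prediction and its uncertainty. The one-sided Chebyshev (Cantelli) inequality yields such a bound in general; a generic version follows, and the estimator actually used in the paper may differ.

```python
def chebyshev_confidence_lower_bound(mean_conf: float, std: float,
                                     threshold: float) -> float:
    """Cantelli's inequality: for a random confidence X with mean `mean_conf`
    and standard deviation `std`, P(X > threshold) >= 1 - std^2 / (std^2 + t^2)
    where t = mean_conf - threshold > 0 (otherwise no useful bound)."""
    t = mean_conf - threshold
    if t <= 0:
        return 0.0
    return 1.0 - std ** 2 / (std ** 2 + t ** 2)

# Example: mean softmax confidence 0.9 with uncertainty 0.05 (e.g. across
# augmented or perturbed forward passes); how reliably does it exceed 0.7?
print(chebyshev_confidence_lower_bound(0.9, 0.05, 0.7))   # ~0.94
```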

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.18049
  • repo_url: None
  • paper_authors: Yucheng Suo, Linchao Zhu, Yi Yang
  • for: This work tackles zero-shot referring image segmentation: identifying the instance mask most related to a referring expression without training on pixel-level annotations.
  • methods: The Text Augmented Spatial-aware (TAS) framework combines an instance-level mask proposal network, a text-augmented visual-text matching score (adding a P-score and an N-score to the usual visual-text matching score), and a spatial rectifier for mask post-processing.
  • results: Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that the method clearly outperforms state-of-the-art zero-shot referring image segmentation approaches.
    Abstract In this paper, we study a challenging task of zero-shot referring image segmentation. This task aims to identify the instance mask that is most related to a referring expression without training on pixel-level annotations. Previous research takes advantage of pre-trained cross-modal models, e.g., CLIP, to align instance-level masks with referring expressions. Yet, CLIP only considers the global-level alignment of image-text pairs, neglecting fine-grained matching between the referring sentence and local image regions. To address this challenge, we introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework that is training-free and robust to various visual encoders. TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing. Notably, the text-augmented visual-text matching score leverages a $P$ score and an $N$-score in addition to the typical visual-text matching score. The $P$-score is utilized to close the visual-text domain gap through a surrogate captioning model, where the score is computed between the surrogate model-generated texts and the referring expression. The $N$-score considers the fine-grained alignment of region-text pairs via negative phrase mining, encouraging the masked image to be repelled from the mined distracting phrases. Extensive experiments are conducted on various datasets, including RefCOCO, RefCOCO+, and RefCOCOg. The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
    摘要 在这篇论文中,我们研究了零样本(zero-shot)引用图像分割这一具有挑战性的任务。该任务的目标是在没有像素级注释的情况下,确定与引用表达最相关的实例掩码。先前的研究利用预训练的跨模态模型(如CLIP)将实例级掩码与引用表达对齐。然而,CLIP只考虑了图像-文本对的全局对齐,忽略了引用句子与局部图像区域之间的细粒度匹配。为解决这一挑战,我们提出了一个无需训练、对不同视觉编码器均鲁棒的Text Augmented Spatial-aware(TAS)零样本引用图像分割框架。TAS包括一个用于实例级掩码提取的mask proposal网络、一个用于挖掘图像-文本相关性的文本增强视觉-文本匹配分数,以及一个用于掩码后处理的空间修正器。值得注意的是,文本增强的视觉-文本匹配分数在传统视觉-文本匹配分数之外,还利用了$P$-score和$N$-score。$P$-score通过一个surrogate captioning模型来弥合视觉-文本域差距,该分数计算surrogate模型生成的文本与引用表达之间的相似度。$N$-score通过负短语挖掘考虑区域-文本对的细粒度对齐,使掩码图像远离挖掘出的干扰短语。我们在RefCOCO、RefCOCO+和RefCOCOg等多个数据集上进行了广泛的实验,证明了我们的方法明显优于现有的零样本引用图像分割方法。
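To make the scoring pipeline concrete, the following hedged sketch combines the three ingredients named in the abstract (the plain visual-text score, the $P$-score from a surrogate caption, and the $N$-score from mined negative phrases) into one proposal-ranking function. The function name, the additive weights `alpha` and `beta`, and the use of cosine similarity over placeholder embeddings are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def tas_score(mask_emb, expr_emb, caption_emb, negative_embs, alpha=1.0, beta=1.0):
    """Rank one mask proposal for a referring expression.

    mask_emb      -- visual embedding of the image cropped/masked to the proposal
    expr_emb      -- text embedding of the referring expression
    caption_emb   -- text embedding of a caption generated for the masked image
                     by a surrogate captioning model (the P-score path)
    negative_embs -- text embeddings of mined distracting phrases (the N-score path)
    """
    v_score = cosine(mask_emb, expr_emb)                       # plain visual-text matching
    p_score = cosine(caption_emb, expr_emb)                    # text-text, narrows the domain gap
    n_score = max(cosine(mask_emb, n) for n in negative_embs)  # repulsion from distractors
    return v_score + alpha * p_score - beta * n_score

# Toy usage with random embeddings standing in for CLIP-like features.
rng = np.random.default_rng(0)
score = tas_score(rng.normal(size=512), rng.normal(size=512),
                  rng.normal(size=512), [rng.normal(size=512) for _ in range(3)])
print(round(score, 4))
```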

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image

  • paper_url: http://arxiv.org/abs/2310.17994
  • repo_url: None
  • paper_authors: Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, Jiajun Wu
  • for: 这篇论文是设计来 Synthesize single-image novel view for in-the-wild scenes, 对于现有的方法而言,这些方法只适用于单一物体的场景中,这篇论文提出了新的技术来解决受到野外多个物体和复杂背景的挑战。
  • methods: 这篇论文使用了一个3D-aware散射模型,ZeroNVS,并将其训练在一个混合数据源上,这个数据源包括物体中心、室内和室外场景。为了解决数据混合所引入的问题,例如深度尺度歧义,这篇论文提出了一个新的摄像头参数化和均衡方案。
  • results: 该论文的模型在 zero-shot 设定下的 LPIPS 指标上创下了新的最先进(state-of-the-art)纪录,甚至超过了专门在DTU上训练的方法。此外,论文还将Mip-NeRF 360 dataset用作单图像新视角合成的新 benchmark,并在该设定下展现了强大的性能。
    Abstract We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis, and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/
    摘要 我们介绍了一种3D感知的扩散模型ZeroNVS,用于真实(in-the-wild)场景的单图像新视角合成。现有方法通常面向带有掩码背景的单个物体,而我们提出了新的技术来应对真实场景中多物体与复杂背景带来的挑战。具体来说,我们在涵盖object-centric、室内和室外场景的混合数据源上训练生成式先验。为了解决数据混合引入的深度-尺度歧义等问题,我们提出了一种新的摄像头条件参数化与归一化方案。此外,我们发现Score Distillation Sampling (SDS)在对360度场景进行蒸馏时,容易截断复杂背景的分布,因此提出了"SDS anchoring"来提高合成新视角的多样性。我们的模型在DTU数据集的zero-shot设定下,以LPIPS指标创下了新的最先进(state-of-the-art)纪录,甚至超越了专门在DTU上训练的方法。此外,我们将具有挑战性的Mip-NeRF 360数据集用作单图像新视角合成的新benchmark,并在该设定下取得了出色的性能。我们的代码和数据可以在http://kylesargent.github.io/zeronvs/上找到。

FaultSeg Swin-UNETR: Transformer-Based Self-Supervised Pretraining Model for Fault Recognition

  • paper_url: http://arxiv.org/abs/2310.17974
  • repo_url: None
  • paper_authors: Zeren Zhang, Ran Chen, Jinwen Ma
  • for: Improve the accuracy of seismic fault recognition by introducing self-supervised pretraining on a large amount of relatively easy-to-obtain unlabeled seismic data.
  • methods: Swin Transformer as the core network with the SimMIM pretraining task to capture features related to discontinuities in seismic data; during fine-tuning, the Swin-UNETR structure is refined for multiscale decoding and fusion, inspired by edge detection techniques.
  • results: State-of-the-art performance on the Thebe dataset, as measured by the OIS and ODS metrics.
    Abstract This paper introduces an approach to enhance seismic fault recognition through self-supervised pretraining. Seismic fault interpretation holds great significance in the fields of geophysics and geology. However, conventional methods for seismic fault recognition encounter various issues, including dependence on data quality and quantity, as well as susceptibility to interpreter subjectivity. Currently, automated fault recognition methods proposed based on small synthetic datasets experience performance degradation when applied to actual seismic data. To address these challenges, we have introduced the concept of self-supervised learning, utilizing a substantial amount of relatively easily obtainable unlabeled seismic data for pretraining. Specifically, we have employed the Swin Transformer model as the core network and employed the SimMIM pretraining task to capture unique features related to discontinuities in seismic data. During the fine-tuning phase, inspired by edge detection techniques, we have also refined the structure of the Swin-UNETR model, enabling multiscale decoding and fusion for more effective fault detection. Experimental results demonstrate that our proposed method attains state-of-the-art performance on the Thebe dataset, as measured by the OIS and ODS metrics.
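The pretraining stage described in the abstract follows SimMIM-style masked image modeling: random patches of an unlabeled seismic section are hidden and the encoder is trained to reconstruct their raw values. The sketch below shows that idea with a tiny stand-in encoder; the real work uses a Swin Transformer and different patch, loss, and masking settings, so every size and name here is an assumption.

```python
# Minimal SimMIM-style pretraining step on seismic patches (illustrative only;
# the actual FaultSeg Swin-UNETR configuration is not reproduced here).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the Swin Transformer encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Conv2d(1, dim, kernel_size=8, stride=8)   # 8x8 patches
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(dim, 8 * 8, kernel_size=1)          # predict raw pixels per patch

    def forward(self, x):
        z = torch.relu(self.mix(self.embed(x)))
        return self.head(z)                                        # (B, 64, H/8, W/8)

def simmim_step(model, images, mask_ratio=0.6):
    B, _, H, W = images.shape
    gh, gw = H // 8, W // 8
    mask = (torch.rand(B, 1, gh, gw) < mask_ratio).float()         # 1 = masked patch
    upsampled = mask.repeat_interleave(8, 2).repeat_interleave(8, 3)
    pred = model(images * (1 - upsampled))                          # encode the masked input
    target = nn.functional.unfold(images, 8, stride=8).reshape(B, 64, gh, gw)
    # L1 reconstruction loss computed only on the masked patches.
    loss = ((pred - target).abs() * mask).sum() / (mask.sum() * 64 + 1e-8)
    return loss

model = TinyEncoder()
loss = simmim_step(model, torch.randn(2, 1, 64, 64))
loss.backward()
print(float(loss))
```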

Multivessel Coronary Artery Segmentation and Stenosis Localisation using Ensemble Learning

  • paper_url: http://arxiv.org/abs/2310.17954
  • repo_url: None
  • paper_authors: Muhammad Bilal, Dinis Martinho, Reiner Sim, Adnan Qayyum, Hunaid Vohra, Massimo Caputo, Taofeek Akinosho, Sofiat Abioye, Zaheer Khan, Waleed Niaz, Junaid Qadir
  • for: 这项研究旨在提供一个基于机器学习的自动化诊断方案,以帮助心脏科医生诊断冠状动脉疾病(CAD)。
  • methods: 该研究使用结合多个基线模型的集成(ensemble)模型,并采用逐步提升性能的训练策略,包括多阶段的二分类预训练、多血管分割,以及基于类频率加权与F1课程学习的微调等。
  • results: 加权集成使所提方案的预测精度翻倍,并通过进一步纠正误分类的blob提升精度;最终 coronary artery segmentation 的 mean F1 score 为 37.69%,stenosis localisation 的 mean F1 score 为 39.41%。
    Abstract Coronary angiography analysis is a common clinical task performed by cardiologists to diagnose coronary artery disease (CAD) through an assessment of atherosclerotic plaque's accumulation. This study introduces an end-to-end machine learning solution developed as part of our solution for the MICCAI 2023 Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) challenge, which aims to benchmark solutions for multivessel coronary artery segmentation and potential stenotic lesion localisation from X-ray coronary angiograms. We adopted a robust baseline model training strategy to progressively improve performance, comprising five successive stages of binary class pretraining, multivessel segmentation, fine-tuning using class frequency weighted dataloaders, fine-tuning using F1-based curriculum learning strategy (F1-CLS), and finally multi-target angiogram view classifier-based collective adaptation. Unlike many other medical imaging procedures, this task exhibits a notable degree of interobserver variability, making it particularly amenable to automated analysis. Our ensemble model combines the outputs from six baseline models using the weighted ensembling approach, which our analysis shows doubles the predictive accuracy of the proposed solution. The final prediction was further refined, targeting the correction of misclassified blobs. Our solution achieved a mean F1 score of $37.69\%$ for coronary artery segmentation, and $39.41\%$ for stenosis localisation, positioning our team in the 5th position on both leaderboards. This work demonstrates the potential of automated tools to aid CAD diagnosis, guide interventions, and improve the accuracy of stent injections in clinical settings.
    摘要 冠状动脉造影(coronary angiography)分析是一项常见的临床任务,心脏科医生通过评估动脉粥样硬化斑块(atherosclerotic plaque)的积累来诊断冠状动脉疾病(CAD)。这项研究介绍了我们为MICCAI 2023自动区域化冠状动脉疾病诊断(ARCADE)挑战赛开发的端到端机器学习解决方案,该挑战旨在对X光冠状动脉造影图像进行多血管分割和潜在狭窄病变定位的基准评测。我们采用了一种稳健的基线模型训练策略,包括五个连续阶段:二分类预训练、多血管分割、使用类频率加权数据加载器的微调、基于F1的课程学习策略(F1-CLS)微调,以及最后的多目标造影视角分类器协同适应。与许多其他医疗影像任务不同,该任务存在显著的观察者间差异(interobserver variability),使其特别适合自动分析。我们的集成模型采用加权集成方法结合六个基线模型的输出,分析表明这使所提方案的预测精度翻倍。最终预测还进行了进一步的细化,以纠正误分类的blob。我们的方案在冠状动脉分割上取得了37.69%的平均F1分数,在狭窄定位上取得了39.41%的平均F1分数,在两个排行榜上均位列第五名。这项工作展示了自动化工具在辅助CAD诊断、指导介入治疗以及提高临床支架植入准确性方面的潜力。
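The weighted ensembling step is straightforward to illustrate: per-model class-probability maps are averaged with normalized weights before the per-pixel argmax. A minimal sketch, assuming the weights come from some validation score (the exact weighting used for ARCADE is not reproduced here):

```python
import numpy as np

def weighted_ensemble(prob_maps, weights):
    """Combine per-model class-probability maps (each of shape (C, H, W))
    into a single segmentation. Weights could, for instance, be each model's
    validation F1 score; they are normalized to sum to one before fusing.
    """
    w = np.asarray(weights, float)
    w = w / w.sum()
    fused = sum(wi * p for wi, p in zip(w, prob_maps))   # (C, H, W)
    return fused.argmax(axis=0)                          # per-pixel class id

# Toy usage: three models, 4 vessel classes, 8x8 image.
rng = np.random.default_rng(0)
maps = [rng.dirichlet(np.ones(4), size=(8, 8)).transpose(2, 0, 1) for _ in range(3)]
print(weighted_ensemble(maps, [0.38, 0.35, 0.27]).shape)  # (8, 8)
```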

Shape-centered Representation Learning for Visible-Infrared Person Re-identification

  • paper_url: http://arxiv.org/abs/2310.17952
  • repo_url: None
  • paper_authors: Shuang Li, Jiaxu Leng, Ji Gan, Mengjingcheng Mo, Xinbo Gao
  • for: 这个论文的主要目标是提高可见光-红外跨模态行人重识别(VI-ReID)的性能,尤其是应对可见光与红外之间的模态差异问题。
  • methods: 该论文提出了一种以形状为中心的表示学习框架Shape-centered Representation Learning(ScRL),包括Shape Feature Propagation(SFP)、Infrared Shape Restitution(ISR)和Appearance Feature Enhancement(AFE)等模块,以提升行人重识别性能。
  • results: 实验结果表明,ScRL在可见光-红外行人重识别任务中表现优异,在SYSU-MM01、HITSZ-VCM和RegDB数据集上的Rank-1(mAP)精度分别达到76.1%(72.6%)、71.2%(52.9%)和92.4%(86.7%)。
    Abstract Current Visible-Infrared Person Re-Identification (VI-ReID) methods prioritize extracting distinguishing appearance features, ignoring the natural resistance of body shape against modality changes. Initially, we gauged the discriminative potential of shapes by a straightforward concatenation of shape and appearance features. However, two unresolved issues persist in the utilization of shape features. One pertains to the dependence on auxiliary models for shape feature extraction in the inference phase, along with the errors in generated infrared shapes due to the intrinsic modality disparity. The other issue involves the inadequately explored correlation between shape and appearance features. To tackle the aforementioned challenges, we propose the Shape-centered Representation Learning framework (ScRL), which focuses on learning shape features and appearance features associated with shapes. Specifically, we devise the Shape Feature Propagation (SFP), facilitating direct extraction of shape features from original images with minimal complexity costs during inference. To restitute inaccuracies in infrared body shapes at the feature level, we present the Infrared Shape Restitution (ISR). Furthermore, to acquire appearance features related to shape, we design the Appearance Feature Enhancement (AFE), which accentuates identity-related features while suppressing identity-unrelated features guided by shape features. Extensive experiments are conducted to validate the effectiveness of the proposed ScRL. Achieving remarkable results, the Rank-1 (mAP) accuracy attains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM, RegDB datasets respectively, outperforming existing state-of-the-art methods.
    摘要 当前的可见光-红外行人重识别(VI-ReID)方法侧重于提取具有区分性的外观特征,忽视了人体形状对模态变化的天然抗性。我们首先通过简单拼接形状特征与外观特征来评估形状的判别潜力。然而,在使用形状特征时仍存在两个未解决的问题。其一是推理阶段依赖辅助模型来提取形状特征,且由于固有的模态差异,生成的红外形状存在误差;其二是形状特征与外观特征之间的相关性尚未得到充分探索。为了解决这些挑战,我们提出了 Shape-centered Representation Learning 框架(ScRL),它专注于学习形状特征以及与形状相关的外观特征。具体来说,我们设计了 Shape Feature Propagation(SFP),可在推理时以极小的复杂度开销直接从原始图像提取形状特征;提出了 Infrared Shape Restitution(ISR),用于在特征层面修复红外形状的误差;还设计了 Appearance Feature Enhancement(AFE),以形状特征为引导,强化身份相关特征并抑制身份无关特征。我们进行了广泛的实验来验证 ScRL 的有效性,结果显著:在 SYSU-MM01、HITSZ-VCM 和 RegDB 数据集上,Rank-1(mAP)精度分别达到 76.1%(72.6%)、71.2%(52.9%)和 92.4%(86.7%),超越了现有的最先进方法。
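One plausible reading of the Appearance Feature Enhancement idea is a shape-guided gating of appearance tokens. The sketch below is speculative: the attention form, the dimensions, and the internals of the module are assumptions, meant only to make the "shape features decide which appearance features are kept" idea concrete.

```python
import torch
import torch.nn as nn

class AppearanceFeatureEnhancement(nn.Module):
    """Rough sketch: shape features act as queries over appearance features,
    and the attended response re-weights the appearance map so that
    identity-unrelated responses are suppressed. Not the authors' design.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, shape_feat, app_feat):
        # shape_feat, app_feat: (B, N, dim) token sequences from the two branches
        attn = torch.softmax(
            self.q(shape_feat) @ self.k(app_feat).transpose(1, 2) * self.scale, dim=-1)
        gate = torch.sigmoid(attn @ app_feat)   # shape-conditioned gate per token
        return app_feat * gate                  # keep identity-related responses

afe = AppearanceFeatureEnhancement()
out = afe(torch.randn(2, 96, 256), torch.randn(2, 96, 256))
print(out.shape)   # torch.Size([2, 96, 256])
```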

Instance Segmentation under Occlusions via Location-aware Copy-Paste Data Augmentation

  • paper_url: http://arxiv.org/abs/2310.17949
  • repo_url: https://github.com/nguyendinhson-kaist/mmsports23-seg-autoid
  • paper_authors: Son Nguyen, Mikel Lainsa, Hung Dao, Daeyoung Kim, Giang Nguyen
  • for: 本研究主要针对计算机视觉中的 occlusion 问题进行解决,具体来说是在 instance segmentation 领域中。
  • methods: 本研究使用了一种新的数据增强技术,可以生成更多的训练样本,以及一种新的深度学习架构 Hybrid Task Cascade (HTC) 框架,以提高 segmentation 性能。
  • results: 本研究在 MMSports 2023 DeepSportRadar 比赛中取得了很好的结果,其中 occlusion 得分(OM)为 0.533,在排行榜上位列第一名。
    Abstract Occlusion is a long-standing problem in computer vision, particularly in instance segmentation. ACM MMSports 2023 DeepSportRadar has introduced a dataset that focuses on segmenting human subjects within a basketball context and a specialized evaluation metric for occlusion scenarios. Given the modest size of the dataset and the highly deformable nature of the objects to be segmented, this challenge demands the application of robust data augmentation techniques and wisely-chosen deep learning architectures. Our work (ranked 1st in the competition) first proposes a novel data augmentation technique, capable of generating more training samples with wider distribution. Then, we adopt a new architecture - Hybrid Task Cascade (HTC) framework with CBNetV2 as backbone and MaskIoU head to improve segmentation performance. Furthermore, we employ a Stochastic Weight Averaging (SWA) training strategy to improve the model's generalization. As a result, we achieve a remarkable occlusion score (OM) of 0.533 on the challenge dataset, securing the top-1 position on the leaderboard. Source code is available at this https://github.com/nguyendinhson-kaist/MMSports23-Seg-AutoID.
    摘要 遮挡(occlusion)是计算机视觉领域的长期难题,在实例分割中尤为突出。ACM MMSports 2023 DeepSportRadar 引入了一个专注于篮球场景下人体实例分割的数据集,以及针对遮挡情况的专门评价指标。由于数据集规模较小且待分割对象高度可形变,这一挑战需要应用鲁棒的数据增强技术和精心选择的深度学习架构。我们的工作(在比赛中排名第一)首先提出了一种新的数据增强技术,能够生成分布更广的更多训练样本;然后采用了新的架构,即以CBNetV2为骨干网络并带有MaskIoU头的Hybrid Task Cascade(HTC)框架,以提高分割性能;此外,我们还使用了Stochastic Weight Averaging(SWA)训练策略来提升模型的泛化能力。最终,我们在挑战数据集上取得了0.533的遮挡分数(OM),在排行榜上排名第一。源代码可以在以下链接中找到:https://github.com/nguyendinhson-kaist/MMSports23-Seg-AutoID。
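The copy-paste style augmentation can be illustrated with a minimal instance-pasting routine: a segmented source instance is translated to a chosen location in the destination image and its mask occludes earlier instances. Where the paste centers are sampled from (the "location-aware" part of the paper) is left to the caller; all names below are illustrative.

```python
import numpy as np

def copy_paste(dst_img, dst_masks, src_img, src_mask, center):
    """Paste one source instance (binary `src_mask` over `src_img`) into
    `dst_img` centred at `center`, and append its mask to `dst_masks`.
    """
    H, W = dst_img.shape[:2]
    ys, xs = np.nonzero(src_mask)
    if len(ys) == 0:
        return dst_img, dst_masks
    cy, cx = int(ys.mean()), int(xs.mean())
    dy, dx = center[0] - cy, center[1] - cx
    new_mask = np.zeros((H, W), dtype=bool)
    for y, x in zip(ys, xs):
        ny, nx = y + dy, x + dx
        if 0 <= ny < H and 0 <= nx < W:
            dst_img[ny, nx] = src_img[y, x]
            new_mask[ny, nx] = True
    # Occlude previously pasted instances where the new one lands on top.
    dst_masks = [m & ~new_mask for m in dst_masks] + [new_mask]
    return dst_img, dst_masks

# Toy usage on random 64x64 RGB images with a rectangular source instance.
rng = np.random.default_rng(0)
dst, src = rng.integers(0, 255, (64, 64, 3)), rng.integers(0, 255, (64, 64, 3))
m = np.zeros((64, 64), dtype=bool); m[10:30, 10:25] = True
out_img, out_masks = copy_paste(dst.copy(), [], src, m, center=(40, 40))
print(out_masks[0].sum())   # number of pasted pixels
```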

Diversifying Spatial-Temporal Perception for Video Domain Generalization

  • paper_url: http://arxiv.org/abs/2310.17942
  • repo_url: https://github.com/kunyulin/stdn
  • paper_authors: Kun-Yu Lin, Jia-Run Du, Yipeng Gao, Jiaming Zhou, Wei-Shi Zheng
  • for: 学习通用视频分类模型,并在不同目标领域中进行泛化。
  • methods: 利用多种空间和时间维度的多cue学习,以找到可能的领域不受影响的cue。
  • results: 在三个不同类型的benchmark上进行了广泛的实验,证明了我们的方法的有效性和多样性。
    Abstract Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain. A critical challenge of video domain generalization is to defend against the heavy reliance on domain-specific cues extracted from the source domain when recognizing target videos. To this end, we propose to perceive diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific cues. We contribute a novel model named Spatial-Temporal Diversification Network (STDN), which improves the diversity from both space and time dimensions of video data. First, our STDN proposes to discover various types of spatial cues within individual frames by spatial grouping. Then, our STDN proposes to explicitly model spatial-temporal dependencies between video contents at multiple space-time scales by spatial-temporal relation modeling. Extensive experiments on three benchmarks of different types demonstrate the effectiveness and versatility of our approach.
    摘要 视频领域通用化目标在培养源领域中学习通用的视频分类模型,以便在目标领域中进行推理。一个关键的挑战是防止在目标视频识别中过重依赖源领域特有的特征。为此,我们提议利用视频中的多样化空间-时间特征,找到可能的领域不受影响的特征。我们提出了一种新的模型,即空间-时间多样化网络(STDN),它在视频数据中提高多样化性。首先,我们的 STDN 提出了在个体帧中发现多种空间特征的方法,并进行空间组合。然后,我们的 STDN 利用多个空间-时间尺度的空间-时间关系模型,以模拟视频内容之间的空间-时间相互关系。我们在三个不同类型的 benchmark 上进行了广泛的实验,并证明了我们的方法的有效性和多样性。

DocStormer: Revitalizing Multi-Degraded Colored Document Images to Pristine PDF

  • paper_url: http://arxiv.org/abs/2310.17910
  • repo_url: None
  • paper_authors: Chaowei Liu, Jichun Li, Yihua Teng, Chaoqun Wang, Nuo Xu, Jihao Wu, Dandan Tu
  • for: 将多重退化的彩色文档图像恢复(Restoration)至其潜在的原始PDF版本
  • methods: 基于"Perceive-then-Restore"范式的强化 transformer 块,结合 GAN 与原始PDF杂志图像,以减少退化并提高视觉质量
  • results: 实验结果显示,DocStormer 能有效恢复多重退化的彩色文档图像,从方法、数据和任务角度填补了当前学术领域的空白。
    Abstract For capturing colored document images, e.g. posters and magazines, it is common that multiple degradations such as shadows, wrinkles, etc., are simultaneously introduced due to external factors. Restoring multi-degraded colored document images is a great challenge, yet overlooked, as most existing algorithms focus on enhancing color-ignored document images via binarization. Thus, we propose DocStormer, a novel algorithm designed to restore multi-degraded colored documents to their potential pristine PDF. The contributions are: firstly, we propose a "Perceive-then-Restore" paradigm with a reinforced transformer block, which more effectively encodes and utilizes the distribution of degradations. Secondly, we are the first to utilize GAN and pristine PDF magazine images to narrow the distribution gap between the enhanced results and PDF images, in pursuit of less degradation and better visual quality. Thirdly, we propose a non-parametric strategy, PFILI, which enables a smaller training scale and larger testing resolutions with acceptable detail trade-off, while saving memory and inference time. Fourthly, we are the first to propose a novel Multi-Degraded Colored Document image Enhancing dataset, named MD-CDE, for both training and evaluation. Experimental results show that the DocStormer exhibits superior performance, capable of revitalizing multi-degraded colored documents into their potential pristine digital versions, which fills the current academic gap from the perspective of method, data, and task.

Impressions: Understanding Visual Semiotics and Aesthetic Impact

  • paper_url: http://arxiv.org/abs/2310.17887
  • repo_url: None
  • paper_authors: Julia Kruk, Caleb Ziems, Diyi Yang
  • for: investigate the semiotics of images and how specific visual features and design choices can elicit specific emotions, thoughts, and beliefs.
  • methods: design an annotation task heavily inspired by image analysis techniques in the Visual Arts to collect image-caption pairs and unique annotations exploring impact, pragmatic image description, impressions, and aesthetic design choices.
  • results: existing multimodal image captioning and conditional generation models struggle to simulate plausible human responses to images, but this dataset significantly improves their ability to model impressions and aesthetic evaluations of images through fine-tuning and few-shot adaptation.
    Abstract Is aesthetic impact different from beauty? Is visual salience a reflection of its capacity for effective communication? We present Impressions, a novel dataset through which to investigate the semiotics of images, and how specific visual features and design choices can elicit specific emotions, thoughts and beliefs. We posit that the impactfulness of an image extends beyond formal definitions of aesthetics, to its success as a communicative act, where style contributes as much to meaning formation as the subject matter. However, prior image captioning datasets are not designed to empower state-of-the-art architectures to model potential human impressions or interpretations of images. To fill this gap, we design an annotation task heavily inspired by image analysis techniques in the Visual Arts to collect 1,440 image-caption pairs and 4,320 unique annotations exploring impact, pragmatic image description, impressions, and aesthetic design choices. We show that existing multimodal image captioning and conditional generation models struggle to simulate plausible human responses to images. However, this dataset significantly improves their ability to model impressions and aesthetic evaluations of images through fine-tuning and few-shot adaptation.
    摘要 是美学影响与美的区别?视觉吸引力是通信效果的反映吗?我们介绍Impressions,一个新的数据集,用于探讨图像的 semiotics,并如何specific visual features和设计选择可以引发specific emotions, thoughts和beliefs。我们认为图像的吸引力不仅限于传统的美学定义,还包括图像作为通信行为的成功度,style与subject matter共同形成意义。但是,先前的图像描述数据集不适用于激发人类的印象或解释。为了填补这个空白,我们设计了一个基于图像分析技术的image描述任务,收集了1,440个图像-描述对和4,320个特有的批注,探讨影响、实用描述、印象和美学设计选择。我们发现,现有的多modal图像描述和条件生成模型在模拟人类对图像的回应方面表现不佳。但是,这个数据集可以大幅提高这些模型对图像印象和美学评价的能力。

Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations

  • paper_url: http://arxiv.org/abs/2310.17880
  • repo_url: None
  • paper_authors: Tristan Aumentado-Armstrong, Ashkan Mirzaei, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski
  • for: This paper aims to improve the efficiency of Neural Radiance Fields (NeRFs) for 3D scene representation while maintaining high image quality.
  • methods: An autoencoder (AE) is combined with a NeRF: latent features are rendered and then convolutionally decoded to produce novel views.
  • results: Compared with standard colour-space NeRFs, the latent-space NeRF produces higher-quality novel views while rendering over three times faster; shrinking the AE architecture further trades a small quality drop for over 13 times faster rendering.
    Abstract Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations, capable of high quality novel view synthesis of complex scenes. While NeRFs have been applied to graphics, vision, and robotics, problems with slow rendering speed and characteristic visual artifacts prevent adoption in many use cases. In this work, we investigate combining an autoencoder (AE) with a NeRF, in which latent features (instead of colours) are rendered and then convolutionally decoded. The resulting latent-space NeRF can produce novel views with higher quality than standard colour-space NeRFs, as the AE can correct certain visual artifacts, while rendering over three times faster. Our work is orthogonal to other techniques for improving NeRF efficiency. Further, we can control the tradeoff between efficiency and image quality by shrinking the AE architecture, achieving over 13 times faster rendering with only a small drop in performance. We hope that our approach can form the basis of an efficient, yet high-fidelity, 3D scene representation for downstream tasks, especially when retaining differentiability is useful, as in many robotics scenarios requiring continual learning.
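The efficiency gain comes from rendering a low-dimensional latent feature image with the NeRF and leaving final RGB synthesis to a small convolutional AE decoder. The toy decoder below only illustrates that division of labour; channel counts, layer choices, and the upsampling factor are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class LatentDecoder(nn.Module):
    """Stand-in for the AE decoder that maps rendered latents to RGB."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):            # feat: (B, latent_dim, H, W) from volume rendering
        return self.net(feat)           # (B, 3, 2H, 2W) RGB image

decoder = LatentDecoder()
rendered_features = torch.randn(1, 16, 64, 64)   # stand-in for the NeRF's rendered latents
rgb = decoder(rendered_features)
print(rgb.shape)                                  # torch.Size([1, 3, 128, 128])
```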

Siamese-DETR for Generic Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2310.17875
  • repo_url: None
  • paper_authors: Qiankun Liu, Yichen Li, Yuqi Jiang, Ying Fu
  • for: 本研究的目的是提出一种简单而有效的 Generic Multi-Object Tracking (GMOT) 方法,以便在不同场景中检测和跟踪动态对象。
  • methods: 本研究提出 Siamese-DETR 方法,仅利用常用的 detection 数据集(如 COCO)进行训练,并引入一种动态匹配训练策略以充分利用已有标注。
  • results: 实验结果显示,Siamese-DETR 在 GMOT-40 数据集上表现出色,大幅超越现有的 MOT 方法。
    Abstract The ability to detect and track the dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects belonging to the pre-defined closed-set categories. Recently, Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) are proposed to track interested objects beyond pre-defined categories with the given text prompt and template image. However, the expensive well pre-trained (vision-)language model and fine-grained category annotations are required to train OVMOT models. In this paper, we focus on GMOT and propose a simple but effective method, Siamese-DETR, for GMOT. Only the commonly used detection datasets (e.g., COCO) are required for training. Different from existing GMOT methods, which train a Single Object Tracking (SOT) based detector to detect interested objects and then apply a data association based MOT tracker to get the trajectories, we leverage the inherent object queries in DETR variants. Specifically: 1) The multi-scale object queries are designed based on the given template image, which are effective for detecting different scales of objects with the same category as the template image; 2) A dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, which takes full advantage of provided annotations; 3) The online tracking pipeline is simplified through a tracking-by-query manner by incorporating the tracked boxes in previous frame as additional query boxes. The complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on GMOT-40 dataset by a large margin.
    摘要 检测和跟踪不同场景中动态对象的能力是自动驾驶、机器人导航等实际应用的基础。然而,传统的多目标跟踪(MOT)仅限于跟踪预定义封闭类别集合中的对象。最近提出的开放词汇MOT(OVMOT)和通用MOT(GMOT)可依据给定的文本提示和模板图像,跟踪预定义类别之外的感兴趣对象。然而,训练OVMOT模型需要昂贵的、预训练良好的(视觉-)语言模型以及细粒度的类别标注。本文聚焦于GMOT,提出了一种简单而有效的方法Siamese-DETR,只需常用的检测数据集(如COCO)即可训练。与现有GMOT方法先训练基于单目标跟踪(SOT)的检测器来检测感兴趣对象、再用基于数据关联的MOT跟踪器获得轨迹不同,我们利用了DETR变体中固有的对象查询。具体来说:1)基于给定模板图像设计多尺度对象查询,可有效检测与模板图像同类别、不同尺度的对象;2)引入动态匹配训练策略,在常用检测数据集上训练Siamese-DETR,充分利用已有标注;3)通过将上一帧跟踪框作为额外查询框,以tracking-by-query的方式简化在线跟踪流程,用简单得多的非极大值抑制(NMS)取代复杂的数据关联。大量实验结果表明,Siamese-DETR在GMOT-40数据集上大幅超越现有的MOT方法。
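A rough sketch of the simplified online tracking step: boxes carried over from the previous frame are merged with the current detections by plain score-ordered NMS, so no explicit data-association step is needed. This is a hypothetical simplification of the pipeline described above; in the actual method the previous boxes are fed to the DETR decoder as additional query boxes rather than concatenated after detection.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def track_by_query(track_boxes, det_boxes, det_scores, nms_thr=0.5):
    """Track boxes from the previous frame are given the highest score, so a
    new detection that overlaps an existing track is simply suppressed by NMS
    (the track keeps its identity) instead of being matched explicitly.
    """
    boxes = list(track_boxes) + list(det_boxes)
    scores = [1.0] * len(track_boxes) + list(det_scores)   # favour existing tracks
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < nms_thr for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]

tracks = [(10, 10, 50, 50)]
dets = [(12, 11, 52, 49), (100, 100, 140, 150)]
print(track_by_query(tracks, dets, [0.9, 0.8]))   # keeps the track box and the new object
```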

SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.17874
  • repo_url: https://github.com/mc-lan/smooseg
  • paper_authors: Mengcheng Lan, Xinjiang Wang, Yiping Ke, Jiaxing Xu, Litong Feng, Wayne Zhang
  • for: 这个论文主要针对不具有人工标注的图像分割 tasks,即将图像分割为Semantic groups而不需要人工标注。
  • methods: 我们提出了一个 novel 的方法,即 SmooSeg,它利用自我supervised learning 方法来模型对观测到的变化之间的关系,并将这些变化映射到Semantic groups中。我们还引入了一个新的平滑性损失函数,它可以在不同的Semantic groups之间实现平滑的变化,同时保留不同Semantic groups之间的关系。
  • results: 根据我们的实验结果,SmooSeg 可以对 COCOStuff、Cityscapes 和 Potsdam-3 等三个数据集进行高效的分割,并且与 STEGO 相比,SmooSeg 可以提高 pixel accuracy 的表现。具体来说,在 COCOStuff 数据集上,SmooSeg 可以提高 pixel accuracy 的表现+14.9%,在 Cityscapes 数据集上提高 +13.0%,在 Potsdam-3 数据集上提高 +5.7%。
    Abstract Unsupervised semantic segmentation is a challenging task that segments images into semantic groups without manual annotation. Prior works have primarily focused on leveraging prior knowledge of semantic consistency or priori concepts from self-supervised learning methods, which often overlook the coherence property of image segments. In this paper, we demonstrate that the smoothness prior, asserting that close features in a metric space share the same semantics, can significantly simplify segmentation by casting unsupervised semantic segmentation as an energy minimization problem. Under this paradigm, we propose a novel approach called SmooSeg that harnesses self-supervised learning methods to model the closeness relationships among observations as smoothness signals. To effectively discover coherent semantic segments, we introduce a novel smoothness loss that promotes piecewise smoothness within segments while preserving discontinuities across different segments. Additionally, to further enhance segmentation quality, we design an asymmetric teacher-student style predictor that generates smoothly updated pseudo labels, facilitating an optimal fit between observations and labeling outputs. Thanks to the rich supervision cues of the smoothness prior, our SmooSeg significantly outperforms STEGO in terms of pixel accuracy on three datasets: COCOStuff (+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%).
    摘要 无监督semantic segmentation是一项复杂的任务,它的目标是将图像分割成semantic组without manual annotation. 先前的研究主要依靠自动学习方法来激活先前的semantic consistency或self-supervised learning方法,这些方法经常忽视图像分割的coherence性质. 在这篇论文中,我们表明了smoothness prior,即close features in a metric space share the same semantics,可以大大简化segmentation。 在这个思想下,我们提出了一种新的方法called SmooSeg,它利用self-supervised learning方法来表示observations的closeness关系作为smoothness信号。 为了有效发现coherent semantic segments,我们引入了一种新的smoothness loss,该损失函数激活piecewise smoothness within segments while preserving discontinuities across different segments。 此外,我们还设计了一种异形 teacher-student 预测器,该预测器可以生成smoothly updated pseudo labels,使得observations和labeling输出之间进行优化的适应。 由于smoothness prior提供了丰富的监督信号,我们的SmooSeg在COCOStuff (+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%)三个数据集上都显著超过STEGO的像素准确率。
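The smoothness prior can be made concrete with a toy pairwise loss: neighbouring positions whose (frozen) backbone features are similar are penalized for disagreeing segment assignments, while dissimilar neighbours are left free so boundaries survive. This is not SmooSeg's exact loss; the affinity threshold `tau` and the squared-probability disagreement term are assumptions used only for illustration.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(features, logits, tau=0.5):
    """features: (B, C, H, W) frozen backbone features
    logits:   (B, K, H, W) segmentation logits
    """
    probs = F.softmax(logits, dim=1)
    feats = F.normalize(features, dim=1)

    def pair_term(shift_dim):
        f1, f2 = feats, torch.roll(feats, shifts=1, dims=shift_dim)
        p1, p2 = probs, torch.roll(probs, shifts=1, dims=shift_dim)
        affinity = (f1 * f2).sum(1)              # cosine similarity of neighbours
        weight = torch.relu(affinity - tau)      # only sufficiently close features count
        disagreement = ((p1 - p2) ** 2).sum(1)   # assignment mismatch between neighbours
        return (weight * disagreement).mean()

    return pair_term(2) + pair_term(3)           # vertical + horizontal neighbours

loss = smoothness_loss(torch.randn(1, 64, 32, 32),
                       torch.randn(1, 27, 32, 32, requires_grad=True))
loss.backward()
print(float(loss))
```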

Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering

  • paper_url: http://arxiv.org/abs/2310.17869
  • repo_url: None
  • paper_authors: Zijie Song, Zhenzhen Hu, Richang Hong
  • for: 无监督图像聚类(image clustering)的表示学习是计算机视觉领域不可或缺的基础技术,可帮助对图像进行有效的分组与识别。
  • methods: 该文章提出了一种基于拼图(jigsaw)策略的图像聚类方法,即Grid Jigsaw Representation(GJR),通过模拟人类拼图的方式,提高模型区分不同样本空间特征和聚类的能力。
  • results: 在多个标准 benchmark 数据集上的测试证明,GJR模块能显著提升图像聚类效果,并在收敛速度和精度上优于现有方法;此外,文章还提出了基于预训练的Grid Jigsaw Representation(pGJR)方法,进一步加快收敛并提升聚类效果。
    Abstract Unsupervised representation learning for image clustering is essential in computer vision. Although the advancement of visual models has improved image clustering with efficient visual representations, challenges still remain. Firstly, these features often lack the ability to represent the internal structure of images, hindering the accurate clustering of visually similar images. Secondly, the existing features tend to lack finer-grained semantic labels, limiting the ability to capture nuanced differences and similarities between images. In this paper, we first introduce Jigsaw based strategy method for image clustering called Grid Jigsaw Representation (GJR) with systematic exposition from pixel to feature in discrepancy against human and computer. We emphasize that this algorithm, which mimics human jigsaw puzzle, can effectively improve the model to distinguish the spatial feature between different samples and enhance the clustering ability. GJR modules are appended to a variety of deep convolutional networks and tested with significant improvements on a wide range of benchmark datasets including CIFAR-10, CIFAR-100/20, STL-10, ImageNet-10 and ImageNetDog-15. On the other hand, convergence efficiency is always an important challenge for unsupervised image clustering. Recently, pretrained representation learning has made great progress and released models can extract mature visual representations. It is obvious that use the pretrained model as feature extractor can speed up the convergence of clustering where our aim is to provide new perspective in image clustering with reasonable resource application and provide new baseline. Further, we innovate pretrain-based Grid Jigsaw Representation (pGJR) with improvement by GJR. The experiment results show the effectiveness on the clustering task with respect to the ACC, NMI and ARI three metrics and super fast convergence speed.
    摘要 无监督表示学习对图像聚类而言是计算机视觉中不可或缺的一部分。虽然视觉模型的进步带来了高效的视觉表示,改善了图像聚类,但仍存在一些挑战。首先,这些特征通常缺乏表示图像内部结构的能力,阻碍了对视觉相似图像的准确聚类;其次,现有特征往往缺乏更细粒度的语义标签,限制了捕捉图像之间细微差异与相似性的能力。在这篇论文中,我们首先介绍了基于拼图(Jigsaw)策略的图像聚类方法,即Grid Jigsaw Representation(GJR),并从像素到特征层面系统阐述其与人类和计算机认知方式的差异。我们强调这种模仿人类拼图过程的算法,可以有效提高模型区分不同样本空间特征的能力,从而增强聚类能力。GJR模块被附加到多种深度卷积网络中,并在CIFAR-10、CIFAR-100/20、STL-10、ImageNet-10和ImageNetDog-15等多个benchmark数据集上取得了显著提升。另一方面,收敛效率始终是无监督图像聚类的重要挑战。最近,预训练表示学习取得了很大进展,公开的模型已能提取成熟的视觉表示。显然,使用预训练模型作为特征提取器可以加速聚类的收敛;我们的目标是在合理的资源开销下,为图像聚类提供新的视角与新的基线。此外,我们在GJR的基础上进一步提出了基于预训练的Grid Jigsaw Representation(pGJR)。实验结果表明,pGJR在聚类任务的ACC、NMI和ARI三个指标上均表现出色,并具有超快的收敛速度。
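A toy version of the grid-jigsaw operation itself is easy to write down: split a feature map into grid blocks and permute them, forcing the model to rely on local structure rather than absolute position. How GJR wires the permuted blocks into the clustering pipeline (and into the pretrained pGJR variant) is not reproduced here; all names below are illustrative.

```python
import numpy as np

def grid_jigsaw(feature_map, grid=4, rng=None):
    """Split a (C, H, W) feature map into grid x grid blocks and permute them."""
    rng = np.random.default_rng() if rng is None else rng
    C, H, W = feature_map.shape
    bh, bw = H // grid, W // grid
    blocks = [feature_map[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(blocks))          # random jigsaw permutation
    out = np.empty_like(feature_map)
    for k, idx in enumerate(order):
        i, j = divmod(k, grid)
        out[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw] = blocks[idx]
    return out

shuffled = grid_jigsaw(np.random.rand(8, 32, 32), grid=4)
print(shuffled.shape)   # (8, 32, 32)
```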

What You See Is What You Detect: Towards better Object Densification in 3D detection

  • paper_url: http://arxiv.org/abs/2310.17842
  • repo_url: https://github.com/orbis36/wysiwyd
  • paper_authors: Tianran Liu, Zeping Zhang Morteza Mousa Pasandi, Robert Laganiere
  • for: The paper is written for improving the accuracy of 3D object detection from Lidar signals, specifically addressing the issue of object completion in 3D perception.
  • methods: The paper proposes a visible part completion method that requires only a small number of prediction points, which is based on a mesh-deformation-based approach to augment the point set associated with visible foreground objects. The method consists of two parts: an Intra-Frustum Segmentation Transformer (IFST) and a Mesh Depth Completion Network(MDCNet).
  • results: The paper shows that the proposed method can provide up to 12.2% performance improvements over most of the public baseline models on the KITTI and NuScenes datasets, bringing the state-of-the-art to a new level.
    Abstract Recent works have demonstrated the importance of object completion in 3D Perception from Lidar signal. Several methods have been proposed in which modules were used to densify the point clouds produced by laser scanners, leading to better recall and more accurate results. Pursuing in that direction, we present, in this work, a counter-intuitive perspective: the widely-used full-shape completion approach actually leads to a higher error-upper bound especially for far away objects and small objects like pedestrians. Based on this observation, we introduce a visible part completion method that requires only 11.3\% of the prediction points that previous methods generate. To recover the dense representation, we propose a mesh-deformation-based method to augment the point set associated with visible foreground objects. Considering that our approach focuses only on the visible part of the foreground objects to achieve accurate 3D detection, we named our method What You See Is What You Detect (WYSIWYD). Our proposed method is thus a detector-independent model that consists of 2 parts: an Intra-Frustum Segmentation Transformer (IFST) and a Mesh Depth Completion Network (MDCNet) that predicts the foreground depth from mesh deformation. This way, our model does not require the time-consuming full-depth completion task used by most pseudo-lidar-based methods. Our experimental evaluation shows that our approach can provide up to 12.2\% performance improvements over most of the public baseline models on the KITTI and NuScenes datasets, bringing the state-of-the-art to a new level. The codes will be available at https://github.com/Orbis36/WYSIWYD
    摘要 最近的研究表明,物体补全对基于激光雷达信号的3D感知十分重要。已有多种方法提出相应模块来稠密化激光扫描仪生成的点云,从而提高召回率并获得更准确的结果。沿着这一方向,我们在这项工作中提出了一个与直觉相反的观点:广泛使用的全形状补全方法实际上会导致更高的误差上界,对远距离物体和行人等小物体尤其如此。基于这一观察,我们引入了一种可见部分补全方法,只需此前方法所生成预测点的11.3%。为了恢复稠密表示,我们提出一种基于网格变形(mesh deformation)的方法来扩充可见前景物体的点集。由于我们的方法只关注前景物体的可见部分来实现准确的3D检测,我们将其命名为What You See Is What You Detect(WYSIWYD)。所提方法与检测器无关,由两部分组成:Intra-Frustum Segmentation Transformer(IFST)和Mesh Depth Completion Network(MDCNet),后者通过网格变形预测前景深度。这样,我们的模型无需大多数pseudo-lidar方法所依赖的耗时全深度补全任务。实验评估表明,我们的方法在KITTI和NuScenes数据集上相比多数公开基线模型可带来最高12.2%的性能提升,将最先进水平推向新的高度。代码将在 https://github.com/Orbis36/WYSIWYD 上提供。