2023-11-10

cs.CV

Flatness-aware Adversarial Attack

paper_url: http://arxiv.org/abs/2311.06423
repo_url: None
paper_authors: Mingyuan Fan, Xiaodan Li, Cen Chen, Yinggui Wang
for: This paper aims to exploit the transferability of adversarial examples to launch black-box attacks.
methods: The paper proposes the flatness-aware adversarial attack (FAA), which adds a flatness-aware regularization term to the optimization target and uses an approximation that avoids constructing the Hessian matrix.
results: Compared with state-of-the-art baselines, FAA considerably improves the transferability of the crafted adversarial examples.

Abstract
The transferability of adversarial examples can be exploited to launch black-box attacks. However, adversarial examples often present poor transferability. To alleviate this issue, by observing that the diversity of inputs can boost transferability, input regularization based methods are proposed, which craft adversarial examples by combining several transformed inputs. We reveal that input regularization based methods make resultant adversarial examples biased towards flat extreme regions. Inspired by this, we propose an attack called flatness-aware adversarial attack (FAA) which explicitly adds a flatness-aware regularization term in the optimization target to promote the resultant adversarial examples towards flat extreme regions. The flatness-aware regularization term involves gradients of samples around the resultant adversarial examples but optimizing gradients requires the evaluation of Hessian matrix in high-dimension spaces which generally is intractable. To address the problem, we derive an approximate solution to circumvent the construction of Hessian matrix, thereby making FAA practical and cheap. Extensive experiments show the transferability of adversarial examples crafted by FAA can be considerably boosted compared with state-of-the-art baselines.

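Summary
The transferability of adversarial examples can be exploited to attack black-box models, but adversarial examples often transfer poorly. Input regularization based methods address this by exploiting input diversity: they craft adversarial examples by combining several transformed inputs. The authors find that these methods bias the resulting adversarial examples towards flat extreme regions. Motivated by this, they propose the flatness-aware adversarial attack (FAA), which adds a flatness-aware regularization term to the optimization target. The term involves gradients of samples around the adversarial example, and optimizing it would require evaluating the Hessian matrix in a high-dimensional space, which is generally intractable. The authors therefore derive an approximate solution that avoids constructing the Hessian, making FAA practical and cheap. Experiments show that FAA considerably improves transferability over state-of-the-art baselines.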
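
The abstract above notes that the flatness-aware term would require the Hessian, and that FAA sidesteps its construction. The paper's own approximation is not given in this digest. As a generic illustration of how second-order information can be used without forming the Hessian, the following minimal sketch approximates a Hessian-vector product by a finite difference of gradients. The toy quadratic loss, the matrix `A`, and all function names are assumptions made for the example.

```python
# Illustrative only: a generic finite-difference Hessian-vector product,
# not the approximation derived in the FAA paper.

def grad_quadratic(x, A):
    """Gradient of the toy loss f(x) = 0.5 * x^T A x (A symmetric), i.e. A x."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def hvp_finite_difference(grad_fn, x, v, h=1e-4):
    """Approximate H(x) v as (grad(x + h v) - grad(x)) / h.

    Only two gradient evaluations are needed; the Hessian is never formed.
    """
    shifted = [xi + h * vi for xi, vi in zip(x, v)]
    g0 = grad_fn(x)
    g1 = grad_fn(shifted)
    return [(b - a) / h for a, b in zip(g0, g1)]


A = [[2.0, 1.0], [1.0, 3.0]]  # hypothetical symmetric matrix (the true Hessian of f)
x = [0.5, -1.0]
v = [1.0, 2.0]
approx = hvp_finite_difference(lambda p: grad_quadratic(p, A), x, v)
```

For a quadratic loss the finite difference is exact up to rounding, so `approx` should agree with the product `A v` here. For a real network the step size `h` trades truncation error against numerical noise.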

EviPrompt: A Training-Free Evidential Prompt Generation Method for Segment Anything Model in Medical Images

paper_url: http://arxiv.org/abs/2311.06400
repo_url: None
paper_authors: Yinsong Xu, Jiaqi Tang, Aidong Men, Qingchao Chen
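for: This paper proposes a training-free evidential prompt generation method to address the need for expert intervention and the domain gap in medical image segmentation.
methods: The method builds on the inherent similarities within medical images and is training-free. It requires only a single reference image-annotation pair, which greatly reduces labeling and computational costs.
results: The method automatically generates suitable evidential prompts, improving the applicability and usefulness of SAM in medical image segmentation. Evaluations across a broad range of tasks and modalities confirm its efficacy.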

Abstract
Medical image segmentation has immense clinical applicability but remains a challenge despite advancements in deep learning. The Segment Anything Model (SAM) exhibits potential in this field, yet the requirement for expertise intervention and the domain gap between natural and medical images poses significant obstacles. This paper introduces a novel training-free evidential prompt generation method named EviPrompt to overcome these issues. The proposed method, built on the inherent similarities within medical images, requires only a single reference image-annotation pair, making it a training-free solution that significantly reduces the need for extensive labeling and computational resources. First, to automatically generate prompts for SAM in medical images, we introduce an evidential method based on uncertainty estimation without the interaction of clinical experts. Then, we incorporate the human prior into the prompts, which is vital for alleviating the domain gap between natural and medical images and enhancing the applicability and usefulness of SAM in medical scenarios. EviPrompt represents an efficient and robust approach to medical image segmentation, with evaluations across a broad range of tasks and modalities confirming its efficacy.

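Summary
Medical image segmentation has great clinical value but remains challenging despite progress in deep learning. The Segment Anything Model (SAM) shows potential in this area, yet the need for expert intervention and the domain gap between natural and medical images are significant obstacles. This paper introduces EviPrompt, a training-free evidential prompt generation method, to overcome these issues. The method builds on the inherent similarities within medical images and needs only a single reference image-annotation pair, which greatly reduces labeling and computational requirements. First, an evidential method based on uncertainty estimation generates prompts for SAM automatically, without interaction from clinical experts. Then, a human prior is incorporated into the prompts, which is key to narrowing the domain gap between natural and medical images and to improving the applicability of SAM in medical scenarios. Evaluations across a range of tasks and modalities indicate that EviPrompt is an efficient and robust approach to medical image segmentation.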

A design of Convolutional Neural Network model for the Diagnosis of the COVID-19

paper_url: http://arxiv.org/abs/2311.06394
repo_url: https://github.com/Jafar-Abdollahi/Automated-detection-of-COVID-19-cases-using-deep-neural-networks-with-CTS-images
paper_authors: Xinyuan Song
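for: This study aims to provide an accurate method for recognizing COVID-19 in chest X-ray images, to help clinical centers and hospitals diagnose COVID-19.
methods: The method is a 19-layer convolutional neural network (CNN) for three-class (viral pneumonia, normal, COVID) and four-class (lung opacity, normal, COVID-19, pneumonia) classification. It is compared with several pretrained networks: Inception, Alexnet, ResNet50, Squeezenet, and VGG19.
results: Measured by specificity, accuracy, precision, sensitivity, confusion matrix, and F1-score, the proposed CNN outperforms existing published procedures. The method can be a useful tool for clinicians in diagnosing COVID-19.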

Abstract
With the spread of COVID-19 around the globe over the past year, the usage of artificial intelligence (AI) algorithms and image processing methods to analyze the chest X-ray images of patients with COVID-19 has become essential. Recognizing the COVID-19 virus in the lung area of a patient is one of the basic and essential needs of clinical centers and hospitals. Most research in this field has been devoted to papers on the basis of deep learning methods utilizing CNNs (Convolutional Neural Networks), which mainly deal with the screening of sick and healthy people. In this study, a new structure of a 19-layer CNN has been recommended for accurate recognition of COVID-19 from chest X-ray images. The offered CNN is developed to serve as a precise diagnosis system for a three-class (viral pneumonia, Normal, COVID) and a four-class classification (Lung opacity, Normal, COVID-19, and pneumonia). A comparison is conducted between the outcomes of the offered procedure and some popular pretrained networks, including Inception, Alexnet, ResNet50, Squeezenet, and VGG19, based on Specificity, Accuracy, Precision, Sensitivity, Confusion Matrix, and F1-score. The experimental results of the offered CNN method indicate its dominance over the existing published procedures. This method can be a useful tool for clinicians in deciding properly about COVID-19.

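Summary
With the global spread of COVID-19 over the past year, using artificial intelligence (AI) algorithms and image processing methods to analyze chest X-ray images of COVID-19 patients has become essential. Recognizing the virus in a patient's lungs is a basic need of clinical centers and hospitals. Most research has focused on deep learning methods based on CNNs (convolutional neural networks), mainly for separating sick from healthy people. This study proposes a new 19-layer CNN structure for accurately recognizing COVID-19 in X-ray images. The network is built for three-class (viral pneumonia, normal, COVID) and four-class (lung opacity, normal, COVID-19, pneumonia) classification. Its results are compared with popular pretrained networks (Inception, Alexnet, ResNet50, Squeezenet, and VGG19) using specificity, accuracy, precision, sensitivity, and the confusion matrix. The experiments indicate that the proposed CNN outperforms existing published methods, and it can be a useful tool for clinicians making decisions about COVID-19.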
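
The evaluation criteria listed above (specificity, accuracy, precision, sensitivity, F1-score) can all be read off a multi-class confusion matrix. The sketch below shows the standard one-vs-rest definitions. The counts in `confusion` are made up for illustration and are not results from the paper.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class index `cls`.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts; rows are true classes, columns are predictions,
# in the order viral pneumonia, normal, COVID.
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
covid = per_class_metrics(confusion, cls=2)
overall = accuracy(confusion)
```

In the multi-class setting, precision, sensitivity, and specificity are per-class quantities, so a comparison between networks usually reports them for each class or as an average.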

Towards A Unified Neural Architecture for Visual Recognition and Reasoning

paper_url: http://arxiv.org/abs/2311.06386
repo_url: None
paper_authors: Calvin Luo, Boqing Gong, Ting Chen, Chen Sun
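for: This paper addresses the two pillars of visual understanding: recognition and reasoning.
methods: The paper proposes a unified architecture, based on multi-task transformers, that handles both visual recognition and visual reasoning.
results: Object detection is found to be the most beneficial recognition task for reasoning, and object-centric representations emerge automatically inside the architecture. The study also examines how different architectural choices affect visual reasoning.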

Abstract
Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. Noticeably, we find that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. We further demonstrate via probing that implicit object-centric representations emerge automatically inside our framework. Intriguingly, we discover that certain architectural choices such as the backbone model of the visual encoder have a significant impact on visual reasoning, but little on object detection. Given the results of our experiments, we believe that visual reasoning should be considered as a first-class citizen alongside visual recognition, as they are strongly correlated but benefit from potentially different design choices.

Summary
Recognition and reasoning are the two pillars of visual understanding. The two tasks have received unequal attention: neural networks have shown strong empirical performance on visual recognition in recent years, while visual reasoning has lagged behind. Unifying the two tasks in a single framework is desirable because they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, the authors propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. The framework enables a principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. The results show that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. Probing further shows that implicit object-centric representations emerge automatically. Certain architectural choices, such as the backbone model of the visual encoder, have a significant impact on visual reasoning but little on object detection. Given these results, the authors argue that visual reasoning should be considered a first-class citizen alongside visual recognition, as the two are strongly correlated but benefit from potentially different design choices.

Image Classification using Combination of Topological Features and Neural Networks

paper_url: http://arxiv.org/abs/2311.06375
repo_url: None
paper_authors: Mariana Dória Prata Lima, Gilson Antonio Giraldi, Gastão Florêncio Miranda Junior
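for: This study uses persistent homology, a common technique in topological data analysis (TDA), to extract essential topological features from the data space and combine them with deep learning features for classification.
methods: A filtration is first constructed from a complex. Persistent homology classes are then computed, and their evolution along the filtration is visualized through the persistence diagram. Vectorization techniques make this topological information compatible with machine learning algorithms.
results: The approach classifies images from multiple classes in the MNIST dataset and is compared with baselines. The analysis indicates that topological information may increase neural network accuracy in multi-class classification, at the price of the computational complexity of the persistent homology calculation. To the authors' knowledge, this is the first work to combine deep learning features with topological features for multi-class classification.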

Abstract
In this work we use the persistent homology method, a technique in topological data analysis (TDA), to extract essential topological features from the data space and combine them with deep learning features for classification tasks. In TDA, the concepts of complexes and filtration are building blocks. Firstly, a filtration is constructed from some complex. Then, persistent homology classes are computed, and their evolution along the filtration is visualized through the persistence diagram. Additionally, we applied vectorization techniques to the persistence diagram to make this topological information compatible with machine learning algorithms. This was carried out with the aim of classifying images from multiple classes in the MNIST dataset. Our approach inserts topological features into deep learning approaches composed of single- and two-stream neural network architectures based on a multi-layer perceptron (MLP) and a convolutional neural network (CNN) tailored for multi-class classification in the MNIST dataset. In our analysis, we evaluated the obtained results and compared them with the outcomes achieved through the baselines that are available in the TensorFlow library. The main conclusion is that topological information may increase neural network accuracy in multi-class classification tasks at the price of the computational complexity of the persistent homology calculation. To the best of our knowledge, it is the first work that combines deep learning features and the combination of topological features for multi-class classification tasks.

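Summary
This work uses persistent homology, a technique from topological data analysis (TDA), to extract essential topological features from the data space and combine them with deep learning features for classification. In TDA, complexes and filtrations are the building blocks. First, a filtration is constructed from a complex. Then persistent homology classes are computed, and their evolution along the filtration is visualized through the persistence diagram. Vectorization techniques are applied so that this topological information is compatible with machine learning algorithms. The goal is to classify images in the MNIST dataset. The approach combines topological features with single-stream and two-stream neural network architectures (based on a multi-layer perceptron and a convolutional neural network) for multi-class classification. The results are evaluated and compared with baselines available in the TensorFlow library. The conclusion is that topological information may increase neural network accuracy in multi-class classification, at the cost of the computational complexity of the persistent homology calculation. To the authors' knowledge, this is the first work to combine deep learning features with topological features for multi-class classification.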
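
The pipeline described above (filtration, persistent homology classes, vectorization) can be illustrated in its simplest case. The sketch below computes 0-dimensional persistence for a point cloud under a Vietoris-Rips filtration using union-find: every connected component is born at scale 0 and dies at the length of the edge that merges it into another component. The point cloud and the zero-padded `top_lifetimes` vectorization are assumptions made for the example, and the paper's own complexes and vectorization may differ.

```python
import math
from itertools import combinations


def h0_deaths(points):
    """Death scales of the finite 0-dimensional classes of a point cloud.

    Edges are processed in order of increasing length (Kruskal-style).
    Each merge of two components kills one class at that edge length.
    One component never dies and is not reported.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


def top_lifetimes(deaths, size):
    """Fixed-length feature vector: the longest lifetimes, zero-padded."""
    longest = sorted(deaths, reverse=True)[:size]
    return longest + [0.0] * (size - len(longest))


cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]  # hypothetical points
deaths = h0_deaths(cloud)
features = top_lifetimes(deaths, size=4)
```

A fixed-length vector such as `features` is what allows the topological information to be concatenated with, or fed alongside, learned features in a neural network.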

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

paper_url: http://arxiv.org/abs/2311.06242
repo_url: None
paper_authors: Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
for: Florence-2 is a novel vision foundation model that performs a variety of computer vision and vision-language tasks from simple text-based instructions.
methods: Florence-2 uses a sequence-to-sequence structure and is trained on large-scale, high-quality annotated data for versatile and comprehensive vision tasks.
results: Florence-2 demonstrates strong zero-shot and fine-tuning capabilities, making it a competitive vision foundation model across a variety of tasks.

Abstract
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

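Summary
Florence-2 is a new vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. Existing large vision models excel at transfer learning but struggle to perform diverse tasks from simple instructions, a capability that requires handling varied spatial hierarchy and semantic granularity. Florence-2 takes a text prompt as the task instruction and produces the result in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task setup requires large-scale, high-quality annotated data. To that end, the authors co-developed FLD-5B, which contains 5.4 billion comprehensive visual annotations on 126 million images, built with an iterative strategy of automated image annotation and model refinement. A sequence-to-sequence structure is used to train Florence-2 on versatile and comprehensive vision tasks. Extensive evaluations show Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.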
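
The abstract states that results are produced in text form even for spatial tasks such as object detection. One common way to express a bounding box as text is to quantize each coordinate into a fixed number of bins and emit one token per coordinate. The following is a minimal sketch of that idea. The `<loc_k>` token format and the choice of 1000 bins are assumptions for illustration and are not confirmed by the abstract.

```python
import re


def box_to_tokens(box, width, height, bins=1000):
    """Serialize (x0, y0, x1, y1) in pixels as four quantized location tokens."""
    x0, y0, x1, y1 = box

    def quantize(value, size):
        index = int(value / size * bins)
        return min(max(index, 0), bins - 1)  # clamp to a valid bin

    ids = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return "".join(f"<loc_{i}>" for i in ids)


def tokens_to_box(text, width, height, bins=1000):
    """Decode location tokens back to pixel coordinates (bin centres)."""
    ids = [int(match) for match in re.findall(r"<loc_(\d+)>", text)]
    sizes = [width, height, width, height]
    return [(i + 0.5) / bins * size for i, size in zip(ids, sizes)]


encoded = box_to_tokens((320, 240, 640, 480), width=640, height=480)
decoded = tokens_to_box(encoded, width=640, height=480)
```

With such a scheme a detection becomes an ordinary output string (a label followed by location tokens), so a single sequence-to-sequence decoder can serve captioning and localization alike. The round trip loses at most one bin width of precision per coordinate.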

Learning Human Action Recognition Representations Without Real Humans

paper_url: http://arxiv.org/abs/2311.06231
repo_url: https://github.com/howardzh01/ppma
paper_authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris
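for: This paper studies whether models for human action recognition can be pre-trained on data that contains no images of real humans.
methods: The paper proposes a new pre-training strategy, Privacy-Preserving MAE-Align, which combines real-world data with humans removed and synthetic data to improve the pre-trained model.
results: Experiments show that Privacy-Preserving MAE-Align improves the pre-trained model and narrows the performance gap between human and no-human action recognition representations. The paper also provides an open benchmark for reproducible research.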

Abstract
Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to alleviate these problems by blurring faces, downsampling videos, or training on synthetic data. On the other hand, analysis on the transferability of privacy-preserving pre-trained models to downstream tasks has been limited. In this work, we study this problem by first asking the question: can we pre-train models for human action recognition with data that does not include real humans? To this end, we present, for the first time, a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Furthermore, we propose a novel pre-training strategy, called Privacy-Preserving MAE-Align, to effectively combine synthetic data and human-removed real data. Our approach outperforms previous baselines by up to 5% and closes the performance gap between human and no-human action recognition representations on downstream tasks, for both linear probing and fine-tuning. Our benchmark, code, and models are available at https://github.com/howardzh01/PPMA .

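Summary
Pre-training on massive video datasets has become essential for high action recognition performance. However, most large-scale video datasets contain images of people and therefore raise privacy, ethics, and data protection issues, which often prevents them from being shared publicly for reproducible research. Existing work has tried to mitigate these problems by blurring faces, downsampling videos, or training on synthetic data, but analysis of how well privacy-preserving pre-trained models transfer has been limited. This work asks: can models for human action recognition be pre-trained on data that does not include real humans? To answer it, the authors present a benchmark that uses real-world videos with humans removed and synthetic data containing virtual humans for pre-training. They evaluate how well the representation learned on this data transfers to downstream action recognition tasks, and propose a new pre-training strategy, Privacy-Preserving MAE-Align. The approach outperforms previous baselines by up to 5% and closes the performance gap between human and no-human action recognition representations. The benchmark, code, and models are available at https://github.com/howardzh01/PPMA.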

Diffusion Models for Earth Observation Use-cases: from cloud removal to urban change detection

paper_url: http://arxiv.org/abs/2311.06222
repo_url: https://github.com/furio1999/EO_Diffusion
paper_authors: Fulvio Sanguigni, Mikolaj Czerkawski, Lorenzo Papa, Irene Amerini, Bertrand Le Saux
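for: This paper demonstrates the use of diffusion models on satellite image data and proposes three practical use cases.
methods: The paper applies diffusion models to cloud removal and inpainting, dataset generation for change-detection tasks, and urban replanning.
results: The paper reports promising results for the three use cases: cloud removal and inpainting, dataset generation, and urban replanning.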

Abstract
The advancements in the state of the art of generative Artificial Intelligence (AI) brought by diffusion models can be highly beneficial in novel contexts involving Earth observation data. After introducing this new family of generative models, this work proposes and analyses three use cases which demonstrate the potential of diffusion-based approaches for satellite image data. Namely, we tackle cloud removal and inpainting, dataset generation for change-detection tasks, and urban replanning.

Summary
The advances in generative artificial intelligence (AI) brought by diffusion models can be highly beneficial for Earth observation data. After introducing this new family of generative models, the work proposes and analyses three use cases: cloud removal and inpainting, dataset generation for change-detection tasks, and urban replanning.

Semantic-aware Video Representation for Few-shot Action Recognition

paper_url: http://arxiv.org/abs/2311.06218
repo_url: None
paper_authors: Yutao Tang, Benjamin Bejar, Rene Vidal
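for: The paper aims to improve few-shot action recognition and to address the reliance of existing methods on 2D frame-level representations, their ineffective integration of textual semantics, and their need for complex distance functions.
methods: The paper proposes a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model. It uses a 3D feature extractor directly, together with an effective feature-fusion scheme and a simple cosine similarity for classification, and achieves better performance without extra components for temporal modeling or complex distance functions.
results: Experiments on five few-shot action recognition benchmarks under various settings show that SAFSAR significantly improves on the state of the art.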

Abstract
Recent work on action recognition leverages 3D features and textual information to achieve state-of-the-art performance. However, most of the current few-shot action recognition methods still rely on 2D frame-level representations, often require additional components to model temporal relations, and employ complex distance functions to achieve accurate alignment of these representations. In addition, existing methods struggle to effectively integrate textual semantics, some resorting to concatenation or addition of textual and visual features, and some using text merely as an additional supervision without truly achieving feature fusion and information transfer from different modalities. In this work, we propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues. We show that directly leveraging a 3D feature extractor combined with an effective feature-fusion scheme, and a simple cosine similarity for classification can yield better performance without the need of extra components for temporal modeling or complex distance functions. We introduce an innovative scheme to encode the textual semantics into the video representation which adaptively fuses features from text and video, and encourages the visual encoder to extract more semantically consistent features. In this scheme, SAFSAR achieves alignment and fusion in a compact way. Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves the state-of-the-art performance.

Summary
In this work, we propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues. Our approach leverages a 3D feature extractor combined with an effective feature-fusion scheme and a simple cosine similarity for classification, which improves performance without the need for extra components for temporal modeling or complex distance functions. We also introduce an innovative scheme to encode textual semantics into the video representation, which adaptively fuses features from text and video and encourages the visual encoder to extract more semantically consistent features. This scheme allows for compact and effective alignment and fusion of textual and visual information. Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves the state-of-the-art performance.
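
The classification step described above relies on a simple cosine similarity rather than a complex distance function. The sketch below shows one way such a classifier can look in a few-shot episode: each class is represented by the mean of its support features, and a query is assigned to the class with the highest cosine similarity. Averaging support features into a prototype, the feature vectors, and the class names are assumptions for illustration, and the 3D feature extractor and text fusion of SAFSAR are not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query, support):
    """Return (label, score) of the most similar class prototype.

    `support` maps each class label to a list of feature vectors.
    """
    best_label, best_score = None, -2.0  # cosine similarity is at least -1
    for label, features in support.items():
        prototype = [sum(column) / len(features) for column in zip(*features)]
        score = cosine_similarity(query, prototype)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical 2-way 2-shot episode with 2-dimensional features.
support = {
    "jump": [[1.0, 0.0], [0.9, 0.1]],
    "run": [[0.0, 1.0], [0.1, 0.9]],
}
label, score = classify([0.8, 0.2], support)
```

Because cosine similarity ignores vector magnitude, the comparison depends only on the direction of the features, which keeps the classifier free of learned distance parameters.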

Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

paper_url: http://arxiv.org/abs/2311.06214
repo_url: None
paper_authors: Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, Sai Bi
for: This paper aims to generate high-quality and diverse 3D assets from text prompts in a feed-forward manner.
methods: The proposed method Instant3D uses a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor.
results: The method can generate high-quality, diverse and Janus-free 3D assets within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours.

Abstract
Text-to-3D with diffusion models has achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization, which suffers from slow inference, low diversity and Janus problems, or are feed-forward methods that generate low-quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate high-quality, diverse and Janus-free 3D assets within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: https://jiahao.ai/instant3d/.

Summary
In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm. First, we generate a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model. Then, we directly regress the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate high-quality, diverse, and Janus-free 3D assets within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage is https://jiahao.ai/instant3d/.

ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation

paper_url: http://arxiv.org/abs/2311.06211
repo_url: None
paper_authors: Zhide Zhong, Jiakai Cao, Songen Gu, Sirui Xie, Weibo Gao, Liyi Luo, Zike Yan, Hao Zhao, Guyue Zhou
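for: This paper proposes an object-wise neural radiance field as a panoptic representation for compositional and realistic simulation of objects and scenes.
methods: The method uses a new scene node data structure that stores the information of each object in a unified way, allowing online interaction in both intra-scene and cross-scene settings. Each node combines a differentiable neural network with the associated bounding box and semantic features, so objects can be edited through mouse/keyboard controls or language instructions.
results: Experiments show that the method achieves scalable, realistic simulation through interactive editing and compositional rendering, producing color images, depth images, and panoptic segmentation masks in a 3D-consistent manner.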

Abstract
We present ASSIST, an object-wise neural radiance field as a panoptic representation for compositional and realistic simulation. Central to our approach is a novel scene node data structure that stores the information of each object in a unified fashion, allowing online interaction in both intra- and cross-scene settings. By incorporating a differentiable neural network along with the associated bounding box and semantic features, the proposed structure guarantees user-friendly interaction on independent objects to scale up novel view simulation. Objects in the scene can be queried, added, duplicated, deleted, transformed, or swapped simply through mouse/keyboard controls or language instructions. Experiments demonstrate the efficacy of the proposed method, where scaled realistic simulation can be achieved through interactive editing and compositional rendering, with color images, depth images, and panoptic segmentation masks generated in a 3D consistent manner.

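Summary
ASSIST is an object-wise neural radiance field that serves as a panoptic representation for compositional and realistic simulation. At its core is a new scene node data structure that stores the information of each object in a unified way, enabling online interaction within a scene and across scenes. By combining a differentiable neural network with the associated bounding box and semantic features, the structure supports user-friendly interaction: objects can be queried, added, duplicated, deleted, transformed, or swapped through mouse/keyboard controls or language instructions. Experiments show that the method supports interactive editing and compositional rendering, producing 3D-consistent color images, depth images, and panoptic segmentation masks.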
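
The scene node idea described above, one record per object holding its network, bounding box, and semantics, with edit operations defined on top, can be sketched as a small data structure. Everything below is a hypothetical illustration: the field names, the `model_id` placeholder standing in for per-object network weights, and the `Scene` container are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneNode:
    """One object in the scene (hypothetical fields)."""
    name: str
    semantic_label: str
    bbox_min: Vec3
    bbox_max: Vec3
    model_id: str  # placeholder handle for the object's neural field weights

    def translated(self, offset: Vec3) -> "SceneNode":
        """Return a copy whose bounding box is shifted by `offset`."""
        new_min = tuple(a + o for a, o in zip(self.bbox_min, offset))
        new_max = tuple(a + o for a, o in zip(self.bbox_max, offset))
        return replace(self, bbox_min=new_min, bbox_max=new_max)


class Scene:
    """Container supporting add, delete, duplicate, and query by label."""

    def __init__(self) -> None:
        self.nodes: Dict[str, SceneNode] = {}

    def add(self, node: SceneNode) -> None:
        self.nodes[node.name] = node

    def delete(self, name: str) -> None:
        del self.nodes[name]

    def duplicate(self, name: str, new_name: str, offset: Vec3) -> SceneNode:
        copy = replace(self.nodes[name].translated(offset), name=new_name)
        self.add(copy)
        return copy

    def query(self, label: str) -> List[str]:
        return [n.name for n in self.nodes.values() if n.semantic_label == label]


scene = Scene()
scene.add(SceneNode("chair_0", "chair", (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), "nerf_chair"))
scene.add(SceneNode("table_0", "table", (2.0, 0.0, 0.0), (4.0, 1.0, 2.0), "nerf_table"))
clone = scene.duplicate("chair_0", "chair_1", offset=(1.0, 0.0, 0.0))
chairs = scene.query("chair")
```

Keeping each object self-contained is what makes cross-scene edits cheap in this sketch: moving a node between two `Scene` instances only moves one record, and the duplicate shares the same `model_id` rather than copying weights.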

An Automated Pipeline for Tumour-Infiltrating Lymphocyte Scoring in Breast Cancer

paper_url: http://arxiv.org/abs/2311.06185
repo_url: https://github.com/adamshephard/tiager
paper_authors: Adam J Shephard, Mostafa Jahanifar, Ruoyu Wang, Muhammad Dawood, Simon Graham, Kastytis Sidlauskas, Syed Ali Khurram, Nasir M Rajpoot, Shan E Ahmed Raza
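for: This study uses deep learning to compute a TILs score for breast cancer whole slide images, to support prognosis assessment.
methods: The pipeline first segments tumour and stroma regions, then detects TILs within the tumour-associated stroma and generates a TILs score. It is based on the Efficient-UNet architecture and shows state-of-the-art performance in tumour/stroma segmentation and TILs detection.
results: The automated TILs score is competitive in predicting survival outcomes for breast cancer patients, and the pipeline closely mirrors the pathologist's workflow.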

Abstract
Tumour-infiltrating lymphocytes (TILs) are considered valuable prognostic markers in both triple-negative and human epidermal growth factor receptor 2 (HER2) breast cancer. In this study, we introduce an innovative deep learning pipeline based on the Efficient-UNet architecture to compute a TILs score for breast cancer whole slide images. Our pipeline first segments tumour-stroma regions and generates a tumour bulk mask. Subsequently, it detects TILs within the tumour-associated stroma, generating a TILs score by closely mirroring the pathologist's workflow. Our method exhibits state-of-the-art performance in segmenting tumour/stroma areas and TILs detection, as demonstrated by internal cross-validation on the TiGER Challenge training dataset and evaluation on the final leaderboards. Additionally, our TILs score proves competitive in predicting survival outcomes within the same challenge, underscoring the clinical relevance and potential of our automated TILs scoring system as a breast cancer prognostic tool.

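Summary
Tumour-infiltrating lymphocytes (TILs) are regarded as valuable prognostic markers in triple-negative and human epidermal growth factor receptor 2 (HER2) breast cancer. This study presents a deep learning pipeline, based on the Efficient-UNet architecture, that computes a TILs score for breast cancer whole slide images. The pipeline first segments tumour-stroma regions and generates a tumour bulk mask. It then detects TILs within the tumour-associated stroma and generates a TILs score, closely following the pathologist's workflow. The method shows state-of-the-art performance in tumour/stroma segmentation and TILs detection, as shown by internal cross-validation on the TiGER Challenge training dataset and by evaluation on the final leaderboards. The TILs score is also competitive in predicting survival outcomes, which highlights the clinical relevance and potential of the automated scoring system as a breast cancer prognostic tool.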
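
The last step of the pipeline above turns a stroma mask and a set of TIL detections into a single score. The abstract does not give the formula. The sketch below assumes one plausible definition: the percentage of tumour-associated stroma area covered by detected TILs, with each detection counted as a fixed area. The mask, the detections, and the `cell_area_px` parameter are all hypothetical.

```python
def tils_score(stroma_mask, til_centres, cell_area_px=1):
    """Percentage of tumour-associated stroma covered by detected TILs.

    stroma_mask: 2D list of 0/1 values, 1 marking tumour-associated stroma.
    til_centres: list of (row, col) detection centres.
    cell_area_px: assumed area of one lymphocyte in pixels (hypothetical).
    """
    stroma_area = sum(sum(row) for row in stroma_mask)
    if stroma_area == 0:
        return 0.0
    # Only detections that fall inside the stroma contribute to the score.
    inside = sum(1 for r, c in til_centres if stroma_mask[r][c])
    return min(100.0, 100.0 * inside * cell_area_px / stroma_area)


# Hypothetical 4x4 region: the left two columns are tumour-associated stroma.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
detections = [(0, 0), (1, 1), (2, 3)]  # the last one lies outside the stroma
score = tils_score(mask, detections)
```

Restricting the count to the stroma mask is the part that reflects the described workflow: lymphocytes outside the tumour-associated stroma do not enter the score.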

Automatic Report Generation for Histopathology images using pre-trained Vision Transformers

paper_url: http://arxiv.org/abs/2311.06176
repo_url: None
paper_authors: Saurav Sengupta, Donald E. Brown
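for: This study aims to generate reports for histopathology images automatically.
methods: The study uses an existing pre-trained Vision Transformer in a two-step process: it first encodes 4096x4096 patches of the Whole Slide Image (WSI), and then serves as the encoder, paired with an LSTM decoder, for report generation.
results: The study obtains a fairly performant and portable report generation mechanism that takes the whole high-resolution image into account, not just the patches. It also uses representations from an existing powerful pre-trained hierarchical vision transformer and shows their usefulness for zero-shot classification as well as report generation.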

Abstract
Deep learning for histopathology has been successfully used for disease classification, image segmentation and more. However, combining image and text modalities using current state-of-the-art methods has been a challenge due to the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that using an existing pre-trained Vision Transformer in a two-step process of first using it to encode 4096x4096 sized patches of the Whole Slide Image (WSI) and then using it as the encoder and an LSTM decoder for report generation, we can build a fairly performant and portable report generation mechanism that takes into account the whole of the high resolution image, instead of just the patches. We are also able to use representations from an existing powerful pre-trained hierarchical vision transformer and show its usefulness in not just zero shot classification but also for report generation.

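Summary
Deep learning has been applied successfully in histopathology for disease classification, image segmentation, and more. Combining image and text modalities with current state-of-the-art methods remains difficult, however, mainly because histopathology images have very high resolution. Automatic report generation for histopathology images is one such challenge. This work shows that an existing pre-trained Vision Transformer can be used in two steps: first to encode 4096x4096 patches of the Whole Slide Image (WSI), and then as the encoder, together with an LSTM decoder, for report generation. The resulting mechanism is fairly performant and portable, and it takes the whole high-resolution image into account rather than only the patches. The authors also use representations from an existing pre-trained hierarchical Vision Transformer and show that they are useful for zero-shot classification as well as report generation.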

Deep Fast Vision: A Python Library for Accelerated Deep Transfer Learning Vision Prototyping

paper_url: http://arxiv.org/abs/2311.06169
repo_url: https://github.com/fabprezja/deep-fast-vision
paper_authors: Fabi Prezja
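for: Make deep learning for vision easier to use and more widely adopted, helping non-experts get started quickly.
methods: A Python library that streamlines the deep learning process; users obtain results through a simple nested dictionary definition.
results: A simple, scalable tool that bridges the complexity of existing deep learning frameworks and the needs of a diverse user base.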

Abstract
Deep learning-based vision is characterized by intricate frameworks that often necessitate a profound understanding, presenting a barrier to newcomers and limiting broad adoption. With many researchers grappling with the constraints of smaller datasets, there's a pronounced reliance on pre-trained neural networks, especially for tasks such as image classification. This reliance is further intensified in niche imaging areas where obtaining vast datasets is challenging. Despite the widespread use of transfer learning as a remedy to the small dataset dilemma, a conspicuous absence of tailored auto-ML solutions persists. Addressing these challenges is "Deep Fast Vision", a python library that streamlines the deep learning process. This tool offers a user-friendly experience, enabling results through a simple nested dictionary definition, helping to democratize deep learning for non-experts. Designed for simplicity and scalability, Deep Fast Vision appears as a bridge, connecting the complexities of existing deep learning frameworks with the needs of a diverse user base.

Summary
Deep learning-based vision is characterized by intricate frameworks that often require a deep understanding, which creates a barrier for newcomers and limits broad adoption. With many researchers working under the constraints of smaller datasets, there is a pronounced reliance on pre-trained neural networks, especially for tasks such as image classification. This reliance grows in niche imaging areas where large datasets are hard to obtain. Although transfer learning is widely used as a remedy for the small-dataset problem, tailored auto-ML solutions are conspicuously absent. "Deep Fast Vision" is a Python library that addresses these challenges by streamlining the deep learning process. It offers a user-friendly experience, producing results from a simple nested dictionary definition, and helps democratize deep learning for non-experts. Designed for simplicity and scalability, Deep Fast Vision acts as a bridge between the complexities of existing deep learning frameworks and the needs of a diverse user base.

An Evaluation of Forensic Facial Recognition

paper_url: http://arxiv.org/abs/2311.06145
repo_url: https://github.com/DanielDdungu/Real-Time-Face-Recognition
paper_authors: Justin Norman, Shruti Agarwal, Hany Farid
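for: This study evaluates how face recognition systems perform in real-world forensic scenarios, in which a low-resolution, low-quality, partially occluded image is compared against a standard facial database.
methods: The study builds a large-scale synthetic facial dataset and a controlled facial forensic lineup to simulate real-world conditions, and evaluates two popular neural-based recognition systems on the synthetic dataset and on a popular dataset of real faces.
results: Previously reported face recognition accuracies of more than 95% drop to as low as 65% in this more challenging forensic scenario.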

Abstract
Recent advances in machine learning and computer vision have led to reported facial recognition accuracies surpassing human performance. We question if these systems will translate to real-world forensic scenarios in which a potentially low-resolution, low-quality, partially-occluded image is compared against a standard facial database. We describe the construction of a large-scale synthetic facial dataset along with a controlled facial forensic lineup, the combination of which allows for a controlled evaluation of facial recognition under a range of real-world conditions. Using this synthetic dataset, and a popular dataset of real faces, we evaluate the accuracy of two popular neural-based recognition systems. We find that previously reported face recognition accuracies of more than 95% drop to as low as 65% in this more challenging forensic scenario.

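Summary
Recent advances in machine learning and computer vision have led to reported facial recognition accuracies that surpass human performance. The authors ask whether these systems carry over to real-world forensic scenarios, in which a potentially low-resolution, low-quality, partially occluded image is compared against a standard facial database. They describe the construction of a large-scale synthetic facial dataset and a controlled facial forensic lineup, which together allow a controlled evaluation of facial recognition under a range of real-world conditions. Using this synthetic dataset and a popular dataset of real faces, they evaluate the accuracy of two popular neural-based recognition systems. Previously reported accuracies of more than 95% drop to as low as 65% in this more challenging forensic scenario.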

Federated Learning Across Decentralized and Unshared Archives for Remote Sensing Image Classification

paper_url: http://arxiv.org/abs/2311.06141
repo_url: None
paper_authors: Barış Büyüktaş, Gencer Sumbul, Begüm Demir
for: This paper explores the potential of federated learning (FL) in remote sensing (RS) and compares state-of-the-art FL algorithms for image classification.
methods: The authors review FL algorithms from the computer vision community, select several state-of-the-art ones based on their effectiveness under non-IID training data, and compare them theoretically in terms of local training complexity, aggregation complexity, learning efficiency, communication cost, and scalability in the number of clients.
results: Experimental analyses compare the selected algorithms under different decentralization scenarios in terms of multi-label classification performance. Based on these analyses, the authors derive a guideline for selecting suitable FL algorithms in RS.

Abstract
Federated learning (FL) enables the collaboration of multiple deep learning models to learn from decentralized data archives (i.e., clients) without accessing data on clients. Although FL offers ample opportunities in knowledge discovery from distributed image archives, it is seldom considered in remote sensing (RS). In this paper, for the first time in RS, we present a comparative study of state-of-the-art FL algorithms. To this end, we initially provide a systematic review of the FL algorithms presented in the computer vision community for image classification problems, and select several state-of-the-art FL algorithms based on their effectiveness with respect to training data heterogeneity across clients (known as non-IID data). After presenting an extensive overview of the selected algorithms, a theoretical comparison of the algorithms is conducted based on their: 1) local training complexity; 2) aggregation complexity; 3) learning efficiency; 4) communication cost; and 5) scalability in terms of number of clients. As the classification task, we consider the multi-label classification (MLC) problem since RS images typically consist of multiple classes, and thus can simultaneously be associated with multiple labels. After the theoretical comparison, experimental analyses are presented to compare them under different decentralization scenarios in terms of MLC performance. Based on our comprehensive analyses, we finally derive a guideline for selecting suitable FL algorithms in RS. The code of this work will be publicly available at https://git.tu-berlin.de/rsim/FL-RS.

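Summary
Federated learning (FL) allows multiple deep learning models to learn from decentralized data archives (i.e., clients) without accessing the data held on those clients. Although FL offers ample opportunities for knowledge discovery from distributed image archives, it has seldom been considered in remote sensing (RS). This paper presents, for the first time in RS, a comparative study of state-of-the-art FL algorithms. The authors first give a systematic review of the FL algorithms proposed in the computer vision community for image classification, and select several state-of-the-art algorithms based on their effectiveness under training data heterogeneity across clients (non-IID data). They then compare the selected algorithms theoretically with respect to: 1) local training complexity; 2) aggregation complexity; 3) learning efficiency; 4) communication cost; and 5) scalability in the number of clients. The classification task is multi-label classification (MLC), because RS images typically contain multiple classes and can therefore carry several labels at once. After the theoretical comparison, experiments compare MLC performance under different decentralization scenarios. From these analyses the authors derive a guideline for selecting suitable FL algorithms in RS. The code will be made publicly available at https://git.tu-berlin.de/rsim/FL-RS.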
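
The aggregation step that federated learning relies on can be illustrated with a minimal sketch of federated averaging, a widely used baseline in which the server averages client parameters weighted by local dataset size. This is an assumption for illustration only: the abstract does not state which algorithms were compared, and the function name `fedavg` and the flat-list parameter layout are hypothetical.

```python
from typing import List


def fedavg(client_weights: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Hypothetical helper: each client sends a flat list of parameters and the
    number of local training samples; raw data never leaves the client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i in range(dim):
            aggregated[i] += weights[i] * size / total
    return aggregated


# Two clients; the second holds three times as much data as the first.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Under non-IID data, the client vectors being averaged can diverge considerably, which is the kind of heterogeneity the abstract uses as its selection criterion.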

MonoProb: Self-Supervised Monocular Depth Estimation with Interpretable Uncertainty

paper_url: http://arxiv.org/abs/2311.06137
repo_url: https://github.com/cea-list/monoprob
paper_authors: Rémi Marsal, Florian Chabot, Angelique Loesch, William Grolleau, Hichem Sahbi
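for: This paper targets critical applications such as environment analysis for autonomous vehicles.
methods: It proposes MonoProb, a new unsupervised monocular depth estimation method that returns, in a single forward pass, an interpretable uncertainty reflecting the expected error of the network's depth predictions.
results: Experiments show improvements on depth and uncertainty metrics. The model provides a depth prediction and a measure of its confidence without increasing inference time.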

Abstract
Self-supervised monocular depth estimation methods aim to be used in critical applications such as autonomous vehicles for environment analysis. To circumvent the potential imperfections of these approaches, a quantification of the prediction confidence is crucial to guide decision-making systems that rely on depth estimation. In this paper, we propose MonoProb, a new unsupervised monocular depth estimation method that returns an interpretable uncertainty, which means that the uncertainty reflects the expected error of the network in its depth predictions. We rethink the stereo or the structure-from-motion paradigms used to train unsupervised monocular depth models as a probabilistic problem. Within a single forward pass inference, this model provides a depth prediction and a measure of its confidence, without increasing the inference time. We then improve the performance on depth and uncertainty with a novel self-distillation loss for which a student is supervised by a pseudo ground truth that is a probability distribution on depth output by a teacher. To quantify the performance of our models we design new metrics that, unlike traditional ones, measure the absolute performance of uncertainty predictions. Our experiments highlight enhancements achieved by our method on standard depth and uncertainty metrics as well as on our tailored metrics. https://github.com/CEA-LIST/MonoProb

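Summary
Self-supervised monocular depth estimation methods are intended for critical applications such as environment analysis for autonomous vehicles. Because these approaches can be imperfect, quantifying prediction confidence is crucial for guiding decision-making systems that rely on depth estimation. The paper proposes MonoProb, a new unsupervised monocular depth estimation method that returns an interpretable uncertainty, meaning that the uncertainty reflects the expected error of the network's depth predictions. The authors recast the stereo and structure-from-motion paradigms used to train unsupervised monocular depth models as a probabilistic problem. In a single forward pass, the model provides a depth prediction and a measure of its confidence without increasing inference time. Depth and uncertainty performance are further improved with a novel self-distillation loss, in which a student is supervised by a pseudo ground truth that is a probability distribution over depth output by a teacher. To quantify performance, the authors design new metrics that, unlike traditional ones, measure the absolute performance of uncertainty predictions. Experiments show improvements on standard depth and uncertainty metrics as well as on the tailored metrics.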

Fight Fire with Fire: Combating Adversarial Patch Attacks using Pattern-randomized Defensive Patches

paper_url: http://arxiv.org/abs/2311.06122
repo_url: None
paper_authors: Jianan Feng, Jiachun Li, Changqing Miao, Jianjun Huang, Wei You, Wenchang Shi, Bin Liang
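for: Defending object detectors against adversarial patch attacks.
methods: An active defense strategy that injects two types of defensive patches, canary and woodpecker, into the input without altering the target model.
results: Canary and woodpecker achieve high performance even against unknown attack methods, with limited time overhead, and are sufficiently robust against defense-aware attacks.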

Abstract
Object detection has found extensive applications in various tasks, but it is also susceptible to adversarial patch attacks. Existing defense methods often necessitate modifications to the target model or result in unacceptable time overhead. In this paper, we adopt a counterattack approach, following the principle of "fight fire with fire," and propose a novel and general methodology for defending adversarial attacks. We utilize an active defense strategy by injecting two types of defensive patches, canary and woodpecker, into the input to proactively probe or weaken potential adversarial patches without altering the target model. Moreover, inspired by randomization techniques employed in software security, we employ randomized canary and woodpecker injection patterns to defend against defense-aware attacks. The effectiveness and practicality of the proposed method are demonstrated through comprehensive experiments. The results illustrate that canary and woodpecker achieve high performance, even when confronted with unknown attack methods, while incurring limited time overhead. Furthermore, our method also exhibits sufficient robustness against defense-aware attacks, as evidenced by adaptive attack experiments.

Exploring the Efficacy of Base Data Augmentation Methods in Deep Learning-Based Radiograph Classification of Knee Joint Osteoarthritis

paper_url: http://arxiv.org/abs/2311.06118
repo_url: None
paper_authors: Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Timo Ojala
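for: This study addresses the diagnosis of knee joint osteoarthritis (KOA), a major cause of disability worldwide.
methods: It applies deep learning to KOA classification from radiographs and compares data augmentation methods, including adversarial augmentations, as a way to increase data variability.
results: Some augmentation techniques improved classification performance, while other commonly used ones underperformed. Adversarial augmentation revealed potential confounding regions in the images: the models classified KL0 and KL4 grades accurately even with the knee joint omitted, which suggests that they may rely on unrelated features.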

Abstract
Diagnosing knee joint osteoarthritis (KOA), a major cause of disability worldwide, is challenging due to subtle radiographic indicators and the varied progression of the disease. Using deep learning for KOA diagnosis requires broad, comprehensive datasets. However, obtaining these datasets poses significant challenges due to patient privacy concerns and data collection restrictions. Additive data augmentation, which enhances data variability, emerges as a promising solution. Yet, it's unclear which augmentation techniques are most effective for KOA. This study explored various data augmentation methods, including adversarial augmentations, and their impact on KOA classification model performance. While some techniques improved performance, others commonly used underperformed. We identified potential confounding regions within the images using adversarial augmentation. This was evidenced by our models' ability to classify KL0 and KL4 grades accurately, with the knee joint omitted. This observation suggested a model bias, which might leverage unrelated features for classification currently present in radiographs. Interestingly, removing the knee joint also led to an unexpected improvement in KL1 classification accuracy. To better visualize these paradoxical effects, we employed Grad-CAM, highlighting the associated regions. Our study underscores the need for careful technique selection for improved model performance and identifying and managing potential confounding regions in radiographic KOA deep learning.

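Summary
Diagnosing knee joint osteoarthritis (KOA) is challenging because the radiographic indicators are subtle and the disease progresses in varied ways. Deep learning for KOA diagnosis requires broad, comprehensive datasets, but these are hard to obtain because of patient privacy concerns and data collection restrictions. Data augmentation is a promising solution, yet it is unclear which augmentation techniques work best for KOA. This study explores various data augmentation methods, including adversarial augmentations, and their effect on KOA classification performance. Some techniques improved performance, while other commonly used ones underperformed. Using adversarial augmentation, the authors identified potential confounding regions in the images: the models classified KL0 and KL4 grades accurately with the knee joint omitted. This suggests a model bias that may exploit unrelated features present in radiographs. Unexpectedly, removing the knee joint also improved KL1 classification accuracy. To visualize these paradoxical effects, the authors used Grad-CAM to highlight the associated regions. The study underscores the need for careful technique selection and for identifying and managing potential confounding regions in deep learning for radiographic KOA.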

Dual input stream transformer for eye-tracking line assignment

paper_url: http://arxiv.org/abs/2311.06095
repo_url: None
paper_authors: Thomas M. Mercier, Marcin Budka, Martin R. Vasilev, Julie A. Kirkby, Bernhard Angele, Timothy J. Slattery
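for: This study addresses the problem of assigning fixation points in eye-tracking reading data to the line of text the reader was actually focused on.
methods: It proposes a Dual Input Stream Transformer (DIST) and combines multiple instances of the DIST model in an ensemble to improve line assignment accuracy.
results: Compared against nine classical approaches on nine diverse datasets, DIST performs best; the ensemble reaches an average accuracy of 98.5%.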

Abstract
We introduce a novel Dual Input Stream Transformer (DIST) for the challenging problem of assigning fixation points from eye-tracking data collected during passage reading to the line of text that the reader was actually focused on. This post-processing step is crucial for analysis of the reading data due to the presence of noise in the form of vertical drift. We evaluate DIST against nine classical approaches on a comprehensive suite of nine diverse datasets, and demonstrate DIST's superiority. By combining multiple instances of the DIST model in an ensemble we achieve an average accuracy of 98.5% across all datasets. Our approach presents a significant step towards addressing the bottleneck of manual line assignment in reading research. Through extensive model analysis and ablation studies, we identify key factors that contribute to DIST's success, including the incorporation of line overlap features and the use of a second input stream. Through evaluation on a set of diverse datasets we demonstrate that DIST is robust to various experimental setups, making it a safe first choice for practitioners in the field.

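Summary
The paper introduces a new Dual Input Stream Transformer (DIST) for assigning fixation points from eye-tracking data, collected during passage reading, to the line of text the reader was actually focused on. This post-processing step is crucial for analysing reading data because of noise in the form of vertical drift. DIST is evaluated against nine classical approaches on nine diverse datasets and outperforms them. Combining multiple DIST models in an ensemble gives an average accuracy of 98.5% across all datasets. The approach is a significant step towards removing the bottleneck of manual line assignment in reading research. Through extensive model analysis and ablation studies, the authors identify key factors behind DIST's success, including line overlap features and the use of a second input stream. Evaluation on diverse datasets shows that DIST is robust to various experimental setups, making it a safe first choice for practitioners.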
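
To make the line assignment task concrete, the sketch below assigns each fixation to the text line whose vertical position is closest. This is a deliberately naive baseline written for illustration, not DIST and not necessarily one of the nine classical approaches in the paper; the function name and the pixel coordinates are assumptions.

```python
from typing import List


def assign_to_nearest_line(fixation_ys: List[float], line_ys: List[float]) -> List[int]:
    """Return, for each fixation, the index of the vertically closest text line."""
    assignments = []
    for y in fixation_ys:
        nearest = min(range(len(line_ys)), key=lambda i: abs(line_ys[i] - y))
        assignments.append(nearest)
    return assignments


# Three lines of text at y = 100, 150, 200 (hypothetical pixel coordinates).
lines = [100.0, 150.0, 200.0]
fixations = [98.0, 130.0, 160.0, 210.0]
print(assign_to_nearest_line(fixations, lines))  # [0, 1, 1, 2]
```

A rule like this breaks down once vertical drift moves fixations closer to a neighbouring line than to the line actually being read, which is the noise the abstract describes.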

Enhancing Rock Image Segmentation in Digital Rock Physics: A Fusion of Generative AI and State-of-the-Art Neural Networks

paper_url: http://arxiv.org/abs/2311.06079
repo_url: None
paper_authors: Zhaoyang Ma, Xupeng He, Hyung Kwak, Jun Gao, Shuyu Sun, Bicheng Yan
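for: Improve the accuracy and robustness of rock microstructure segmentation in digital rock physics using a generative AI model and deep neural networks.
methods: A diffusion model generates a large dataset of CT/SEM and binary segmentation pairs from a small initial dataset; three neural networks (U-Net, Attention-U-net, and TransUNet) are then evaluated for segmentation.
results: Combining the diffusion model with advanced neural networks improves segmentation accuracy and robustness and reduces the need for expert-annotated data. TransU-Net achieves the best segmentation accuracy and IoU of the three networks.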

Abstract
In digital rock physics, analysing microstructures from CT and SEM scans is crucial for estimating properties like porosity and pore connectivity. Traditional segmentation methods like thresholding and CNNs often fall short in accurately detailing rock microstructures and are prone to noise. U-Net improved segmentation accuracy but required many expert-annotated samples, a laborious and error-prone process due to complex pore shapes. Our study employed an advanced generative AI model, the diffusion model, to overcome these limitations. This model generated a vast dataset of CT/SEM and binary segmentation pairs from a small initial dataset. We assessed the efficacy of three neural networks: U-Net, Attention-U-net, and TransUNet, for segmenting these enhanced images. The diffusion model proved to be an effective data augmentation technique, improving the generalization and robustness of deep learning models. TransU-Net, incorporating Transformer structures, demonstrated superior segmentation accuracy and IoU metrics, outperforming both U-Net and Attention-U-net. Our research advances rock image segmentation by combining the diffusion model with cutting-edge neural networks, reducing dependency on extensive expert data and boosting segmentation accuracy and robustness. TransU-Net sets a new standard in digital rock physics, paving the way for future geoscience and engineering breakthroughs.

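Summary
In digital rock physics, analysing microstructures from CT and SEM scans is crucial for estimating properties such as porosity and pore connectivity. Traditional segmentation methods such as thresholding and CNNs often fail to capture rock microstructures accurately and are prone to noise. U-Net improved segmentation accuracy but requires many expert-annotated samples, which are laborious and error-prone to produce because pore shapes are complex. The study uses a generative AI model, the diffusion model, to overcome these limitations. The model generates a large dataset of CT/SEM and binary segmentation pairs from a small initial dataset. The authors evaluate three neural networks, U-Net, Attention-U-net, and TransUNet, for segmenting these images. The diffusion model proves to be an effective data augmentation technique that improves the generalization and robustness of deep learning models. TransU-Net, which incorporates Transformer structures, achieves better segmentation accuracy and IoU than both U-Net and Attention-U-net. The work advances rock image segmentation by combining the diffusion model with recent neural networks, reducing the dependency on extensive expert data while improving accuracy and robustness.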
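
The IoU metric mentioned above can be sketched for binary masks as follows. This is a minimal illustration assuming masks stored as nested lists of 0/1 values (for example 1 = pore, 0 = solid); the paper's actual evaluation code and mask conventions are not given in the abstract.

```python
from typing import List


def binary_iou(pred: List[List[int]], target: List[List[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
print(binary_iou(pred, target))  # one shared pixel out of three covered pixels
```

Unlike plain pixel accuracy, IoU ignores pixels that both masks mark as background, so it is less flattering when the foreground class is small.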

Learning-Based Biharmonic Augmentation for Point Cloud Classification

paper_url: http://arxiv.org/abs/2311.06070
repo_url: None
paper_authors: Jiacheng Wei, Guosheng Lin, Henghui Ding, Jie Hu, Kim-Hui Yap
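for: Increase the diversity of point cloud datasets, which often have too few samples, through better data augmentation.
methods: The paper proposes Biharmonic Augmentation (BA), which diversifies data by applying smooth non-rigid deformations to existing 3D structures. A CoefNet predicts coefficients that combine several learned deformation prototypes.
results: Experiments show that Biharmonic Augmentation gives notable performance improvements over prevailing point cloud augmentation techniques across varied network designs.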

Abstract
Point cloud datasets often suffer from inadequate sample sizes in comparison to image datasets, making data augmentation challenging. While traditional methods, like rigid transformations and scaling, have limited potential in increasing dataset diversity due to their constraints on altering individual sample shapes, we introduce the Biharmonic Augmentation (BA) method. BA is a novel and efficient data augmentation technique that diversifies point cloud data by imposing smooth non-rigid deformations on existing 3D structures. This approach calculates biharmonic coordinates for the deformation function and learns diverse deformation prototypes. Utilizing a CoefNet, our method predicts coefficients to amalgamate these prototypes, ensuring comprehensive deformation. Moreover, we present AdvTune, an advanced online augmentation system that integrates adversarial training. This system synergistically refines the CoefNet and the classification network, facilitating the automated creation of adaptive shape deformations contingent on the learner status. Comprehensive experimental analysis validates the superiority of Biharmonic Augmentation, showcasing notable performance improvements over prevailing point cloud augmentation techniques across varied network designs.

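Summary
Point cloud datasets often have inadequate sample sizes compared with image datasets, which makes data augmentation challenging. Traditional methods such as rigid transformations and scaling have limited potential to increase dataset diversity because they cannot alter the shape of individual samples. The paper introduces Biharmonic Augmentation (BA), a new and efficient data augmentation technique that diversifies point cloud data by imposing smooth non-rigid deformations on existing 3D structures. The approach computes biharmonic coordinates for the deformation function and learns diverse deformation prototypes. A CoefNet predicts coefficients that combine these prototypes. The authors also present AdvTune, an online augmentation system that integrates adversarial training; it jointly refines the CoefNet and the classification network so that adaptive shape deformations are generated automatically according to the learner's state. Comprehensive experiments show that Biharmonic Augmentation outperforms prevailing point cloud augmentation techniques across varied network designs.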

Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval

paper_url: http://arxiv.org/abs/2311.06067
repo_url: None
paper_authors: Xin Lu, Shikun Chen, Yichao Cao, Xin Zhou, Xiaobo Lu
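for: This study aims to improve hashing methods for large-scale media search so that they support fine-grained image retrieval.
methods: It proposes Attributes Grouping and Mining Hashing (AGMH), which groups and embeds category-specific visual attributes in multiple descriptors to build a comprehensive feature representation. An Attention Dispersion Loss (ADL) and a Stepwise Interactive External Attention (SIEA) module are used to capture subtle details and the correlations between fine-grained attributes and objects.
results: Experiments show that AGMH consistently yields the best performance against state-of-the-art methods on fine-grained benchmark datasets.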

Abstract
In recent years, hashing methods have been popular in the large-scale media search for low storage and strong representation capabilities. To describe objects with similar overall appearance but subtle differences, more and more studies focus on hashing-based fine-grained image retrieval. Existing hashing networks usually generate both local and global features through attention guidance on the same deep activation tensor, which limits the diversity of feature representations. To handle this limitation, we substitute convolutional descriptors for attention-guided features and propose an Attributes Grouping and Mining Hashing (AGMH), which groups and embeds the category-specific visual attributes in multiple descriptors to generate a comprehensive feature representation for efficient fine-grained image retrieval. Specifically, an Attention Dispersion Loss (ADL) is designed to force the descriptors to attend to various local regions and capture diverse subtle details. Moreover, we propose a Stepwise Interactive External Attention (SIEA) to mine critical attributes in each descriptor and construct correlations between fine-grained attributes and objects. The attention mechanism is dedicated to learning discrete attributes, which will not cost additional computations in hash codes generation. Finally, the compact binary codes are learned by preserving pairwise similarities. Experimental results demonstrate that AGMH consistently yields the best performance against state-of-the-art methods on fine-grained benchmark datasets.

Summary
Existing hashing networks usually generate both local and global features through attention guidance on the same deep activation tensor, which limits the diversity of feature representations. To address this limitation, we propose an Attributes Grouping and Mining Hashing (AGMH) method, which groups and embeds category-specific visual attributes in multiple descriptors to generate a comprehensive feature representation for efficient fine-grained image retrieval. We also design an Attention Dispersion Loss (ADL) to force the descriptors to attend to various local regions and capture diverse subtle details. Additionally, we propose a Stepwise Interactive External Attention (SIEA) to mine critical attributes in each descriptor and construct correlations between fine-grained attributes and objects. The attention mechanism is dedicated to learning discrete attributes, which does not require additional computations in hash code generation. Finally, we learn compact binary codes by preserving pairwise similarities. Experimental results show that AGMH consistently outperforms state-of-the-art methods on fine-grained benchmark datasets.
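
How compact binary codes support retrieval can be sketched in a few lines: real-valued features are thresholded into bits, and database items are ranked by Hamming distance to the query code. This shows only the generic retrieval step common to hashing methods, under the assumption of a zero threshold; it does not implement AGMH, ADL, or SIEA, and all names are hypothetical.

```python
from typing import List


def binarize(features: List[float]) -> List[int]:
    """Threshold real-valued features at zero to obtain a binary hash code."""
    return [1 if value > 0 else 0 for value in features]


def hamming(code_a: List[int], code_b: List[int]) -> int:
    """Number of bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_code: List[int], db_codes: List[List[int]], k: int) -> List[int]:
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    return order[:k]


query = binarize([0.3, -1.2, 0.0, 2.0])       # [1, 0, 0, 1]
database = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1]]
print(retrieve(query, database, k=2))          # [0, 2]
```

Storing a few dozen bits per image and comparing them with bit counts is what gives hashing its low storage cost and fast search in large-scale retrieval.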

Lidar-based Norwegian tree species detection using deep learning

paper_url: http://arxiv.org/abs/2311.06066
repo_url: None
paper_authors: Martijn Vermeer, Jacob Alexander Hay, David Völgyes, Zsófia Koma, Johannes Breidenbach, Daniele Stefano Maria Fantin
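for: This study aims to make tree species mapping in Norwegian forests more efficient, using only freely available lidar data for classification.
methods: A deep learning tree species classification model segments lidar images with a U-Net based network and is trained with focal loss over partial weak labels.
results: The model achieves a macro-averaged F1 score of 0.70 on an independent validation, close to but below the performance of aerial, or combined aerial and lidar, models.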

Abstract
Background: The mapping of tree species within Norwegian forests is a time-consuming process, involving forest associations relying on manual labeling by experts. The process can involve both aerial imagery, personal familiarity, or on-scene references, and remote sensing data. The state-of-the-art methods usually use high resolution aerial imagery with semantic segmentation methods. Methods: We present a deep learning based tree species classification model utilizing only lidar (Light Detection And Ranging) data. The lidar images are segmented into four classes (Norway Spruce, Scots Pine, Birch, background) with a U-Net based network. The model is trained with focal loss over partial weak labels. A major benefit of the approach is that both the lidar imagery and the base map for the labels have free and open access. Results: Our tree species classification model achieves a macro-averaged F1 score of 0.70 on an independent validation with National Forest Inventory (NFI) in-situ sample plots. That is close to, but below the performance of aerial, or aerial and lidar combined models.

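Summary
Background: Mapping tree species in Norwegian forests is a time-consuming process in which forest associations rely on manual labeling by experts. The process can draw on aerial imagery, personal familiarity or on-scene references, and remote sensing data. State-of-the-art methods usually apply semantic segmentation to high-resolution aerial imagery. Methods: The authors present a deep learning based tree species classification model that uses only lidar (Light Detection And Ranging) data. The lidar images are segmented into four classes (Norway Spruce, Scots Pine, Birch, background) with a U-Net based network. The model is trained with focal loss over partial weak labels. A major benefit of the approach is that both the lidar imagery and the base map used for the labels are free and openly accessible. Results: The model achieves a macro-averaged F1 score of 0.70 on an independent validation with National Forest Inventory (NFI) in-situ sample plots. That is close to, but below, the performance of aerial, or combined aerial and lidar, models.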
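
Training with focal loss over partial labels can be sketched as below. The sketch uses the standard multi-class focal loss, FL = -(1 - p_t)^gamma * log(p_t), and skips unlabeled pixels. The value gamma = 2.0, the `ignore_label` convention of -1, and the omission of class weights are assumptions, since the abstract does not give the exact formulation.

```python
import math
from typing import List


def focal_loss(probs: List[List[float]], labels: List[int],
               gamma: float = 2.0, ignore_label: int = -1) -> float:
    """Mean focal loss over labeled pixels.

    probs[i] is the predicted class distribution for pixel i and labels[i]
    its class index, or ignore_label where no (weak) label is available.
    """
    total = 0.0
    count = 0
    for pixel_probs, label in zip(probs, labels):
        if label == ignore_label:
            continue  # partial labels: unlabeled pixels contribute nothing
        p_t = max(pixel_probs[label], 1e-12)  # guard against log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
        count += 1
    return total / count if count else 0.0


# One labeled pixel and one unlabeled pixel.
print(focal_loss([[0.5, 0.5], [0.9, 0.1]], [0, -1]))
```

With gamma = 0 the expression reduces to ordinary cross-entropy; larger gamma values down-weight pixels the model already classifies confidently, which helps when one class such as background dominates.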

Improved Positional Encoding for Implicit Neural Representation based Compact Data Representation

paper_url: http://arxiv.org/abs/2311.06059
repo_url: None
paper_authors: Bharath Bhushan Damodaran, Francois Schnitzler, Anne Lambert, Pierre Hellier
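for: Improve the reconstruction quality of implicit neural representations (INR).
methods: Positional encodings capture high-frequency information in INR; the paper proposes a new positional encoding method with a greater number of frequency bases than existing methods, which benefits compact data representation.
results: Experiments show a significant gain in rate-distortion performance in the compression task without additional complexity, and higher reconstruction quality in novel view synthesis.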

Abstract
Positional encodings are employed to capture the high frequency information of the encoded signals in implicit neural representation (INR). In this paper, we propose a novel positional encoding method which improves the reconstruction quality of the INR. The proposed embedding method is more advantageous for the compact data representation because it has a greater number of frequency bases than the existing methods. Our experiments show that the proposed method achieves significant gain in the rate-distortion performance without introducing any additional complexity in the compression task and higher reconstruction quality in novel view synthesis.

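Summary
Positional encodings are used to capture the high-frequency information of encoded signals in implicit neural representation (INR). The paper proposes a new positional encoding method that improves the reconstruction quality of the INR. The proposed embedding has a greater number of frequency bases than existing methods, which makes it better suited to compact data representation. Experiments show a significant gain in rate-distortion performance in the compression task without additional complexity, and higher reconstruction quality in novel view synthesis.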
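
For context, the sketch below shows the widely used sinusoidal (Fourier feature) positional encoding that INR models typically start from: each coordinate is mapped to sines and cosines at octave-spaced frequencies. It illustrates the kind of existing method the paper improves on, not the proposed encoding, whose form the abstract does not describe; the frequency schedule 2^k * pi is an assumption taken from common practice.

```python
import math
from typing import List


def positional_encoding(x: float, num_frequencies: int) -> List[float]:
    """Map a scalar coordinate to [sin(2^k*pi*x), cos(2^k*pi*x)] for k = 0..L-1."""
    encoded = []
    for k in range(num_frequencies):
        frequency = (2.0 ** k) * math.pi
        encoded.append(math.sin(frequency * x))
        encoded.append(math.cos(frequency * x))
    return encoded


features = positional_encoding(0.3, num_frequencies=4)
print(len(features))  # 8 values: one sine and one cosine per frequency
```

Each added frequency contributes two basis functions per coordinate, so the number of frequency bases directly controls how much high-frequency detail the downstream network can represent.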

Ulcerative Colitis Mayo Endoscopic Scoring Classification with Active Learning and Generative Data Augmentation

paper_url: http://arxiv.org/abs/2311.06057
repo_url: None
paper_authors: Ümit Mert Çağlar, Alperen İnci, Oğuz Hanoğlu, Görkem Polat, Alptekin Temizel
for: This paper aims to improve the accuracy of endoscopic image analysis for Ulcerative Colitis (UC) diagnosis and severity classification by using active learning and generative augmentation methods.
methods: The proposed method generates a large number of synthetic samples from a small dataset of real endoscopic images, then uses active learning to select the most informative samples for training a classifier.
results: The method improved classification performance compared to using only the original labeled examples, with the QWK score rising from 68.1% to 74.5%. Reaching equivalent performance with real data alone required three times as many images.

Abstract
Endoscopic imaging is commonly used to diagnose Ulcerative Colitis (UC) and classify its severity. It has been shown that deep learning based methods are effective in automated analysis of these images and can potentially be used to aid medical doctors. Unleashing the full potential of these methods depends on the availability of a large number of labeled images; however, obtaining and labeling these images is quite challenging. In this paper, we propose an active learning based generative augmentation method. The method involves generating a large number of synthetic samples by training using a small dataset consisting of real endoscopic images. The resulting data pool is narrowed down by using active learning methods to select the most informative samples, which are then used to train a classifier. We demonstrate the effectiveness of our method through experiments on a publicly available endoscopic image dataset. The results show that using synthesized samples in conjunction with active learning leads to improved classification performance compared to using only the original labeled examples, and the baseline classification performance of 68.1% increases to 74.5% in terms of Quadratic Weighted Kappa (QWK) score. Another observation is that attaining equivalent performance using only real data required three times as many images.

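Summary
Endoscopic imaging is commonly used to diagnose Ulcerative Colitis (UC) and assess its severity. Deep learning methods have been shown to analyse these images effectively and could help medical doctors, but obtaining and labeling enough images is difficult. The paper proposes an active learning based generative augmentation method. A large number of synthetic samples is generated by training on a small dataset of real endoscopic images. Active learning methods then narrow the pool down to the most informative samples, which are used to train a classifier. Experiments on a publicly available endoscopic image dataset show that combining synthesized samples with active learning raises classification performance from a baseline of 68.1% to 74.5% in Quadratic Weighted Kappa (QWK) score. Reaching equivalent performance with real data alone required three times as many images.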
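
The Quadratic Weighted Kappa score reported above can be computed as in the following sketch, which uses the standard definition: one minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights (i - j)^2 / (N - 1)^2. The four-class setting (Mayo scores 0 to 3) is an assumption based on the title; the evaluation code used in the paper is not part of the abstract.

```python
from typing import List


def quadratic_weighted_kappa(y_true: List[int], y_pred: List[int], num_classes: int) -> float:
    """Agreement between two ordinal ratings, penalising large disagreements more."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent marginals.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n
    if weighted_expected == 0.0:
        return 1.0  # all ratings fall in a single class
    return 1.0 - weighted_observed / weighted_expected


# Four images, one of which is predicted one grade too low.
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], num_classes=4))  # 0.875
```

Because the weights grow quadratically with the distance between grades, confusing adjacent severity levels costs far less than confusing the lowest grade with the highest, which suits ordinal clinical scores.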

Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

paper_url: http://arxiv.org/abs/2311.06056
repo_url: None
paper_authors: Ziye Fang, Xin Jiang, Hao Tang, Zechao Li
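for: This study proposes CSDNet, a framework for Ultra-Fine-Grained Visual Categorization (Ultra-FGVC), a multimedia analysis task with complex category subdivisions and limited data per category.
methods: CSDNet consists of three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which together improve the generalization of deep models across the instance, feature, and logit prediction levels. SSDP adds augmented samples from different viewpoints to highlight subcategory-specific discrepancies; DDL stores historical intermediate features in a dynamic memory queue and optimizes the feature learning space through iterative contrastive learning; SSDT applies a new self-distillation paradigm at the logit prediction level to absorb subcategory-specific discrepancy knowledge.
results: Experiments show that CSDNet outperforms current state-of-the-art Ultra-FGVC methods, indicating that it handles complex category subdivisions and limited data effectively.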

Abstract
In the field of intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays a vital role in distinguishing intricate subcategories within broader categories. However, this task is inherently challenging due to the complex granularity of category subdivisions and the limited availability of data for each category. To address these challenges, this work proposes CSDNet, a pioneering framework that effectively explores contrastive learning and self-distillation to learn discriminative representations specifically designed for Ultra-FGVC tasks. CSDNet comprises three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which collectively enhance the generalization of deep models across instance, feature, and logit prediction levels. To increase the diversity of training samples, the SSDP module introduces augmented samples from different viewpoints to spotlight subcategory-specific discrepancies. Simultaneously, the proposed DDL module stores historical intermediate features by a dynamic memory queue, which optimizes the feature learning space through iterative contrastive learning. Furthermore, the SSDT module is developed by a novel self-distillation paradigm at the logit prediction level of raw and augmented samples, which effectively distills more subcategory-specific discrepancies knowledge from the inherent structure of limited training data without requiring additional annotations. Experimental results demonstrate that CSDNet outperforms current state-of-the-art Ultra-FGVC methods, emphasizing its powerful efficacy and adaptability in addressing Ultra-FGVC tasks.

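Summary
In intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays an important role in distinguishing intricate subcategories within broader categories. The task is hard because the category subdivisions are very fine and little data is available per category. To address these challenges, the paper proposes CSDNet, a framework that combines contrastive learning and self-distillation to learn discriminative representations for Ultra-FGVC. CSDNet comprises three main modules, Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which together improve generalization across the instance, feature, and logit prediction levels. To diversify the training samples, SSDP introduces augmented samples from different viewpoints that highlight subcategory-specific discrepancies. DDL stores historical intermediate features in a dynamic memory queue and optimizes the feature learning space through iterative contrastive learning. SSDT applies a new self-distillation scheme at the logit prediction level of raw and augmented samples, extracting more subcategory-specific discrepancy knowledge from the limited training data without extra annotations. Experiments show that CSDNet outperforms current state-of-the-art Ultra-FGVC methods.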
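
The memory queue and contrastive objective described for the DDL module can be illustrated with a minimal sketch. It assumes a fixed-length FIFO queue of past features used as negatives and an InfoNCE-style loss; the queue size, the loss form, the temperature, and every name below are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive candidate."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


# FIFO memory of past features; the oldest entry is dropped once it is full.
memory = deque(maxlen=4)
for feat in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.6, 0.8]):
    memory.append(feat)

query = [1.0, 0.1]
loss_close = info_nce(query, [0.9, 0.2], list(memory))   # similar positive
loss_far = info_nce(query, [-0.9, 0.2], list(memory))    # dissimilar positive
```

The loss is small when the positive view is close to the query and larger when it is not, which is the signal a contrastive objective of this kind would use to shape the feature space.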

Refining the ONCE Benchmark with Hyperparameter Tuning

paper_url: http://arxiv.org/abs/2311.06054
repo_url: None
paper_authors: Maksim Golyadkin, Alexander Gambashidze, Ildar Nurgaliev, Ilya Makarov
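for: This study evaluates semi-supervised learning methods on point cloud data.
methods: The study examines semi-supervised methods, which aim to automate the annotation of 3D point cloud data, and compares them with a supervised model tuned by simple grid search over hyperparameters.
results: Simple grid-search hyperparameter tuning of a supervised model reaches state-of-the-art performance on the ONCE dataset, while the contribution of unlabelled data is comparatively small.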

Abstract
In response to the growing demand for 3D object detection in applications such as autonomous driving, robotics, and augmented reality, this work focuses on the evaluation of semi-supervised learning approaches for point cloud data. The point cloud representation provides reliable and consistent observations regardless of lighting conditions, thanks to advances in LiDAR sensors. Data annotation is of paramount importance in the context of LiDAR applications, and automating 3D data annotation with semi-supervised methods is a pivotal challenge that promises to reduce the associated workload and facilitate the emergence of cost-effective LiDAR solutions. Nevertheless, the task of semi-supervised learning in the context of unordered point cloud data remains formidable due to the inherent sparsity and incomplete shapes that hinder the generation of accurate pseudo-labels. In this study, we consider these challenges by posing the question: "To what extent does unlabelled data contribute to the enhancement of model performance?" We show that improvements from previous semi-supervised methods may not be as profound as previously thought. Our results suggest that simple grid search hyperparameter tuning applied to a supervised model can lead to state-of-the-art performance on the ONCE dataset, while the contribution of unlabelled data appears to be comparatively less exceptional.

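Summary
This study evaluates semi-supervised learning methods for 3D object detection, with a focus on LiDAR-based applications. The point cloud representation provides reliable and consistent observations regardless of lighting conditions. Data annotation is crucial in LiDAR applications, and automating 3D annotation with semi-supervised methods is a key challenge that could reduce the workload and help cost-effective LiDAR solutions emerge. Semi-supervised learning on unordered point clouds remains hard, however, because sparsity and incomplete shapes hinder the generation of accurate pseudo-labels. The study asks: "To what extent does unlabelled data contribute to the enhancement of model performance?" It finds that the gains of earlier semi-supervised methods may be smaller than previously thought: simple grid-search hyperparameter tuning of a supervised model yields state-of-the-art performance on the ONCE dataset, while unlabelled data contributes comparatively little.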
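
As a rough illustration of the "simple grid search" mentioned above, the sketch below tries every combination in a small search space and keeps the best-scoring one. The search space, the hyperparameter names, and the scoring function are placeholders; in practice the score would come from training and validating a detector, which is not shown here.

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Try every combination in search_space; return (best_config, best_score)."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_score(config):
    """Hypothetical stand-in for 'train the model, return a validation metric'."""
    return -(config["lr"] - 0.003) ** 2 * 1e4 - abs(config["weight_decay"] - 0.01)


space = {"lr": [0.001, 0.003, 0.01], "weight_decay": [0.0, 0.01, 0.1]}
best_config, best_score = grid_search(space, toy_validation_score)
```

Grid search costs one full evaluation per combination (nine here), so it is practical only for small search spaces.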

2D Image head pose estimation via latent space regression under occlusion settings

paper_url: http://arxiv.org/abs/2311.06038
repo_url: https://github.com/sipg-isr/Occlusion_HPE
paper_authors: José Celestino, Manuel Marques, Jacinto C. Nascimento, João Paulo Costeira
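for: The study aims to improve head pose estimation accuracy in occluded scenarios and so make human-robot interaction more reliable.
methods: It proposes a deep learning approach based on latent space regression, which better structures the problem for occluded scenarios.
results: The method surpasses several state-of-the-art methods on occluded data and achieves similar accuracy on non-occluded data; it also performs well in a real application (a human-robot interaction scenario).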

Abstract
Head orientation is a challenging Computer Vision problem that has been extensively researched having a wide variety of applications. However, current state-of-the-art systems still underperform in the presence of occlusions and are unreliable for many task applications in such scenarios. This work proposes a novel deep learning approach for the problem of head pose estimation under occlusions. The strategy is based on latent space regression as a fundamental key to better structure the problem for occluded scenarios. Our model surpasses several state-of-the-art methodologies for occluded HPE, and achieves similar accuracy for non-occluded scenarios. We demonstrate the usefulness of the proposed approach with: (i) two synthetically occluded versions of the BIWI and AFLW2000 datasets, (ii) real-life occlusions of the Pandora dataset, and (iii) a real-life application to human-robot interaction scenarios where face occlusions often occur. Specifically, the autonomous feeding from a robotic arm.

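Summary
Head orientation estimation is a challenging, extensively researched computer vision problem with a wide range of applications. Current state-of-the-art systems still underperform under occlusion and are unreliable for many tasks in such scenarios. This work proposes a new deep learning approach to head pose estimation (HPE) under occlusion, built on latent space regression as the key to better structuring the problem for occluded scenarios. The model surpasses several state-of-the-art methods for occluded HPE and achieves similar accuracy in non-occluded scenarios. Its usefulness is demonstrated with (i) two synthetically occluded versions of the BIWI and AFLW2000 datasets, (ii) real-life occlusions in the Pandora dataset, and (iii) a real-life human-robot interaction application in which face occlusions often occur, namely autonomous feeding from a robotic arm.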

Diagonal Hierarchical Consistency Learning for Semi-supervised Medical Image Segmentation

paper_url: http://arxiv.org/abs/2311.06031
repo_url: None
paper_authors: Heejoon Koo
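for: The paper addresses medical image segmentation, with the goal of supporting more accurate diagnosis and treatment.
methods: The paper proposes a new framework, DiHC-Net, for robust semi-supervised medical image segmentation. DiHC-Net contains multiple sub-models that share the same multi-scale architecture but use different sub-layers, such as up-sampling and normalisation layers. It also proposes a diagonal hierarchical consistency that enforces agreement between these sub-models across scales.
results: Experiments show that DiHC-Net outperforms all previous methods on the public Left Atrium (LA) dataset.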

Abstract
Medical image segmentation, which is essential for many clinical applications, has achieved almost human-level performance via data-driven deep learning techniques. Nevertheless, its performance is predicated on the costly process of manually annotating a large amount of medical images. To this end, we propose a novel framework for robust semi-supervised medical image segmentation using diagonal hierarchical consistency (DiHC-Net). First, it is composed of multiple sub-models with identical multi-scale architecture but with distinct sub-layers, such as up-sampling and normalisation layers. Second, a novel diagonal hierarchical consistency is enforced between one model's intermediate and final prediction and other models' soft pseudo labels in a diagonal hierarchical fashion. Experimental results verify the efficacy of our simple framework, outperforming all previous approaches on public Left Atrium (LA) dataset.

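Summary
Medical image segmentation, essential for many clinical applications, has reached almost human-level performance through data-driven deep learning. That performance, however, depends on the costly manual annotation of large numbers of medical images. The paper therefore proposes DiHC-Net, a framework for robust semi-supervised medical image segmentation based on diagonal hierarchical consistency. First, it is composed of multiple sub-models that share an identical multi-scale architecture but use distinct sub-layers, such as up-sampling and normalisation layers. Second, a diagonal hierarchical consistency is enforced between one model's intermediate and final predictions and the other models' soft pseudo labels. Experiments verify the efficacy of this simple framework, which outperforms all previous approaches on the public Left Atrium (LA) dataset.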

U3DS$^3$: Unsupervised 3D Semantic Scene Segmentation

paper_url: http://arxiv.org/abs/2311.06018
repo_url: None
paper_authors: Jiaxu Liu, Zhengdi Yu, Toby P. Breckon, Hubert P. H. Shum
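for: The paper targets 3D point cloud segmentation without relying on large amounts of annotated training data.
methods: The method uses only the inherent information of the point cloud and needs no model pre-training: it generates superpoints from each scene's geometric characteristics, then learns through spatial clustering and iterative training with pseudo-labels.
results: The method achieves state-of-the-art results on the ScanNet and SemanticKITTI datasets and competitive results on the S3DIS dataset.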

Abstract
Contemporary point cloud segmentation approaches largely rely on richly annotated 3D training data. However, it is both time-consuming and challenging to obtain consistently accurate annotations for such 3D scene data. Moreover, there is still a lack of investigation into fully unsupervised scene segmentation for point clouds, especially for holistic 3D scenes. This paper presents U3DS$^3$, as a step towards completely unsupervised point cloud segmentation for any holistic 3D scenes. To achieve this, U3DS$^3$ leverages a generalized unsupervised segmentation method for both object and background across both indoor and outdoor static 3D point clouds with no requirement for model pre-training, by leveraging only the inherent information of the point cloud to achieve full 3D scene segmentation. The initial step of our proposed approach involves generating superpoints based on the geometric characteristics of each scene. Subsequently, it undergoes a learning process through a spatial clustering-based methodology, followed by iterative training using pseudo-labels generated in accordance with the cluster centroids. Moreover, by leveraging the invariance and equivariance of the volumetric representations, we apply the geometric transformation on voxelized features to provide two sets of descriptors for robust representation learning. Finally, our evaluation provides state-of-the-art results on the ScanNet and SemanticKITTI, and competitive results on the S3DIS, benchmark datasets.

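Summary
Contemporary point cloud segmentation approaches largely rely on richly annotated 3D training data, yet consistently accurate annotations for 3D scenes are time-consuming and challenging to obtain. Fully unsupervised scene segmentation for point clouds, especially for holistic 3D scenes, also remains under-investigated. The paper presents U3DS$^3$ as a step towards completely unsupervised point cloud segmentation for any holistic 3D scene. U3DS$^3$ uses a generalized unsupervised segmentation method that covers both objects and background in indoor and outdoor static 3D point clouds, requires no model pre-training, and relies only on the inherent information of the point cloud. The first step generates superpoints from the geometric characteristics of each scene. Learning then proceeds through spatial clustering, followed by iterative training with pseudo-labels generated from the cluster centroids. In addition, exploiting the invariance and equivariance of volumetric representations, geometric transformations are applied to voxelized features to provide two sets of descriptors for robust representation learning. The evaluation reports state-of-the-art results on ScanNet and SemanticKITTI and competitive results on S3DIS.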
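
The idea of generating pseudo-labels "in accordance with the cluster centroids" can be sketched as a nearest-centroid assignment that alternates with centroid updates. This is a plain k-means-style loop on toy 2-D descriptors, assumed here for illustration only; the actual method trains a network on the pseudo-labels, and its features, distance measure, and update schedule are not specified in the abstract.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def update_centroids(features, labels, centroids):
    """Recompute each centroid as the mean of the features assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, lab in zip(features, labels) if lab == k]
        if not members:  # keep an empty cluster's centroid unchanged
            updated.append(old)
            continue
        updated.append([sum(col) / len(members) for col in zip(*members)])
    return updated


# Toy superpoint features (hypothetical 2-D descriptors) and initial centroids.
features = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.8, 5.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
for _ in range(3):  # alternate pseudo-labelling and centroid refinement
    pseudo_labels = [nearest_centroid(f, centroids) for f in features]
    centroids = update_centroids(features, pseudo_labels, centroids)
```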

Polar-Net: A Clinical-Friendly Model for Alzheimer’s Disease Detection in OCTA Images

paper_url: http://arxiv.org/abs/2311.06009
repo_url: https://github.com/iAaronLau/Polar-Net-Pytorch
paper_authors: Shouyue Liu, Jinkui Hao, Yanwu Xu, Huazhu Fu, Xinyu Guo, Jiang Liu, Yalin Zheng, Yonghuai Liu, Jiong Zhang, Yitian Zhao
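for: Detecting Alzheimer's disease (AD) by imaging the retinal microvasculature.
methods: In place of general deep computer vision methods, the Polar-Net model maps OCTA images from Cartesian to polar coordinates and applies the ETDRS grid-based regional analysis method.
results: Compared with existing methods, Polar-Net performs better on private and public datasets and provides more valuable pathological evidence for the association between retinal microvascular changes and AD.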

Abstract
Optical Coherence Tomography Angiography (OCTA) is a promising tool for detecting Alzheimer's disease (AD) by imaging the retinal microvasculature. Ophthalmologists commonly use region-based analysis, such as the ETDRS grid, to study OCTA image biomarkers and understand the correlation with AD. However, existing studies have used general deep computer vision methods, which present challenges in providing interpretable results and leveraging clinical prior knowledge. To address these challenges, we propose a novel deep-learning framework called Polar-Net. Our approach involves mapping OCTA images from Cartesian coordinates to polar coordinates, which allows for the use of approximate sector convolution and enables the implementation of the ETDRS grid-based regional analysis method commonly used in clinical practice. Furthermore, Polar-Net incorporates clinical prior information of each sector region into the training process, which further enhances its performance. Additionally, our framework adapts to acquire the importance of the corresponding retinal region, which helps researchers and clinicians understand the model's decision-making process in detecting AD and assess its conformity to clinical observations. Through evaluations on private and public datasets, we have demonstrated that Polar-Net outperforms existing state-of-the-art methods and provides more valuable pathological evidence for the association between retinal vascular changes and AD. In addition, we also show that the two innovative modules introduced in our framework have a significant impact on improving overall performance.

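Summary
Optical Coherence Tomography Angiography (OCTA) is a promising tool for detecting Alzheimer's disease (AD) by imaging the retinal microvasculature. Ophthalmologists commonly use region-based analysis, such as the ETDRS grid, to study OCTA image biomarkers and their correlation with AD. Existing studies, however, rely on general deep computer vision methods, which makes it hard to produce interpretable results or to exploit clinical prior knowledge. To address this, the paper proposes a deep learning framework called Polar-Net. It maps OCTA images from Cartesian to polar coordinates, which permits approximate sector convolution and the ETDRS grid-based regional analysis used in clinical practice. Polar-Net also incorporates clinical prior information for each sector region into training, which further improves performance. The framework additionally learns the importance of each retinal region, helping researchers and clinicians understand how the model decides and whether it conforms to clinical observations. Evaluations on private and public datasets show that Polar-Net outperforms existing state-of-the-art methods and provides more valuable pathological evidence for the association between retinal vascular changes and AD. The two new modules are also shown to contribute significantly to overall performance.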
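
The Cartesian-to-polar mapping can be sketched as a resampling of the image onto a (radius, angle) grid. The version below uses nearest-neighbour sampling on a tiny toy image; the sampling scheme, grid sizes, and function name are assumptions for illustration and are not taken from the Polar-Net code.

```python
import math


def to_polar(image, n_radii, n_angles):
    """Resample a square image onto an (r, theta) grid by nearest neighbour.

    Row i of the result corresponds to the i-th radius (0 up to the image
    half-width), column j to the j-th angle. Requires n_radii >= 2.
    """
    size = len(image)
    centre = (size - 1) / 2.0
    max_radius = centre
    polar = []
    for i in range(n_radii):
        r = max_radius * i / (n_radii - 1)
        row = []
        for j in range(n_angles):
            theta = 2.0 * math.pi * j / n_angles
            x = int(round(centre + r * math.cos(theta)))
            y = int(round(centre + r * math.sin(theta)))
            row.append(image[y][x])
        polar.append(row)
    return polar


# 5x5 toy image whose value is the square-ring index around the centre.
image = [[max(abs(x - 2), abs(y - 2)) for x in range(5)] for y in range(5)]
polar = to_polar(image, n_radii=3, n_angles=8)
```

In the polar layout, rings and angular sectors around the image centre become rows and column ranges, so an annular sector turns into a rectangular block that ordinary convolutions can cover.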

Keystroke Verification Challenge (KVC): Biometric and Fairness Benchmark Evaluation

paper_url: http://arxiv.org/abs/2311.06000
repo_url: None
paper_authors: Giuseppe Stragapede, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales, Naser Damer, Julian Fierrez, Javier Ortega-Garcia
for: The study aims to benchmark and advance the performance and fairness of keystroke dynamics biometric verification.
methods: The study uses a new experimental framework and a new fairness metric to evaluate the performance and fairness of keystroke verification systems.
results: The study finds that dropping the analysis of text content from a keystroke verification system keeps performance satisfactory while reducing the risk of privacy leakage.

Abstract
Analyzing keystroke dynamics (KD) for biometric verification has several advantages: it is among the most discriminative behavioral traits; keyboards are among the most common human-computer interfaces, being the primary means for users to enter textual data; its acquisition does not require additional hardware, and its processing is relatively lightweight; and it allows for transparently recognizing subjects. However, the heterogeneity of experimental protocols and metrics, and the limited size of the databases adopted in the literature impede direct comparisons between different systems, thus representing an obstacle in the advancement of keystroke biometrics. To alleviate this aspect, we present a new experimental framework to benchmark KD-based biometric verification performance and fairness based on tweet-long sequences of variable transcript text from over 185,000 subjects, acquired through desktop and mobile keyboards, extracted from the Aalto Keystroke Databases. The framework runs on CodaLab in the form of the Keystroke Verification Challenge (KVC). Moreover, we also introduce a novel fairness metric, the Skewed Impostor Ratio (SIR), to capture inter- and intra-demographic group bias patterns in the verification scores. We demonstrate the usefulness of the proposed framework by employing two state-of-the-art keystroke verification systems, TypeNet and TypeFormer, to compare different sets of input features, achieving a less privacy-invasive system, by discarding the analysis of text content (ASCII codes of the keys pressed) in favor of extended features in the time domain. Our experiments show that this approach allows to maintain satisfactory performance.

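Summary
Analyzing keystroke dynamics (KD) for biometric verification has several advantages: it is among the most discriminative behavioral traits; keyboards are among the most common human-computer interfaces and the primary means of entering text; acquisition needs no additional hardware and processing is relatively lightweight; and subjects can be recognized transparently. However, heterogeneous experimental protocols and metrics, together with the small databases used in the literature, impede direct comparison between systems and hold back progress in keystroke biometrics. To address this, the paper presents a new experimental framework for benchmarking KD-based biometric verification performance and fairness, based on tweet-long sequences of variable transcript text from over 185,000 subjects, acquired through desktop and mobile keyboards and extracted from the Aalto Keystroke Databases. The framework runs on CodaLab as the Keystroke Verification Challenge (KVC). The paper also introduces a new fairness metric, the Skewed Impostor Ratio (SIR), which captures inter- and intra-demographic group bias patterns in the verification scores. Two state-of-the-art keystroke verification systems, TypeNet and TypeFormer, are used to compare different sets of input features; discarding the analysis of text content (the ASCII codes of the keys pressed) in favor of extended time-domain features gives a less privacy-invasive system. The experiments show that this approach maintains satisfactory performance.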
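
To make "features in the time domain" concrete, the sketch below derives three common keystroke timing features (hold time, press-to-press latency, release-to-press latency) from press and release timestamps alone. The exact feature set used in the challenge is not given in the abstract, so this particular selection, the millisecond unit, and the zero padding for the last key are assumptions.

```python
def timing_features(events):
    """Per-key timing features from (press_time, release_time) pairs in ms.

    Returns a list of (hold, press_to_press, release_to_press) tuples; the
    inter-key values for the last key are 0 because no key follows it.
    """
    features = []
    for i, (press, release) in enumerate(events):
        hold = release - press
        if i + 1 < len(events):
            next_press = events[i + 1][0]
            press_to_press = next_press - press
            release_to_press = next_press - release  # negative if keys overlap
        else:
            press_to_press = release_to_press = 0
        features.append((hold, press_to_press, release_to_press))
    return features


# Three keystrokes; no key codes are stored, only timestamps.
events = [(0, 90), (150, 230), (210, 300)]
features = timing_features(events)
```

Because the input never contains which key was pressed, a verifier built on such features does not see the typed text.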

Vision Big Bird: Random Sparsification for Full Attention

paper_url: http://arxiv.org/abs/2311.05988
repo_url: None
paper_authors: Zhemin Zhang, Xun Gong
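for: The study proposes a Transformer-based vision model with sparse attention to improve performance on vision tasks.
methods: The model splits the heads into three groups: the first uses a convolutional neural network to extract local features and provide positional information, the second uses Random Sampling Windows for sparse self-attention, and the third reduces the resolution of the keys and values by average pooling for global attention.
results: Experiments show that Vision Big Bird keeps self-attention sparse and that positional encoding can be safely removed. The model achieves competitive performance on common vision tasks.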

Abstract
Recently, Transformers have shown promising performance in various vision tasks. However, the high costs of global self-attention remain challenging for Transformers, especially for high-resolution vision tasks. Inspired by one of the most successful transformers-based models for NLP: Big Bird, we propose a novel sparse attention mechanism for Vision Transformers (ViT). Specifically, we separate the heads into three groups, the first group used convolutional neural network (CNN) to extract local features and provide positional information for the model, the second group used Random Sampling Windows (RS-Win) for sparse self-attention calculation, and the third group reduces the resolution of the keys and values by average pooling for global attention. Based on these components, ViT maintains the sparsity of self-attention while maintaining the merits of Big Bird (i.e., the model is a universal approximator of sequence functions and is Turing complete). Moreover, our results show that the positional encoding, a crucial component in ViTs, can be safely removed in our model. Experiments show that Vision Big Bird demonstrates competitive performance on common vision tasks.

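Summary
Transformers have recently shown promising performance in a variety of vision tasks. The high cost of global self-attention nevertheless remains a challenge, especially for high-resolution vision tasks. Inspired by Big Bird, one of the most successful Transformer-based models for NLP, the paper proposes a new sparse attention mechanism for Vision Transformers (ViT). The heads are split into three groups: the first uses a convolutional neural network (CNN) to extract local features and provide positional information, the second uses Random Sampling Windows (RS-Win) for sparse self-attention, and the third reduces the resolution of the keys and values by average pooling for global attention. With these components the model keeps self-attention sparse while retaining the merits of Big Bird (it is a universal approximator of sequence functions and is Turing complete). The results also show that positional encoding, a crucial component of ViTs, can be safely removed in this model. Vision Big Bird achieves competitive performance on common vision tasks.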
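
The third head group, global attention over average-pooled keys and values, can be sketched in a few lines. The version below works on a 1-D token sequence with a single head and no learned projections; the pooling window, the token layout, and all names are simplifying assumptions rather than the paper's implementation.

```python
import math


def avg_pool(tokens, window):
    """Average consecutive groups of `window` token vectors."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_attention(queries, keys, values, window):
    """Scaled dot-product attention against average-pooled keys and values."""
    k_small, v_small = avg_pool(keys, window), avg_pool(values, window)
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_small]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, v_small))
                        for d in range(len(v_small[0]))])
    return outputs


tokens = [[float(i), 1.0] for i in range(8)]  # 8 toy tokens of dimension 2
pooled_keys = avg_pool(tokens, 4)
out = pooled_attention(tokens, tokens, tokens, window=4)
```

Each query still sees the whole sequence, but it compares against N / window pooled entries instead of N, which is where the saving over full global attention comes from.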

Comparing Male Nyala and Male Kudu Classification using Transfer Learning with ResNet-50 and VGG-16

paper_url: http://arxiv.org/abs/2311.05981
repo_url: None
paper_authors: T. T Lemani, T. L. van Zyl
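for: The paper studies how deep learning and computer vision can identify wild animals quickly and accurately, to inform management and conservation decisions.
methods: The paper uses the pre-trained VGG-16 and ResNet-50 models and fine-tunes them (transfer learning) to identify a male Nyala and a male Kudu in their natural habitats.
results: On a set of 550 images, VGG-16 and ResNet-50 reach accuracies of 93.2% and 97.7%, respectively, before fine-tuning, and both reach 97.7% after fine-tuning. These results rest on a small sample, so they may not be reliable or general enough to draw a full conclusion.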

Abstract
Reliable and efficient monitoring of wild animals is crucial to inform management and conservation decisions. The process of manually identifying species of animals is time-consuming, monotonous, and expensive. Leveraging on advances in deep learning and computer vision, we investigate in this paper the efficiency of pre-trained models, specifically the VGG-16 and ResNet-50 model, in identifying a male Kudu and a male Nyala in their natural habitats. These pre-trained models have proven to be efficient in animal identification in general. Still, there is little research on animals like the Kudu and Nyala, who are usually well camouflaged and have similar features. The method of transfer learning used in this paper is the fine-tuning method. The models are evaluated before and after fine-tuning. The experimental results achieved an accuracy of 93.2% and 97.7% for the VGG-16 and ResNet-50 models, respectively, before fine-tuning and 97.7% for both models after fine-tuning. Although these results are impressive, it should be noted that they were taken over a small sample size of 550 images split in half between the two classes; therefore, this might not cater to enough scenarios to get a full conclusion of the efficiency of the models. Therefore, there is room for more work in getting a more extensive dataset and testing and extending to the female counterparts of these species and the whole antelope species.

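Summary
Reliable and efficient monitoring of wild animals is crucial for management and conservation decisions. Identifying animal species by hand is time-consuming, monotonous, and expensive. Building on advances in deep learning and computer vision, the paper investigates how well the pre-trained VGG-16 and ResNet-50 models identify a male Kudu and a male Nyala in their natural habitats. Pre-trained models have proven efficient for animal identification in general, but there is little research on animals such as the Kudu and Nyala, which are usually well camouflaged and look alike. The transfer learning method used is fine-tuning, and the models are evaluated before and after it. Before fine-tuning, VGG-16 and ResNet-50 reach accuracies of 93.2% and 97.7%, respectively; after fine-tuning, both reach 97.7%. These results are impressive, but they come from a small sample of 550 images split evenly between the two classes, which may not cover enough scenarios for a full conclusion. There is room for further work on a larger dataset and on extending the study to the females of these species and to antelope species as a whole.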

Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments

paper_url: http://arxiv.org/abs/2311.05970
repo_url: https://github.com/calvintanama/qd-driver-activity-reco
paper_authors: Calvin Tanama, Kunyu Peng, Zdravko Marinov, Rainer Stiefelhagen, Alina Roitberg
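for: The paper proposes a lightweight driver activity recognition framework to improve resource efficiency during autonomous driving.
methods: The paper applies knowledge distillation and model quantization to 3D MobileNet, keeping model accuracy while gaining resource efficiency.
results: Experiments show that, compared with an already optimized architecture, the new framework achieves a threefold reduction in model size and a 1.4-fold improvement in inference time.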

Abstract
Deep learning-based models are at the forefront of most driver observation benchmarks due to their remarkable accuracies but are also associated with high computational costs. This is challenging, as resources are often limited in real-world driving scenarios. This paper introduces a lightweight framework for resource-efficient driver activity recognition. The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification, by incorporating knowledge distillation and model quantization to balance model accuracy and computational efficiency. Knowledge distillation helps maintain accuracy while reducing the model size by leveraging soft labels from a larger teacher model (I3D), instead of relying solely on original ground truth data. Model quantization significantly lowers memory and computation demands by using lower precision integers for model weights and activations. Extensive testing on a public dataset for in-vehicle monitoring during autonomous driving demonstrates that this new framework achieves a threefold reduction in model size and a 1.4-fold improvement in inference time, compared to an already optimized architecture. The code for this study is available at https://github.com/calvintanama/qd-driver-activity-reco.

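Summary
Deep learning models lead most driver observation benchmarks thanks to their remarkable accuracy, but they also carry high computational costs. This is a problem because resources are often limited in real-world driving scenarios. The paper introduces a lightweight framework for resource-efficient driver activity recognition. It enhances 3D MobileNet, a neural architecture optimized for fast video classification, with knowledge distillation and model quantization to balance accuracy and computational efficiency. Knowledge distillation helps maintain accuracy while shrinking the model by using soft labels from a larger teacher model (I3D) rather than relying only on the original ground truth. Model quantization significantly lowers memory and compute demands by using lower-precision integers for weights and activations. Extensive tests on a public dataset for in-vehicle monitoring during autonomous driving show a threefold reduction in model size and a 1.4-fold improvement in inference time compared with an already optimized architecture. The code is available at https://github.com/calvintanama/qd-driver-activity-reco.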
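
The two ingredients named above can each be shown in miniature. The sketch assumes a standard temperature-softened KL distillation term and symmetric per-tensor int8 quantization; the paper's actual loss weighting, temperature, and quantization scheme are not stated in the abstract, so these choices and all names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127.0
    return [int(round(w / scale)) for w in weights], scale


def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]


kd_same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
kd_diff = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
weights = [0.5, -1.27, 0.02]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
```

The distillation term vanishes when the student reproduces the teacher's soft labels, and the quantized weights differ from the originals by at most half a quantization step.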

A Neural Height-Map Approach for the Binocular Photometric Stereo Problem

paper_url: http://arxiv.org/abs/2311.05958
repo_url: None
paper_authors: Fotios Logothetis, Ignas Budvytis, Roberto Cipolla
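for: The paper proposes a new, practical binocular photometric stereo (PS) framework that has the same acquisition speed as single-view PS but significantly improves the quality of the estimated geometry.
methods: The paper uses deep learning to fit a surface and texture representation, estimating shape by minimising the surface normal discrepancy.
results: The method achieves state-of-the-art performance on the DiLiGenT-MV dataset, adapted to a binocular stereo setup, and on the LUCES-ST dataset.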

Abstract
In this work we propose a novel, highly practical, binocular photometric stereo (PS) framework, which has same acquisition speed as single view PS, however significantly improves the quality of the estimated geometry. As in recent neural multi-view shape estimation frameworks such as NeRF, SIREN and inverse graphics approaches to multi-view photometric stereo (e.g. PS-NeRF) we formulate shape estimation task as learning of a differentiable surface and texture representation by minimising surface normal discrepancy for normals estimated from multiple varying light images for two views as well as discrepancy between rendered surface intensity and observed images. Our method differs from typical multi-view shape estimation approaches in two key ways. First, our surface is represented not as a volume but as a neural heightmap where heights of points on a surface are computed by a deep neural network. Second, instead of predicting an average intensity as PS-NeRF or introducing lambertian material assumptions as Guo et al., we use a learnt BRDF and perform near-field per point intensity rendering. Our method achieves the state-of-the-art performance on the DiLiGenT-MV dataset adapted to binocular stereo setup as well as a new binocular photometric stereo dataset - LUCES-ST.

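Summary
This work proposes a novel, highly practical binocular photometric stereo (PS) framework that has the same acquisition speed as single-view PS but significantly improves the quality of the estimated geometry. As in recent neural multi-view shape estimation frameworks such as NeRF and SIREN, and in inverse graphics approaches to multi-view photometric stereo (e.g. PS-NeRF), shape estimation is formulated as learning a differentiable surface and texture representation. The learning minimises the surface normal discrepancy for normals estimated from multiple varying-light images of the two views, as well as the discrepancy between rendered surface intensity and the observed images. The method differs from typical multi-view shape estimation approaches in two ways. First, the surface is represented not as a volume but as a neural heightmap, in which a deep neural network computes the heights of surface points. Second, instead of predicting an average intensity as PS-NeRF does, or assuming Lambertian materials as Guo et al. do, it uses a learnt BRDF and performs near-field per-point intensity rendering. The method achieves state-of-the-art performance on the DiLiGenT-MV dataset adapted to a binocular stereo setup and on a new binocular photometric stereo dataset, LUCES-ST.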
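
A heightmap representation makes surface normals easy to derive, which is what a normal-discrepancy loss would compare against. The sketch below computes normals of a discrete height grid by central differences under an orthographic assumption; in the paper the heights come from a neural network and the setting is near-field, so this is only a simplified illustration with hypothetical names.

```python
import math


def heightmap_normals(height, spacing=1.0):
    """Unit normals of z = h(x, y) at interior pixels via central differences.

    The unnormalised normal of a height field is (-dh/dx, -dh/dy, 1).
    """
    rows, cols = len(height), len(height[0])
    normals = {}
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            dh_dx = (height[y][x + 1] - height[y][x - 1]) / (2.0 * spacing)
            dh_dy = (height[y + 1][x] - height[y - 1][x]) / (2.0 * spacing)
            length = math.sqrt(dh_dx ** 2 + dh_dy ** 2 + 1.0)
            normals[(y, x)] = (-dh_dx / length, -dh_dy / length, 1.0 / length)
    return normals


# A tilted plane h = 0.5 * x stands in for predicted heights.
height = [[0.5 * x for x in range(4)] for _ in range(4)]
normals = heightmap_normals(height)
```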

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

paper_url: http://arxiv.org/abs/2311.06322
repo_url: None
paper_authors: Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, Wenwu Zhu
for: This paper focuses on developing a novel post-training quantization method for text-to-image diffusion models, specifically targeting widely used large pretrained models like Stable Diffusion and Stable Diffusion XL.
methods: The proposed method, called PCR (Progressive Calibration and Relaxing), consists of two key strategies: progressive calibration and activation relaxing. The former considers the accumulated quantization error across timesteps, while the latter improves performance with negligible cost.
results: The proposed method and a new benchmark (QDiffBench) are extensively evaluated on Stable Diffusion and Stable Diffusion XL. The results show that the proposed method achieves superior performance and is the first to achieve quantization for Stable Diffusion XL while maintaining performance. Additionally, QDiffBench provides a more accurate evaluation of text-to-image diffusion model quantization by considering the distribution gap and generalization performance outside the calibration dataset.

Abstract
Diffusion models have achieved great success due to their remarkable generation ability. However, their high computational overhead is still a troublesome problem. Recent studies have leveraged post-training quantization (PTQ) to compress diffusion models. However, most of them only focus on unconditional models, leaving the quantization of widely used large pretrained text-to-image models, e.g., Stable Diffusion, largely unexplored. In this paper, we propose a novel post-training quantization method PCR (Progressive Calibration and Relaxing) for text-to-image diffusion models, which consists of a progressive calibration strategy that considers the accumulated quantization error across timesteps, and an activation relaxing strategy that improves the performance with negligible cost. Additionally, we demonstrate the previous metrics for text-to-image diffusion model quantization are not accurate due to the distribution gap. To tackle the problem, we propose a novel QDiffBench benchmark, which utilizes data in the same domain for more accurate evaluation. Besides, QDiffBench also considers the generalization performance of the quantized model outside the calibration dataset. Extensive experiments on Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our method and benchmark. Moreover, we are the first to achieve quantization for Stable Diffusion XL while maintaining the performance.


Efficient Segmentation with Texture in Ore Images Based on Box-supervised Approach

paper_url: http://arxiv.org/abs/2311.05929
repo_url: https://github.com/mvme-hbut/oreinst
paper_authors: Guodong Sun, Delong Huang, Yuting Peng, Le Cheng, Bo Wu, Yang Zhang
for: The paper addresses image segmentation of crushed ores in a complex working environment, where high-powered computing equipment is difficult to deploy and the ore distribution is stacked, making it challenging to identify complete features.
methods: The proposed method uses a ghost feature pyramid network (Ghost-FPN) to process features obtained from the backbone, an optimized detection head to obtain accurate features, and a fusion feature similarity-based loss function that combines Lab color space (Lab) and local binary patterns (LBP) texture features to improve accuracy while incurring no loss.
results: The proposed method achieves over 50 frames per second with a small model size of 21.6 MB, and maintains a high level of accuracy compared with state-of-the-art approaches on ore image datasets.

Abstract
Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is provided for ore image segmentation that can identify complete and independent ores. Firstly, a ghost feature pyramid network (Ghost-FPN) is proposed to process the features obtained from the backbone to reduce redundant semantic information and computation generated by complex networks. Then, an optimized detection head is proposed to obtain the feature to maintain accuracy. Finally, Lab color space (Lab) and local binary patterns (LBP) texture features are combined to form a fusion feature similarity-based loss function to improve accuracy while incurring no loss. Experiments on MS COCO have shown that the proposed fusion features are also worth studying on other types of datasets. Extensive experimental results demonstrate the effectiveness of the proposed method, which achieves over 50 frames per second with a small model size of 21.6 MB. Meanwhile, the method maintains a high level of accuracy compared with the state-of-the-art approaches on ore image dataset. The source code is available at https://github.com/MVME-HBUT/OREINST.

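Summary
Image segmentation methods have been used to determine the particle size distribution of crushed ores. The complex working environment makes high-powered computing equipment difficult to deploy, and because the ores are stacked, complete features are hard to identify. To address this, the paper provides an effective box-supervised technique with texture features for ore image segmentation that can identify complete and independent ores. First, a ghost feature pyramid network (Ghost-FPN) processes the features from the backbone to reduce the redundant semantic information and computation generated by complex networks. Then, an optimized detection head obtains the features needed to maintain accuracy. Finally, Lab color space (Lab) and local binary patterns (LBP) texture features are combined into a fusion-feature similarity-based loss function that improves accuracy while incurring no loss. Experiments on MS COCO suggest that the proposed fusion features are also worth studying on other types of datasets. The method runs at over 50 frames per second with a small model size of 21.6 MB while maintaining a high level of accuracy compared with state-of-the-art approaches on the ore image dataset. The source code is available at https://github.com/MVME-HBUT/OREINST.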
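
For readers unfamiliar with local binary patterns, the sketch below computes the basic 3x3 LBP code of a pixel and a histogram of codes over an image. It shows only the classic texture descriptor; how the paper combines LBP with Lab color into a similarity-based loss is not reproduced here, and the neighbour ordering and the ">=" convention are choices made for this example.

```python
def lbp_code(image, y, x):
    """8-bit local binary pattern of the interior pixel at (y, x).

    Each neighbour contributes a 1 bit when it is >= the centre value.
    Neighbours are visited clockwise starting from the top-left.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if image[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (256 bins)."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    return hist


flat = [[7] * 4 for _ in range(4)]                   # uniform patch
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]             # bright centre pixel
edge = [[0, 0, 0], [0, 5, 9], [0, 0, 0]]             # brighter right neighbour
```

Two pixels with similar LBP codes sit in similar local texture, which is the kind of cue a texture-aware similarity term can use when only box annotations are available.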

Automated Sperm Assessment Framework and Neural Network Specialized for Sperm Video Recognition

paper_url: http://arxiv.org/abs/2311.05927
repo_url: https://github.com/ftkr12/rostfine
paper_authors: Takuro Fujii, Hayato Nakagawa, Teppei Takeshima, Yasushi Yumura, Tomoki Hamagami
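for: The study aims to raise the success rate of assisted reproductive technologies by improving sperm assessment with deep learning.
methods: The study builds a video dataset whose videos show the sperm head, neck, and tail, annotated with soft labels. It proposes a video recognition-based sperm assessment framework and a neural network model, RoSTFine.
results: Experiments show that RoSTFine improves sperm assessment performance and focuses on the important sperm parts (the head and neck).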

Abstract
Infertility is a global health problem, and an increasing number of couples are seeking medical assistance to achieve reproduction, at least half of which are caused by men. The success rate of assisted reproductive technologies depends on sperm assessment, in which experts determine whether sperm can be used for reproduction based on morphology and motility of sperm. Previous sperm assessment studies with deep learning have used datasets comprising images that include only sperm heads, which cannot consider motility and other morphologies of sperm. Furthermore, the labels of the dataset are one-hot, which provides insufficient support for experts, because assessment results are inconsistent between experts, and they have no absolute answer. Therefore, we constructed the video dataset for sperm assessment whose videos include sperm head as well as neck and tail, and its labels were annotated with soft-label. Furthermore, we proposed the sperm assessment framework and the neural network, RoSTFine, for sperm video recognition. Experimental results showed that RoSTFine could improve the sperm assessment performances compared to existing video recognition models and focus strongly on important sperm parts (i.e., head and neck).

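Summary
Infertility is a global health problem, and a growing number of couples seek medical assistance to achieve reproduction; at least half of the cases are attributed to men. The success rate of assisted reproductive technologies depends on sperm assessment, in which experts judge from sperm morphology and motility whether sperm can be used for reproduction. Previous deep learning studies on sperm assessment used datasets of images containing only sperm heads, which cannot account for motility or other morphology. Moreover, the dataset labels are one-hot, which gives experts insufficient support because assessments differ between experts and have no absolute answer. The authors therefore built a video dataset whose videos show the sperm head as well as the neck and tail, annotated with soft labels. They also propose a sperm assessment framework and a neural network, RoSTFine, for sperm video recognition. Experiments show that RoSTFine improves sperm assessment performance over existing video recognition models and focuses strongly on the important sperm parts (the head and neck).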
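
The difference between one-hot and soft labels can be seen in a small example. The sketch assumes that a soft label is the fraction of experts voting for each class and that training uses cross-entropy against that distribution; the number of experts, the number of classes, and the way the dataset's soft labels were actually produced are assumptions, not details from the paper.

```python
import math


def soft_label(votes, n_classes):
    """Turn per-expert class votes into a probability vector."""
    counts = [0] * n_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]


def soft_cross_entropy(predicted, target):
    """Cross-entropy between predicted probabilities and a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical: five experts grade one video into three quality classes.
target = soft_label([0, 0, 1, 0, 1], n_classes=3)
one_hot = [1.0, 0.0, 0.0]
hedged_pred = [0.6, 0.39, 0.01]      # mirrors the experts' disagreement
confident_pred = [0.98, 0.01, 0.01]  # ignores the minority opinion

loss_soft_hedged = soft_cross_entropy(hedged_pred, target)
loss_soft_confident = soft_cross_entropy(confident_pred, target)
loss_hot_hedged = soft_cross_entropy(hedged_pred, one_hot)
loss_hot_confident = soft_cross_entropy(confident_pred, one_hot)
```

With the soft target, the prediction that reflects the experts' split scores better; with the one-hot target, the overconfident prediction is rewarded instead.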

Inter-object Discriminative Graph Modeling for Indoor Scene Recognition

paper_url: http://arxiv.org/abs/2311.05919
repo_url: None
paper_authors: Chuanxin Song, Hanbo Wu, Xin Ma
for: This paper focuses on improving indoor scene recognition by leveraging object information within scenes to enhance feature representations.
methods: The proposed approach captures object-scene discriminative relationships from a probabilistic perspective and transforms them into an Inter-Object Discriminative Prototype (IODP). A Discriminative Graph Network (DGN) is then constructed to incorporate inter-object discriminative knowledge into the image representation through graph convolution.
results: The proposed approach achieves state-of-the-art results on several widely used scene datasets.

Abstract
Variable scene layouts and coexisting objects across scenes make indoor scene recognition still a challenging task. Leveraging object information within scenes to enhance the distinguishability of feature representations has emerged as a key approach in this domain. Currently, most object-assisted methods use a separate branch to process object information, combining object and scene features heuristically. However, few of them pay attention to interpretably handle the hidden discriminative knowledge within object information. In this paper, we propose to leverage discriminative object knowledge to enhance scene feature representations. Initially, we capture the object-scene discriminative relationships from a probabilistic perspective, which are transformed into an Inter-Object Discriminative Prototype (IODP). Given the abundant prior knowledge from IODP, we subsequently construct a Discriminative Graph Network (DGN), in which pixel-level scene features are defined as nodes and the discriminative relationships between node features are encoded as edges. DGN aims to incorporate inter-object discriminative knowledge into the image representation through graph convolution. With the proposed IODP and DGN, we obtain state-of-the-art results on several widely used scene datasets, demonstrating the effectiveness of the proposed approach.

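Summary
Variable scene layouts and objects that coexist across scenes keep indoor scene recognition a challenging task. Using object information within scenes to make feature representations more distinguishable has become a key approach in this field. Most current object-assisted methods process object information in a separate branch and combine object and scene features heuristically. Few of them, however, handle the discriminative knowledge hidden in object information in an interpretable way. This paper proposes to use discriminative object knowledge to enhance scene feature representations. The authors first capture object-scene discriminative relationships from a probabilistic perspective and transform them into an Inter-Object Discriminative Prototype (IODP). Building on the rich prior knowledge in the IODP, they then construct a Discriminative Graph Network (DGN), in which pixel-level scene features are nodes and the discriminative relationships between node features are encoded as edges. The DGN incorporates inter-object discriminative knowledge into the image representation through graph convolution. With the IODP and DGN, the method obtains state-of-the-art results on several widely used scene datasets.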

Semantic Map Guided Synthesis of Wireless Capsule Endoscopy Images using Diffusion Models

paper_url: http://arxiv.org/abs/2311.05889
repo_url: None
paper_authors: Haejin Lee, Jeongwoo Ju, Jonghyuck Lee, Yeoun Joo Lee, Heechul Jung
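for: This study aims to make the interpretation of wireless capsule endoscopy (WCE) results more efficient and to provide more, and more diverse, WCE images to support the diagnosis of gastrointestinal tract diseases.
methods: The study uses a generative model, specifically a diffusion model (DM), to generate diverse WCE images. The model also incorporates a semantic map produced by the visualization scale (VS) engine to improve the controllability and diversity of the generated images.
results: Visual inspection and visual Turing tests show that the method generates realistic and diverse WCE images.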

Abstract
Wireless capsule endoscopy (WCE) is a non-invasive method for visualizing the gastrointestinal (GI) tract, crucial for diagnosing GI tract diseases. However, interpreting WCE results can be time-consuming and tiring. Existing studies have employed deep neural networks (DNNs) for automatic GI tract lesion detection, but acquiring sufficient training examples, particularly due to privacy concerns, remains a challenge. Public WCE databases lack diversity and quantity. To address this, we propose a novel approach leveraging generative models, specifically the diffusion model (DM), for generating diverse WCE images. Our model incorporates semantic map resulted from visualization scale (VS) engine, enhancing the controllability and diversity of generated images. We evaluate our approach using visual inspection and visual Turing tests, demonstrating its effectiveness in generating realistic and diverse WCE images.

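Summary
Wireless capsule endoscopy (WCE) is a non-invasive method for visualizing the gastrointestinal (GI) tract and is crucial for diagnosing GI tract diseases. Interpreting WCE results, however, can be time-consuming and tiring. Existing studies have used deep neural networks (DNNs) to detect GI tract lesions automatically, but acquiring enough training examples remains a challenge, particularly because of privacy concerns. Public WCE databases lack both diversity and quantity. To address this, the authors propose an approach that uses generative models, specifically the diffusion model (DM), to generate diverse WCE images. The model incorporates a semantic map produced by the visualization scale (VS) engine, which improves the controllability and diversity of the generated images. Visual inspection and visual Turing tests show that the approach generates realistic and diverse WCE images.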

Central Angle Optimization for 360-degree Holographic 3D Content

paper_url: http://arxiv.org/abs/2311.05878
repo_url: None
paper_authors: Hakdong Kim, Minsung Yoon, Cheongwon Kim
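for: This paper proposes a method for finding the optimal central angle in deep learning-based depth map estimation used to produce realistic holographic content.
methods: The method analyzes various central angles between adjacent camera viewpoints and selects the optimal one for generating high-quality holographic content.
results: Experiments indicate that choosing the optimal central angle improves the quality of the holographic content.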

Abstract
In this study, we propose a method to find an optimal central angle in deep learning-based depth map estimation used to produce realistic holographic content. The acquisition of RGB-depth map images as detailed as possible must be performed to generate holograms of high quality, despite the high computational cost. Therefore, we introduce a novel pipeline designed to analyze various values of central angles between adjacent camera viewpoints equidistant from the origin of an object-centered environment. Then we propose the optimal central angle to generate high-quality holographic content. The proposed pipeline comprises key steps such as comparing estimated depth maps and comparing reconstructed CGHs (Computer-Generated Holograms) from RGB images and estimated depth maps. We experimentally demonstrate and discuss the relationship between the central angle and the quality of digital holographic content.

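Summary
This study proposes a method for finding the optimal central angle in deep learning-based depth map estimation used to produce realistic holographic content. Generating high-quality holograms requires RGB-depth map images that are as detailed as possible, despite the high computational cost. The authors therefore introduce a pipeline that analyzes various central angles between adjacent camera viewpoints equidistant from the origin of an object-centered environment, and then propose the optimal central angle for generating high-quality holographic content. The pipeline's key steps include comparing estimated depth maps and comparing CGHs (Computer-Generated Holograms) reconstructed from RGB images and estimated depth maps. The experiments demonstrate and discuss the relationship between the central angle and the quality of digital holographic content.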
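
The geometric setup described above (adjacent camera viewpoints equidistant from the origin, separated by a central angle) can be sketched as follows. This is only an illustration under the assumption of a single horizontal ring of cameras; `camera_positions` and `adjacent_angle_deg` are hypothetical helper names, and the paper's actual capture setup may differ.

```python
import math

def camera_positions(radius, central_angle_deg):
    """Place viewpoints on a circle around the origin.

    Adjacent viewpoints are separated by `central_angle_deg`, so a smaller
    central angle means more viewpoints (and more depth maps to estimate).
    """
    if not 0 < central_angle_deg <= 360:
        raise ValueError("central angle must be in (0, 360] degrees")
    count = int(360 // central_angle_deg)
    step = math.radians(central_angle_deg)
    return [(radius * math.cos(k * step), radius * math.sin(k * step))
            for k in range(count)]

def adjacent_angle_deg(p, q):
    """Central angle (in degrees) subtended at the origin by two viewpoints."""
    dot = p[0] * q[0] + p[1] * q[1]
    norm = math.hypot(*p) * math.hypot(*q)
    cos_value = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_value))

# Candidate central angles to compare; the values are illustrative only.
viewpoint_counts = {angle: len(camera_positions(2.0, angle))
                    for angle in (5.0, 15.0, 30.0)}
```

The sketch makes the trade-off explicit: halving the central angle doubles the number of viewpoints to process, which is why an optimal angle is worth searching for.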

Automated Heterogeneous Low-Bit Quantization of Multi-Model Deep Learning Inference Pipeline

paper_url: http://arxiv.org/abs/2311.05870
repo_url: None
paper_authors: Jayeeta Mondal, Swarnava Dey, Arijit Mukherjee
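for: This paper proposes an automated quantization approach for pipelines that combine multiple deep neural networks (DNNs), in order to balance accuracy and latency in edge deployment.
methods: The paper targets deep learning (DL) inference pipelines that integrate multiple DNNs, such as Multi-Task Learning (MTL) or Ensemble Learning (EL), and quantizes the models heterogeneously because they differ in quantization tolerance and resource demands.
results: According to this summary, the automated quantization approach balances accuracy and latency across the multiple DNNs for edge deployment.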

Abstract
Multiple Deep Neural Networks (DNNs) integrated into single Deep Learning (DL) inference pipelines e.g. Multi-Task Learning (MTL) or Ensemble Learning (EL), etc., albeit very accurate, pose challenges for edge deployment. In these systems, models vary in their quantization tolerance and resource demands, requiring meticulous tuning for accuracy-latency balance. This paper introduces an automated heterogeneous quantization approach for DL inference pipelines with multiple DNNs.

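Summary
Multiple deep neural networks (DNNs) integrated into a single deep learning (DL) inference pipeline, as in multi-task learning (MTL) or ensemble learning (EL), are very accurate but challenging to deploy at the edge. The models in such systems differ in quantization tolerance and resource demands, so balancing accuracy and latency requires meticulous tuning. This paper introduces an automated heterogeneous quantization approach for DL inference pipelines with multiple DNNs.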
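
To illustrate what "heterogeneous" quantization means here, the sketch below quantizes the weights of two models in a pipeline at different bit-widths and measures the resulting error. It uses plain symmetric uniform quantization as a stand-in; the model names, weights, and bit plan are hypothetical, and the paper's automated search for the bit-widths is not reproduced.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers,
    followed by dequantization back to floats."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed range")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_squared_error(a, b):
    """Average squared difference between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical two-model pipeline with illustrative weights.
pipeline = {
    "detector": [0.82, -0.41, 0.07, -0.93, 0.25],
    "classifier": [0.82, -0.41, 0.07, -0.93, 0.25],
}
# Hypothetical per-model bit plan: the more tolerant model gets fewer bits.
bit_plan = {"detector": 8, "classifier": 4}

errors = {
    name: mean_squared_error(weights, quantize_uniform(weights, bit_plan[name]))
    for name, weights in pipeline.items()
}
```

An automated approach would choose `bit_plan` by trading off such per-model errors (or the resulting accuracy) against latency, rather than fixing it by hand as done here.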

Watermarking Vision-Language Pre-trained Models for Multi-modal Embedding as a Service

paper_url: http://arxiv.org/abs/2311.05863
repo_url: https://github.com/Pter61/vlpmarker
paper_authors: Yuanmin Tang, Jing Yu, Keke Gai, Xiangyan Qu, Yue Hu, Gang Xiong, Qi Wu
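for: This work aims to provide a safe and robust copyright watermarking method that protects the intellectual property and commercial ownership of vision-language pre-trained models (VLPs) offered through multi-modal Embedding as a Service (EaaS) against model extraction attacks.
methods: The method injects backdoor triggers into the VLPs as a watermark through an embedding orthogonal transformation, which achieves high-quality copyright verification with minimal impact on model performance. The authors also propose a collaborative copyright verification strategy that combines the backdoor trigger with the embedding distribution to make the watermark more resilient against various attacks.
results: Experiments on various datasets show that the watermarking method is effective and safe for verifying the copyright of VLPs and robust against model extraction attacks. An out-of-distribution trigger selection approach makes the watermark practical in real-world scenarios.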

Abstract
Recent advances in vision-language pre-trained models (VLPs) have significantly increased visual understanding and cross-modal analysis capabilities. Companies have emerged to provide multi-modal Embedding as a Service (EaaS) based on VLPs (e.g., CLIP-based VLPs), which cost a large amount of training data and resources for high-performance service. However, existing studies indicate that EaaS is vulnerable to model extraction attacks that induce great loss for the owners of VLPs. Protecting the intellectual property and commercial ownership of VLPs is increasingly crucial yet challenging. A major solution of watermarking model for EaaS implants a backdoor in the model by inserting verifiable trigger embeddings into texts, but it is only applicable for large language models and is unrealistic due to data and model privacy. In this paper, we propose a safe and robust backdoor-based embedding watermarking method for VLPs called VLPMarker. VLPMarker utilizes embedding orthogonal transformation to effectively inject triggers into the VLPs without interfering with the model parameters, which achieves high-quality copyright verification and minimal impact on model performance. To enhance the watermark robustness, we further propose a collaborative copyright verification strategy based on both backdoor trigger and embedding distribution, enhancing resilience against various attacks. We increase the watermark practicality via an out-of-distribution trigger selection approach, removing access to the model training data and thus making it possible for many real-world scenarios. Our extensive experiments on various datasets indicate that the proposed watermarking approach is effective and safe for verifying the copyright of VLPs for multi-modal EaaS and robust against model extraction attacks. Our code is available at https://github.com/Pter61/vlpmarker.

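Summary
Recent advances in vision-language pre-trained models (VLPs) have improved visual understanding and cross-modal analysis. Companies have emerged that offer multi-modal Embedding as a Service (EaaS) based on VLPs, which require large amounts of training data and resources to deliver a high-performance service. Existing studies show, however, that EaaS is vulnerable to model extraction attacks, which cause great losses for the owners of VLPs. Protecting the intellectual property and commercial ownership of VLPs is increasingly crucial yet challenging. This paper proposes a safe and robust embedding watermarking method for VLPs called VLPMarker. VLPMarker uses an embedding orthogonal transformation to inject triggers into the VLPs without interfering with the model parameters, achieving high-quality copyright verification with minimal impact on model performance. To make the watermark more robust, the authors further propose a collaborative copyright verification strategy based on both the backdoor trigger and the embedding distribution, which improves resilience against various attacks. An out-of-distribution trigger selection approach makes the watermark more practical because it removes the need to access the model training data. Experiments show that the proposed watermarking method is effective and safe for verifying the copyright of VLPs for multi-modal EaaS. The code is available at https://github.com/Pter61/vlpmarker.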
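
One reason an orthogonal transformation is a natural choice for embeddings is that it preserves norms and cosine similarities, so retrieval-style uses of the embeddings are unaffected. The sketch below demonstrates only this property, using a Householder reflection as an example of an orthogonal matrix; it is an assumption-laden illustration and not VLPMarker's actual construction, which the abstract does not spell out.

```python
import math

def householder(v):
    """Return the orthogonal matrix H = I - 2 v v^T / (v^T v)."""
    n = len(v)
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]

def transform(matrix, x):
    """Multiply a square matrix with a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" and an arbitrary reflection vector.
image_embedding = [1.0, 2.0, 3.0]
text_embedding = [-2.0, 0.5, 1.0]
Q = householder([1.0, -1.0, 2.0])

before = cosine(image_embedding, text_embedding)
after = cosine(transform(Q, image_embedding), transform(Q, text_embedding))
```

Here `before` and `after` agree up to floating-point error, which is the property that lets a transformed embedding space remain usable for similarity search.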

Domain Generalization by Learning from Privileged Medical Imaging Information

paper_url: http://arxiv.org/abs/2311.05861
repo_url: None
paper_authors: Steven Korevaar, Ruwan Tennakoon, Ricky O’Brien, Dwarikanath Mahapatra, Alireza Bab-Hadiashar
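for: This study aims to make medical image classification models more robust to shifts in the data distribution.
methods: The authors propose using privileged information (such as tumor shape or location) to strengthen domain generalization.
results: Using privileged information raises the classification accuracy of a medical image classification model on out-of-distribution data from 0.911 to 0.934.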

Abstract
Learning the ability to generalize knowledge between similar contexts is particularly important in medical imaging as data distributions can shift substantially from one hospital to another, or even from one machine to another. To strengthen generalization, most state-of-the-art techniques inject knowledge of the data distribution shifts by enforcing constraints on learned features or regularizing parameters. We offer an alternative approach: Learning from Privileged Medical Imaging Information (LPMII). We show that using some privileged information such as tumor shape or location leads to stronger domain generalization ability than current state-of-the-art techniques. This paper demonstrates that by using privileged information to predict the severity of intra-layer retinal fluid in optical coherence tomography scans, the classification accuracy of a deep learning model operating on out-of-distribution data improves from $0.911$ to $0.934$. This paper provides a strong starting point for using privileged information in other medical problems requiring generalization.

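Summary
The ability to generalize knowledge between similar contexts is particularly important in medical imaging, because data distributions can differ substantially between hospitals or even between machines. To strengthen generalization, most state-of-the-art techniques inject knowledge of the distribution shifts by constraining learned features or regularizing parameters. The authors offer an alternative: Learning from Privileged Medical Imaging Information (LPMII). They show that privileged information such as tumor shape or location leads to stronger domain generalization than current state-of-the-art techniques. When privileged information is used to predict the severity of intra-layer retinal fluid in optical coherence tomography scans, the classification accuracy of a deep learning model on out-of-distribution data improves from 0.911 to 0.934. The paper provides a strong starting point for using privileged information in other medical problems that require generalization.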

Layer-wise Auto-Weighting for Non-Stationary Test-Time Adaptation

paper_url: http://arxiv.org/abs/2311.05858
repo_url: https://github.com/junia3/LayerwiseTTA
paper_authors: Junyoung Park, Jin Kim, Hyeongjun Kwon, Ilhoon Yoon, Kwanghoon Sohn
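for: This paper addresses adapting a model after deployment in real-world applications, where the target distribution changes continuously.
methods: The paper proposes a layer-wise auto-weighting algorithm that uses the Fisher Information Matrix (FIM) to design a learning weight, focusing adaptation on layers associated with log-likelihood changes while preserving unrelated layers. It also proposes an exponential min-max scaler that leaves certain layers nearly frozen, reducing forgetting and error accumulation.
results: Experiments show that the method outperforms conventional continual and gradual test-time adaptation approaches while significantly reducing computational load, highlighting the importance of the FIM-based learning weight for adapting to continuously shifting target distributions.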

Abstract
Given the inevitability of domain shifts during inference in real-world applications, test-time adaptation (TTA) is essential for model adaptation after deployment. However, the real-world scenario of continuously changing target distributions presents challenges including catastrophic forgetting and error accumulation. Existing TTA methods for non-stationary domain shifts, while effective, incur excessive computational load, making them impractical for on-device settings. In this paper, we introduce a layer-wise auto-weighting algorithm for continual and gradual TTA that autonomously identifies layers for preservation or concentrated adaptation. By leveraging the Fisher Information Matrix (FIM), we first design the learning weight to selectively focus on layers associated with log-likelihood changes while preserving unrelated ones. Then, we further propose an exponential min-max scaler to make certain layers nearly frozen while mitigating outliers. This minimizes forgetting and error accumulation, leading to efficient adaptation to non-stationary target distribution. Experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C show our method outperforms conventional continual and gradual TTA approaches while significantly reducing computational load, highlighting the importance of FIM-based learning weight in adapting to continuously or gradually shifting target domains.

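Summary
Because domain shifts during inference are inevitable in real-world applications, test-time adaptation (TTA) is essential for adapting a model after deployment. Continuously changing target distributions, however, bring challenges including catastrophic forgetting and error accumulation. Existing TTA methods for non-stationary domain shifts are effective but incur an excessive computational load, which makes them impractical for on-device settings. This paper introduces a layer-wise auto-weighting algorithm for continual and gradual TTA that autonomously identifies layers for preservation or concentrated adaptation. Using the Fisher Information Matrix (FIM), the authors first design a learning weight that focuses on layers associated with log-likelihood changes while preserving unrelated ones. They then propose an exponential min-max scaler that leaves certain layers nearly frozen while mitigating outliers. This minimizes forgetting and error accumulation and leads to efficient adaptation to non-stationary target distributions. Experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C show that the method outperforms conventional continual and gradual TTA approaches while significantly reducing computational load, highlighting the importance of the FIM-based learning weight in adapting to continuously or gradually shifting target domains.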

Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance Fields

paper_url: http://arxiv.org/abs/2311.05836
repo_url: None
paper_authors: Jing Hu, Qinrui Fan, Shu Hu, Siwei Lyu, Xi Wu, Xin Wang
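for: This study proposes an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields that learns a continuous representation of CT projections from 2D X-ray images.
methods: The network obtains internal structure and depth information and uses adaptive loss weights to ensure the quality of the generated images.
results: The model is trained on publicly available knee and chest datasets, renders CT projections from a single X-ray image, and is compared with other methods based on generated radiation fields.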

Abstract
In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields.

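Summary
In clinical medicine, computed tomography (CT) is an effective medical imaging modality for diagnosing various pathologies. Compared with X-ray images, CT images provide more information for clinical diagnosis, including multi-planar slices and three-dimensional structures. CT imaging, however, exposes patients to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. This paper proposes an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. By obtaining internal structure and depth information, the network learns a continuous representation of CT projections from 2D X-ray images, and it uses adaptive loss weights to ensure the quality of the generated images. The model is trained on publicly available knee and chest datasets; the authors show CT projection rendering results from a single X-ray and compare the method with other methods based on generated radiation fields.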

Diffusion Shape Prior for Wrinkle-Accurate Cloth Registration

paper_url: http://arxiv.org/abs/2311.05828
repo_url: None
paper_authors: Jingfan Guo, Fabian Prada, Donglai Xiang, Javier Romero, Chenglei Wu, Hyun Soo Park, Takaaki Shiratori, Shunsuke Saito
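for: Registering clothes from 4D scans to support dynamic appearance modeling and physics parameter estimation from real-world data.
methods: The approach learns a shape prior with diffusion models and proposes a multi-stage guidance scheme based on learned functional maps to stabilize registration.
results: On high-fidelity captures of real clothes, the proposed approach generalizes better than surface registration with VAE or PCA-based priors and outperforms both optimization-based and learning-based non-rigid registration methods in interpolation and extrapolation tests.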

Abstract
Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. However, previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach to enabling accurate surface registration of texture-less clothes with large deformation. Our key idea is to effectively leverage a shape prior learned from pre-captured clothing using diffusion models. We also propose a multi-stage guidance scheme based on learned functional maps, which stabilizes registration for large-scale deformation even when they vary significantly from training data. Using high-fidelity real captured clothes, our experiments show that the proposed approach based on diffusion models generalizes better than surface registration with VAE or PCA-based priors, outperforming both optimization-based and learning-based non-rigid registration methods for both interpolation and extrapolation tests.


Adaptive Variance Thresholding: A Novel Approach to Improve Existing Deep Transfer Vision Models and Advance Automatic Knee-Joint Osteoarthritis Classification

paper_url: http://arxiv.org/abs/2311.05799
repo_url: None
paper_authors: Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala
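for: This study aims to improve the classification accuracy for Knee-Joint Osteoarthritis (KOA) by combining deep learning with adaptive variance thresholding (AVT) and Neural Architecture Search (NAS).
methods: The study takes pre-trained deep learning models and applies adaptive variance thresholding (AVT) followed by Neural Architecture Search (NAS).
results: The approach raises the initial accuracy of the pre-trained KOA models and reduces the NAS input vector space 60-fold, which speeds up inference and makes the hyperparameter search more efficient. Applied to an externally trained KOA classification model, the approach improved its average accuracy, making it one of the top three KOA classification models.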

Abstract
Knee-Joint Osteoarthritis (KOA) is a prevalent cause of global disability and is inherently complex to diagnose due to its subtle radiographic markers and individualized progression. One promising classification avenue involves applying deep learning methods; however, these techniques demand extensive, diversified datasets, which pose substantial challenges due to medical data collection restrictions. Existing practices typically resort to smaller datasets and transfer learning. However, this approach often inherits unnecessary pre-learned features that can clutter the classifier's vector space, potentially hampering performance. This study proposes a novel paradigm for improving post-training specialized classifiers by introducing adaptive variance thresholding (AVT) followed by Neural Architecture Search (NAS). This approach led to two key outcomes: an increase in the initial accuracy of the pre-trained KOA models and a 60-fold reduction in the NAS input vector space, thus facilitating faster inference speed and a more efficient hyperparameter search. We also applied this approach to an external model trained for KOA classification. Despite its initial performance, the application of our methodology improved its average accuracy, making it one of the top three KOA classification models.

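Summary
Knee-Joint Osteoarthritis (KOA) is a prevalent cause of disability worldwide and is complex to diagnose because of its subtle radiographic markers and individualized progression. Deep learning methods are a promising avenue, but they demand extensive, diversified datasets, which are hard to obtain given restrictions on medical data collection. Existing practice typically resorts to smaller datasets and transfer learning. This approach, however, often inherits unnecessary pre-learned features that can hamper performance. This study proposes a new way to improve specialized classifiers after training by applying adaptive variance thresholding (AVT) followed by Neural Architecture Search (NAS). The approach led to two key outcomes: a higher initial accuracy for the pre-trained KOA models, and a 60-fold reduction in the NAS input vector space, which speeds up inference and makes the search more efficient. The authors also applied the approach to an external model trained for KOA classification. Despite its initial performance, the methodology improved its average accuracy, making it one of the top three KOA classification models.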
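
The idea of pruning low-variance features inherited from transfer learning can be sketched in a few lines. The example below shows plain variance thresholding on a small, made-up feature matrix, plus one hypothetical way to make the threshold data-dependent (a percentile of the observed variances). How the paper's AVT actually adapts its threshold is not stated in the abstract, so `percentile_threshold` should be read as an assumption.

```python
from statistics import pvariance

def variance_threshold(rows, threshold):
    """Keep only the feature columns whose variance exceeds `threshold`.

    Returns the kept column indices and the reduced feature matrix.
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return keep, [[row[i] for i in keep] for row in rows]

def percentile_threshold(rows, fraction):
    """Hypothetical adaptive rule: use the variance found at the given
    fraction of the sorted per-feature variances as the threshold."""
    variances = sorted(pvariance(col) for col in zip(*rows))
    index = min(int(fraction * len(variances)), len(variances) - 1)
    return variances[index]

# Made-up feature vectors (rows = samples, columns = features).
features = [
    [1.0, 0.0, 0.5],
    [1.0, 1.0, 0.5],
    [1.0, 2.0, 0.6],
    [1.0, 3.0, 0.5],
]

kept_fixed, reduced_fixed = variance_threshold(features, 0.01)
kept_adaptive, _ = variance_threshold(features, percentile_threshold(features, 0.5))
```

Dropping near-constant columns shrinks the vector that a downstream search such as NAS has to consume, which is the kind of input-space reduction the abstract reports.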

Synthesizing Bidirectional Temporal States of Knee Osteoarthritis Radiographs with Cycle-Consistent Generative Adversarial Neural Networks

paper_url: http://arxiv.org/abs/2311.05798
repo_url: None
paper_authors: Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala
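for: Synthesizing past and future stages of Knee Osteoarthritis (KOA) on genuine radiographs in order to augment data pools, enhance algorithm training, and offer prognostic insight.
methods: A CycleGAN model transforms genuine radiographs into different KOA stages, and a Convolutional Neural Network (CNN) is used to validate the transformations.
results: The model effectively shifts radiographs between disease stages. In particular, it transitions late-stage radiographs to earlier stages by eliminating osteophytes and expanding the knee joint space, both characteristic of early-stage KOA.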

Abstract
Knee Osteoarthritis (KOA), a leading cause of disability worldwide, is challenging to detect early due to subtle radiographic indicators. Diverse, extensive datasets are needed but are challenging to compile because of privacy, data collection limitations, and the progressive nature of KOA. However, a model capable of projecting genuine radiographs into different OA stages could augment data pools, enhance algorithm training, and offer pre-emptive prognostic insights. In this study, we trained a CycleGAN model to synthesize past and future stages of KOA on any genuine radiograph. The model was validated using a Convolutional Neural Network that was deceived into misclassifying disease stages in transformed images, demonstrating the CycleGAN's ability to effectively transform disease characteristics forward or backward in time. The model was particularly effective in synthesizing future disease states and showed an exceptional ability to retroactively transition late-stage radiographs to earlier stages by eliminating osteophytes and expanding knee joint space, signature characteristics of None or Doubtful KOA. The model's results signify a promising potential for enhancing diagnostic models, data augmentation, and educational and prognostic usage in healthcare. Nevertheless, further refinement, validation, and a broader evaluation process encompassing both CNN-based assessments and expert medical feedback are emphasized for future research and development.

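Summary
Knee Osteoarthritis (KOA) is a leading cause of disability worldwide, but it is hard to detect early because its radiographic indicators are subtle. Diverse, extensive datasets are needed but are difficult to compile because of privacy, data collection limitations, and the progressive nature of KOA. A model that can project genuine radiographs into different osteoarthritis stages could augment data pools, enhance algorithm training, and offer pre-emptive prognostic insights. In this study, the authors trained a CycleGAN model to synthesize past and future stages of KOA on any genuine radiograph. The model was validated with a convolutional neural network (CNN) that was deceived into misclassifying disease stages in the transformed images, which shows that the CycleGAN can transform disease characteristics forward or backward in time. The model was particularly effective at synthesizing future disease states. It also showed an exceptional ability to transition late-stage radiographs back to earlier stages by eliminating osteophytes and expanding the knee joint space, which are signature characteristics of None or Doubtful KOA. The results indicate promising potential for improving diagnostic models, data augmentation, and educational and prognostic use in healthcare. Further refinement, validation, and a broader evaluation that includes both CNN-based assessments and expert medical feedback are emphasized for future research and development.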

2023-11-10

Flatness-aware Adversarial Attack

EviPrompt: A Training-Free Evidential Prompt Generation Method for Segment Anything Model in Medical Images

A design of Convolutional Neural Network model for the Diagnosis of the COVID-19

Towards A Unified Neural Architecture for Visual Recognition and Reasoning

Image Classification using Combination of Topological Features and Neural Networks

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Learning Human Action Recognition Representations Without Real Humans

Diffusion Models for Earth Observation Use-cases: from cloud removal to urban change detection

Semantic-aware Video Representation for Few-shot Action Recognition

Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

ASSIST: Interactive Scene Nodes for Scalable and Realistic Indoor Simulation

An Automated Pipeline for Tumour-Infiltrating Lymphocyte Scoring in Breast Cancer

Automatic Report Generation for Histopathology images using pre-trained Vision Transformers

Deep Fast Vision: A Python Library for Accelerated Deep Transfer Learning Vision Prototyping

An Evaluation of Forensic Facial Recognition

Federated Learning Across Decentralized and Unshared Archives for Remote Sensing Image Classification

MonoProb: Self-Supervised Monocular Depth Estimation with Interpretable Uncertainty

Fight Fire with Fire: Combating Adversarial Patch Attacks using Pattern-randomized Defensive Patches

Exploring the Efficacy of Base Data Augmentation Methods in Deep Learning-Based Radiograph Classification of Knee Joint Osteoarthritis

Dual input stream transformer for eye-tracking line assignment

Enhancing Rock Image Segmentation in Digital Rock Physics: A Fusion of Generative AI and State-of-the-Art Neural Networks

Learning-Based Biharmonic Augmentation for Point Cloud Classification

Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval

Lidar-based Norwegian tree species detection using deep learning

Improved Positional Encoding for Implicit Neural Representation based Compact Data Representation

Ulcerative Colitis Mayo Endoscopic Scoring Classification with Active Learning and Generative Data Augmentation

Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

Refining the ONCE Benchmark with Hyperparameter Tuning

2D Image head pose estimation via latent space regression under occlusion settings

Diagonal Hierarchical Consistency Learning for Semi-supervised Medical Image Segmentation

U3DS$^3$: Unsupervised 3D Semantic Scene Segmentation

Polar-Net: A Clinical-Friendly Model for Alzheimer’s Disease Detection in OCTA Images

Keystroke Verification Challenge (KVC): Biometric and Fairness Benchmark Evaluation

Vision Big Bird: Random Sparsification for Full Attention

Comparing Male Nyala and Male Kudu Classification using Transfer Learning with ResNet-50 and VGG-16

Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments

A Neural Height-Map Approach for the Binocular Photometric Stereo Problem

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

Efficient Segmentation with Texture in Ore Images Based on Box-supervised Approach

Automated Sperm Assessment Framework and Neural Network Specialized for Sperm Video Recognition

Inter-object Discriminative Graph Modeling for Indoor Scene Recognition

Semantic Map Guided Synthesis of Wireless Capsule Endoscopy Images using Diffusion Models

Central Angle Optimization for 360-degree Holographic 3D Content

Automated Heterogeneous Low-Bit Quantization of Multi-Model Deep Learning Inference Pipeline

Watermarking Vision-Language Pre-trained Models for Multi-modal Embedding as a Service

Domain Generalization by Learning from Privileged Medical Imaging Information

Layer-wise Auto-Weighting for Non-Stationary Test-Time Adaptation

Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance Fields

Diffusion Shape Prior for Wrinkle-Accurate Cloth Registration

Adaptive Variance Thresholding: A Novel Approach to Improve Existing Deep Transfer Vision Models and Advance Automatic Knee-Joint Osteoarthritis Classification

Synthesizing Bidirectional Temporal States of Knee Osteoarthritis Radiographs with Cycle-Consistent Generative Adversarial Neural Networks