cs.CV - 2023-11-13

Assessing Test-time Variability for Interactive 3D Medical Image Segmentation with Diverse Point Prompts

  • paper_url: http://arxiv.org/abs/2311.07806
  • repo_url: https://github.com/medicl-vu/variability
  • paper_authors: Hao Li, Han Liu, Dewei Hu, Jiacheng Wang, Ipek Oguz
  • for: This paper studies the reliability and reproducibility of interactive medical image segmentation under test-time prompt variability.
  • methods: The paper relies on prompt engineering, where user-provided point prompts serve as strong priors at test time to produce accurate segmentations.
  • results: Experiments show that using multiple point prompts improves segmentation accuracy and reproducibility, and that optimizing prompt placement and count further improves the reliability of the results.
    Abstract Interactive segmentation model leverages prompts from users to produce robust segmentation. This advancement is facilitated by prompt engineering, where interactive prompts serve as strong priors during test-time. However, this is an inherently subjective and hard-to-reproduce process. The variability in user expertise and inherently ambiguous boundaries in medical images can lead to inconsistent prompt selections, potentially affecting segmentation accuracy. This issue has not yet been extensively explored for medical imaging. In this paper, we assess the test-time variability for interactive medical image segmentation with diverse point prompts. For a given target region, the point is classified into three sub-regions: boundary, margin, and center. Our goal is to identify a straightforward and efficient approach for optimal prompt selection during test-time based on three considerations: (1) benefits of additional prompts, (2) effects of prompt placement, and (3) strategies for optimal prompt selection. We conduct extensive experiments on the public Medical Segmentation Decathlon dataset for challenging colon tumor segmentation task. We suggest an optimal strategy for prompt selection during test-time, supported by comprehensive results. The code is publicly available at https://github.com/MedICL-VU/variability
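
The placement analysis hinges on classifying each candidate point prompt into the boundary, margin, or center sub-region of the target mask. A minimal sketch of one way to do this with a distance transform is shown below; the thresholds and the helper name are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def classify_point(mask: np.ndarray, point: tuple, margin_frac: float = 0.5) -> str:
    """Assign a point inside a binary mask to 'boundary', 'margin', or 'center'.

    The split uses the normalized distance to the object boundary; the 0.1 and
    margin_frac cut-offs are illustrative, not the paper's values.
    """
    dist = distance_transform_edt(mask)          # distance to background for every pixel/voxel
    d = dist[point] / (dist.max() + 1e-8)        # normalize by the deepest interior point
    if d < 0.1:
        return "boundary"
    elif d < margin_frac:
        return "margin"
    return "center"

# Example: a square object with one deep interior point and one near-boundary point
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1
print(classify_point(mask, (32, 32)))  # -> 'center'
print(classify_point(mask, (16, 32)))  # -> 'boundary'
```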

CSLP-AE: A Contrastive Split-Latent Permutation Autoencoder Framework for Zero-Shot Electroencephalography Signal Conversion

  • paper_url: http://arxiv.org/abs/2311.07788
  • repo_url: https://github.com/andersxa/cslp-ae
  • paper_authors: Anders Vestergaard Nørskov, Alexander Neergaard Zahid, Morten Mørup
  • for: To extract more generalizable representations from EEG data, supporting the study and analysis of brain function.
  • methods: Proposes a contrastive split-latent permutation autoencoder (CSLP-AE) framework in which contrastive learning guides the latent splits to explicitly represent subject (style) and task (content) characteristics.
  • results: Compared with conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training, CSLP-AE provides more generalizable characterizations of subject and task and enables zero-shot conversion between unseen subjects.
    Abstract Electroencephalography (EEG) is a prominent non-invasive neuroimaging technique providing insights into brain function. Unfortunately, EEG data exhibit a high degree of noise and variability across subjects hampering generalizable signal extraction. Therefore, a key aim in EEG analysis is to extract the underlying neural activation (content) as well as to account for the individual subject variability (style). We hypothesize that the ability to convert EEG signals between tasks and subjects requires the extraction of latent representations accounting for content and style. Inspired by recent advancements in voice conversion technologies, we propose a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework that directly optimizes for EEG conversion. Importantly, the latent representations are guided using contrastive learning to promote the latent splits to explicitly represent subject (style) and task (content). We contrast CSLP-AE to conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training and find that the proposed approach provides favorable generalizable characterizations of subject and task. Importantly, the procedure also enables zero-shot conversion between unseen subjects. While the present work only considers conversion of EEG, the proposed CSLP-AE provides a general framework for signal conversion and extraction of content (task activation) and style (subject variability) components of general interest for the modeling and analysis of biological signals.
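
To picture the split-latent idea, the sketch below encodes an EEG feature window into separate subject (style) and task (content) latents and applies a supervised-contrastive-style loss to each split; the layer sizes, loss form, and weighting are assumptions for illustration, not the published CSLP-AE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitLatentAE(nn.Module):
    """Autoencoder whose latent space is split into subject (style) and task (content) parts."""
    def __init__(self, in_dim=256, style_dim=32, content_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, style_dim + content_dim))
        self.decoder = nn.Sequential(nn.Linear(style_dim + content_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))
        self.style_dim = style_dim

    def forward(self, x):
        z = self.encoder(x)
        z_style, z_content = z[:, :self.style_dim], z[:, self.style_dim:]
        return self.decoder(z), z_style, z_content

def contrastive_loss(z, labels, temperature=0.1):
    """Pull together latents that share a label (subject or task), push apart the rest."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

# One illustrative training step
x = torch.randn(16, 256)                    # 16 EEG feature vectors
subj = torch.randint(0, 4, (16,))           # subject labels (style)
task = torch.randint(0, 3, (16,))           # task labels (content)
model = SplitLatentAE()
recon, z_s, z_c = model(x)
loss = F.mse_loss(recon, x) + contrastive_loss(z_s, subj) + contrastive_loss(z_c, task)
loss.backward()
```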

A Data-Free Approach to Mitigate Catastrophic Forgetting in Federated Class Incremental Learning for Vision Tasks

  • paper_url: http://arxiv.org/abs/2311.07784
  • repo_url: None
  • paper_authors: Sara Babakniya, Zalan Fabian, Chaoyang He, Mahdi Soltanolkotabi, Salman Avestimehr
  • for: To prevent deep learning models from forgetting previously learned information, particularly in federated learning (FL), where data is distributed and can change independently for each user.
  • methods: Proposes a generative-model-based framework for federated class incremental learning that synthesizes samples from past distributions to mitigate forgetting; to preserve privacy, the generative model is trained on the server with data-free methods, without requesting data from clients.
  • results: Extensive experiments on multiple datasets show that the method mitigates forgetting without requiring users to store old data or models.
    Abstract Deep learning models often suffer from forgetting previously learned information when trained on new data. This problem is exacerbated in federated learning (FL), where the data is distributed and can change independently for each user. Many solutions are proposed to resolve this catastrophic forgetting in a centralized setting. However, they do not apply directly to FL because of its unique complexities, such as privacy concerns and resource limitations. To overcome these challenges, this paper presents a framework for \textbf{federated class incremental learning} that utilizes a generative model to synthesize samples from past distributions. This data can be later exploited alongside the training data to mitigate catastrophic forgetting. To preserve privacy, the generative model is trained on the server using data-free methods at the end of each task without requesting data from clients. Moreover, our solution does not demand the users to store old data or models, which gives them the freedom to join/leave the training at any time. Additionally, we introduce SuperImageNet, a new regrouping of the ImageNet dataset specifically tailored for federated continual learning. We demonstrate significant improvements compared to existing baselines through extensive experiments on multiple datasets.
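
The client-side idea can be sketched as mixing synthetic replay samples of previously seen classes, produced by a server-trained generator, into each local batch; the conditional-generator interface and the number of replay samples below are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def local_update(model, generator, loader, old_classes, optimizer, device="cpu"):
    """One client epoch: train on the new task's data plus generated samples of past classes."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        if old_classes:                                                  # replay only after the first task
            y_old = torch.tensor(old_classes, device=device).repeat(4)  # a few samples per old class
            with torch.no_grad():
                x_old = generator(y_old)                                 # assumed class-conditional generator
            x, y = torch.cat([x, x_old]), torch.cat([y, y_old])
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model.state_dict()                                            # sent back for federated averaging
```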

FedOpenHAR: Federated Multi-Task Transfer Learning for Sensor-Based Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2311.07765
  • repo_url: None
  • paper_authors: Egemen İşgüder, Özlem Durmaz İncel
  • for: This paper targets the use of distributed machine learning for two sensor-based tasks: human activity recognition and device position identification.
  • methods: Uses federated transfer learning in a multi-task manner, training a shared model across multiple small datasets so that both tasks are learned jointly.
  • results: Experiments show that combining federated learning with transfer learning achieves accuracy comparable to centralized training across different datasets and settings.
    Abstract Motion sensors integrated into wearable and mobile devices provide valuable information about the device users. Machine learning and, recently, deep learning techniques have been used to characterize sensor data. Mostly, a single task, such as recognition of activities, is targeted, and the data is processed centrally at a server or in a cloud environment. However, the same sensor data can be utilized for multiple tasks and distributed machine-learning techniques can be used without the requirement of the transmission of data to a centre. This paper explores Federated Transfer Learning in a Multi-Task manner for both sensor-based human activity recognition and device position identification tasks. The OpenHAR framework is used to train the models, which contains ten smaller datasets. The aim is to obtain model(s) applicable for both tasks in different datasets, which may include only some label types. Multiple experiments are carried in the Flower federated learning environment using the DeepConvLSTM architecture. Results are presented for federated and centralized versions under different parameters and restrictions. By utilizing transfer learning and training a task-specific and personalized federated model, we obtained a similar accuracy with training each client individually and higher accuracy than a fully centralized approach.
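
One way to realize the multi-task setup is a shared DeepConvLSTM-style backbone with a separate head per task; the layer sizes and class counts below are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiTaskConvLSTM(nn.Module):
    """Shared 1D-conv + LSTM feature extractor with one head per task."""
    def __init__(self, channels=6, n_activities=8, n_positions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.activity_head = nn.Linear(128, n_activities)   # human activity recognition
        self.position_head = nn.Linear(128, n_positions)    # device position identification

    def forward(self, x):                                   # x: (batch, time, channels)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h_n, _) = self.lstm(h)
        feat = h_n[-1]
        return self.activity_head(feat), self.position_head(feat)

model = MultiTaskConvLSTM()
act_logits, pos_logits = model(torch.randn(2, 128, 6))      # two windows of 128 sensor timesteps
```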

Quality-Aware Prototype Memory for Face Representation Learning

  • paper_url: http://arxiv.org/abs/2311.07734
  • repo_url: None
  • paper_authors: Evgeny Smirnov, Vasiliy Galyuk, Evgeny Lukyanets
  • for: To make the Prototype Memory model more robust and accurate when the face images used for prototype generation are of low quality or poorly recognizable.
  • methods: Proposes a simple and effective quality-aware technique in which Prototype Memory weights face images by their quality during prototype generation, improving accuracy and stability.
  • results: Extensive experiments on multiple face recognition benchmarks show that the proposed method improves Prototype Memory performance and behaves more stably across image quality levels.
    Abstract Prototype Memory is a powerful model for face representation learning. It enables the training of face recognition models using datasets of any size, with on-the-fly generation of prototypes (classifier weights) and efficient ways of their utilization. Prototype Memory demonstrated strong results in many face recognition benchmarks. However, the algorithm of prototype generation, used in it, is prone to the problems of imperfectly calculated prototypes in case of low-quality or poorly recognizable faces in the images, selected for the prototype creation. All images of the same person, presented in the mini-batch, used with equal weights, and the resulting averaged prototype could be contaminated with imperfect embeddings of such face images. It can lead to misdirected training signals and impair the performance of the trained face recognition models. In this paper, we propose a simple and effective way to improve Prototype Memory with quality-aware prototype generation. Quality-Aware Prototype Memory uses different weights for images of different quality in the process of prototype generation. With this improvement, prototypes get more valuable information from high-quality images and less hurt by low-quality ones. We propose and compare several methods of quality estimation and usage, perform extensive experiments on the different face recognition benchmarks and demonstrate the advantages of the proposed model compared to the basic version of Prototype Memory.
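
The core mechanism is a weighted, rather than uniform, average of the per-image embeddings used to form a class prototype. A minimal sketch follows; the softmax-over-quality weighting is one plausible choice, not necessarily the exact scheme from the paper.

```python
import torch
import torch.nn.functional as F

def quality_aware_prototype(embeddings: torch.Tensor, quality: torch.Tensor, tau: float = 0.5):
    """Average L2-normalized embeddings of one identity, weighted by per-image quality scores.

    embeddings: (n_images, dim) face embeddings from the mini-batch
    quality:    (n_images,) quality estimates in [0, 1], e.g. from a face quality estimator
    """
    w = torch.softmax(quality / tau, dim=0)                 # higher quality -> larger weight
    proto = (w[:, None] * F.normalize(embeddings, dim=1)).sum(0)
    return F.normalize(proto, dim=0)                        # unit-norm prototype (classifier weight)

emb = torch.randn(5, 512)
q = torch.tensor([0.9, 0.8, 0.2, 0.85, 0.1])                # the two low-quality faces get down-weighted
prototype = quality_aware_prototype(emb, q)
```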

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.07574
  • repo_url: https://github.com/x2fd/lvis-instruct4v
  • paper_authors: Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
  • for: To improve the performance of large multimodal models (LLaVA-1.5) so they excel across a wide range of challenging multimodal benchmarks.
  • methods: Introduces a fine-grained visual instruction dataset (LVIS-Instruct4V), produced by prompting the powerful GPT-4V model with images to generate visually aligned, context-aware instruction data.
  • results: Experimental validation and case studies show that high-quality visual instruction data improves LLaVA-1.5 across a variety of multimodal benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4).
    Abstract Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are oftentimes coarse-grained. Furthermore, the instructions might even contradict the visual content without observing the entire visual context. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instructional data could improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing the LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.

Fast Normalized Cross-Correlation for Template Matching with Rotations

  • paper_url: http://arxiv.org/abs/2311.07561
  • repo_url: None
  • paper_authors: José María Almira, Harold Phelippeau, Antonio Martinez-Sanchez
  • for: To speed up template matching computations, particularly for 3D images.
  • methods: Uses a new mathematical theory that handles translations and rotations simultaneously, reducing computational complexity by avoiding repeated sampling of the rotation space.
  • results: Recovers the positions and rotation angles of template instances quickly, with potential speed-ups of several orders of magnitude for 3D images.
    Abstract Normalized cross-correlation is the reference approach to carry out template matching on images. When it is computed in Fourier space, it can handle efficiently template translations but it cannot do so with template rotations. Including rotations requires sampling the whole space of rotations, repeating the computation of the correlation each time. This article develops an alternative mathematical theory to handle efficiently, at the same time, rotations and translations. Our proposal has a reduced computational complexity because it does not require to repeatedly sample the space of rotations. To do so, we integrate the information relative to all rotated versions of the template into a unique symmetric tensor template -which is computed only once per template-. Afterward, we demonstrate that the correlation between the image to be processed with the independent tensor components of the tensorial template contains enough information to recover template instance positions and rotations. Our proposed method has the potential to speed up conventional template matching computations by a factor of several magnitude orders for the case of 3D images.
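
For contrast, the conventional baseline the paper avoids is re-running normalized cross-correlation for every sampled rotation; a 2D sketch of that brute-force approach is below (the paper's tensorial method computes a single symmetric tensor template instead of this loop).

```python
import numpy as np
from scipy.ndimage import rotate
from skimage.feature import match_template

def best_match_with_rotation(image: np.ndarray, template: np.ndarray, angle_step: float = 10.0):
    """Brute-force baseline: repeat normalized cross-correlation for every sampled rotation."""
    best = (-np.inf, None, None)                            # (score, position, angle)
    for angle in np.arange(0.0, 360.0, angle_step):
        t = rotate(template, angle, reshape=False, order=1)
        ncc = match_template(image, t, pad_input=True)      # normalized cross-correlation map
        pos = np.unravel_index(np.argmax(ncc), ncc.shape)
        if ncc[pos] > best[0]:
            best = (ncc[pos], pos, angle)
    return best                                             # cost grows linearly with the rotation sampling
```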

VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

  • paper_url: http://arxiv.org/abs/2311.07514
  • repo_url: None
  • paper_authors: Shuting He, Hao Luo, Wei Jiang, Xudong Jiang, Henghui Ding
  • for: This paper proposes a Vision-Guided Semantic-Group Network (VGSG) for text-based person search, aimed at extracting well-aligned, fine-grained visual and textual features.
  • methods: Two modules are proposed: a Semantic-Group Textual Learning (SGTL) module, which groups textual features along the channel dimension using the semantic cues of the language expression, and a Vision-guided Knowledge Transfer (VGKT) module, which uses vision-guided attention to extract visually relevant textual features.
  • results: Experiments show that VGSG outperforms state-of-the-art methods for text-based person search in both efficiency and accuracy.
    Abstract Text-based Person Search (TBPS) aims to retrieve images of target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them crossing modality. Existing methods utilize external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, in order to obtain the local textual representation, we group textual features from the channel dimension based on the semantic cues of language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, including a vision-language similarity transfer and a class probability transfer, to adaptively propagate information of the vision-guided textual features to semantic-group textual features. With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features without external tools and complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate its superiority over state-of-the-art methods.
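
The vision-guided attention in VGKT can be pictured as text-token pooling whose weights come from the visual feature; the sketch below is a simplified stand-in for that component, with dimensions chosen arbitrarily.

```python
import torch

def vision_guided_text_features(vis_feat, txt_tokens, scale=None):
    """Pool token-level textual features with attention weights derived from a visual feature.

    vis_feat:   (batch, dim) global visual feature per image
    txt_tokens: (batch, n_tokens, dim) token-level textual features
    Returns a (batch, dim) vision-guided textual feature aligned with the visual cues.
    """
    scale = scale or vis_feat.shape[-1] ** 0.5
    attn = torch.softmax(torch.einsum("bd,btd->bt", vis_feat, txt_tokens) / scale, dim=1)
    return torch.einsum("bt,btd->bd", attn, txt_tokens)

v = torch.randn(2, 256)
t = torch.randn(2, 12, 256)
guided = vision_guided_text_features(v, t)                  # (2, 256)
```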

Temporal Performance Prediction for Deep Convolutional Long Short-Term Memory Networks

  • paper_url: http://arxiv.org/abs/2311.07477
  • repo_url: None
  • paper_authors: Laura Fieback, Bidya Dash, Jakob Spiegelberg, Hanno Gottschalk
  • for: The main goal is to quantify the predictive uncertainty of deep semantic segmentation networks for use in safety-critical tasks.
  • methods: Uses convolutional long short-term memory networks that not only provide semantic segmentations but also predict the segmentations of the next timesteps; these models use cell states to propagate information from previous frames, taking a time series of inputs to predict one or more steps into the future.
  • results: Proposes a temporal post-processing method that estimates prediction performance, either by regressing the intersection over union of predicted and ground-truth segments or by classifying whether it is zero or greater than zero; cell-state-based per-segment input metrics are constructed, different estimation models are compared, and the influence of the number of considered cell states is studied.
    Abstract Quantifying predictive uncertainty of deep semantic segmentation networks is essential in safety-critical tasks. In applications like autonomous driving, where video data is available, convolutional long short-term memory networks are capable of not only providing semantic segmentations but also predicting the segmentations of the next timesteps. These models use cell states to broadcast information from previous data by taking a time series of inputs to predict one or even further steps into the future. We present a temporal postprocessing method which estimates the prediction performance of convolutional long short-term memory networks by either predicting the intersection over union of predicted and ground truth segments or classifying between intersection over union being equal to zero or greater than zero. To this end, we create temporal cell state-based input metrics per segment and investigate different models for the estimation of the predictive quality based on these metrics. We further study the influence of the number of considered cell states for the proposed metrics.
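
The post-processing stage boils down to summarizing cell-state maps over each predicted segment and fitting a small meta-model that maps those summaries to the segment's IoU. The sketch below uses synthetic stand-in data and a linear regressor; the actual metrics and estimation models in the paper differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def segment_metrics(cell_states, segment_mask):
    """Aggregate ConvLSTM cell-state maps over one predicted segment into a feature vector."""
    feats = []
    for c in cell_states:                      # one (H, W) aggregated cell-state map per considered step
        vals = c[segment_mask]
        feats += [vals.mean(), vals.std(), vals.max()]
    feats.append(float(segment_mask.sum()))    # segment size is also informative
    return np.array(feats)

# Synthetic stand-in for (cell states, predicted segment, true IoU) training triples
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(100):
    states = [rng.random((32, 32)) for _ in range(3)]       # three considered cell states
    mask = rng.random((32, 32)) > 0.7
    X.append(segment_metrics(states, mask))
    y.append(rng.random())                                  # placeholder IoU label
meta_model = LinearRegression().fit(np.stack(X), np.array(y))
print(meta_model.predict(np.stack(X[:2])))                  # predicted segment-wise IoU
```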

Masked Face Dataset Generation and Masked Face Recognition

  • paper_url: http://arxiv.org/abs/2311.07475
  • repo_url: https://github.com/luisrui/seeing-ai-system
  • paper_authors: Rui Cai, Xuying Ning, Peter N. Belhumeur
  • for: This work addresses the challenge that face masks pose to ordinary face recognition in the post-pandemic era.
  • methods: The authors create a more challenging masked-face dataset by simulating masks via keypoint detection on images from Labelled Faces in the Wild, and fine-tune models on it with a data augmentation strategy to improve performance in realistic conditions.
  • results: With fine-tuned models and data augmentation, the best test accuracy on the 50-identity masked face recognition task reaches 95%.
    Abstract In the post-pandemic era, wearing face masks has posed great challenge to the ordinary face recognition. In the previous study, researchers has applied pretrained VGG16, and ResNet50 to extract features on the elaborate curated existing masked face recognition (MFR) datasets, RMFRD and SMFRD. To make the model more adaptable to the real world situation where the sample size is smaller and the camera environment has greater changes, we created a more challenging masked face dataset ourselves, by selecting 50 identities with 1702 images from Labelled Faces in the Wild (LFW) Dataset, and simulated face masks through key point detection. The another part of our study is to solve the masked face recognition problem, and we chose models by referring to the former state of the art results, instead of directly using pretrained models, we fine tuned the model on our new dataset and use the last linear layer to do the classification directly. Furthermore, we proposed using data augmentation strategy to further increase the test accuracy, and fine tuned a new networks beyond the former study, one of the most SOTA networks, Inception ResNet v1. The best test accuracy on 50 identity MFR has achieved 95%.
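
A hedged sketch of the fine-tuning setup, using the facenet-pytorch implementation of Inception ResNet v1 re-headed for 50 identities; the augmentation choices and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from torch import nn, optim
from torchvision import transforms
from facenet_pytorch import InceptionResnetV1

# Augmentations meant to mimic varied cameras and framing of in-the-wild masked faces
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Pretrained backbone with a fresh 50-way classification layer, fine-tuned end to end
model = InceptionResnetV1(pretrained="vggface2", classify=True, num_classes=50)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```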

Language Grounded QFormer for Efficient Vision Language Understanding

  • paper_url: http://arxiv.org/abs/2311.07449
  • repo_url: None
  • paper_authors: Moulik Choraria, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney
  • for: To propose a more efficient approach to vision-language alignment and improve the efficiency of vision-language pretraining.
  • methods: Uses a Query Transformer (QFormer) approach inspired by the BLIP-2 models, bridging frozen modalities to align vision and language.
  • results: Compared to existing baselines, the method improves the efficiency of vision-language pretraining.
    Abstract Large-scale pretraining and instruction tuning have been successful for training general-purpose language models with broad competencies. However, extending to general-purpose vision-language models is challenging due to the distributional diversity in visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities. However, these approaches rely heavily on large-scale multi-modal pretraining for representation learning before eventual finetuning, incurring a huge computational overhead, poor scaling, and limited accessibility. To that end, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate the effectiveness of our strategy compared to existing baselines in improving the efficiency of vision-language pretraining.
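
A QFormer-style bridge can be pictured as a small set of learnable queries that cross-attend to frozen image features before being handed to the language model; the toy block below illustrates that structure only, with sizes picked arbitrarily and no claim of matching the paper's training objective.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Learnable queries cross-attend to frozen image features (a QFormer-style bridge)."""
    def __init__(self, dim=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                              # image_feats: (batch, n_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.cross_attn(q, image_feats, image_feats)[0]  # read from frozen vision tokens
        q = q + self.self_attn(q, q, q)[0]
        return q + self.ffn(q)                                   # (batch, n_queries, dim) for the LLM side

bridge = TinyQFormer()
out = bridge(torch.randn(2, 196, 768))                           # e.g. ViT patch features
```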

Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text

  • paper_url: http://arxiv.org/abs/2311.07446
  • repo_url: None
  • paper_authors: Zhongfei Qing, Zhongang Cai, Zhitao Yang, Lei Yang
  • for: To generate natural human motion from a story, with the potential to transform the animation, gaming, and film industries.
  • methods: Uses a contemporary large language model as a text-driven motion scheduler to extract a series of (text, position, duration) pairs from long text, and develops a text-driven motion retrieval scheme that incorporates motion matching with motion semantic and trajectory constraints.
  • results: The system generates controllable, infinitely long motions and trajectories consistent with the input text, and outperforms previous state-of-the-art motion synthesis methods on three sub-tasks: trajectory following, temporal action composition, and motion blending.
    Abstract Generating natural human motion from a story has the potential to transform the landscape of animation, gaming, and film industries. A new and challenging task, Story-to-Motion, arises when characters are required to move to various locations and perform specific motions based on a long text description. This task demands a fusion of low-level control (trajectories) and high-level control (motion semantics). Previous works in character control and text-to-motion have addressed related aspects, yet a comprehensive solution remains elusive: character control methods do not handle text description, whereas text-to-motion methods lack position constraints and often produce unstable motions. In light of these limitations, we propose a novel system that generates controllable, infinitely long motions and trajectories aligned with the input text. (1) We leverage contemporary Large Language Models to act as a text-driven motion scheduler to extract a series of (text, position, duration) pairs from long text. (2) We develop a text-driven motion retrieval scheme that incorporates motion matching with motion semantic and trajectory constraints. (3) We design a progressive mask transformer that addresses common artifacts in the transition motion such as unnatural pose and foot sliding. Beyond its pioneering role as the first comprehensive solution for Story-to-Motion, our system undergoes evaluation across three distinct sub-tasks: trajectory following, temporal action composition, and motion blending, where it outperforms previous state-of-the-art motion synthesis methods across the board. Homepage: https://story2motion.github.io/.
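
The scheduling step can be approximated as asking an LLM to emit structured (text, position, duration) triples and then parsing them for the retrieval stage; the prompt wording and JSON format below are assumptions, not the paper's exact interface.

```python
import json

SCHEDULER_PROMPT = (
    "Break the story below into an ordered list of actions. Return JSON: "
    '[{"text": "<action>", "position": "<target location>", "duration": <seconds>}, ...]\n\n'
)

def parse_schedule(llm_output: str):
    """Turn the LLM's JSON answer into (text, position, duration) triples for motion retrieval."""
    return [(e["text"], e["position"], float(e["duration"])) for e in json.loads(llm_output)]

# Example of the format such a scheduler might return for a short story
example = ('[{"text": "walk to the kitchen table", "position": "kitchen_table", "duration": 4.0},'
           ' {"text": "sit down and pour tea", "position": "kitchen_table", "duration": 6.5}]')
print(parse_schedule(example))
```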

Supersampling of Data from Structured-light Scanner with Deep Learning

  • paper_url: http://arxiv.org/abs/2311.07432
  • repo_url: None
  • paper_authors: Martin Melicherčík, Lukáš Gajdošech, Viktor Kocur, Martin Madaras
  • for: To increase the resolution of depth maps obtained from 3D cameras using structured light technology.
  • methods: Modifies the FDSR and DKN deep learning models to work with high-resolution data and implements data pre-processing techniques for stable training.
  • results: The models are trained on a custom dataset of 1200 3D scans, and the resulting high-resolution depth maps are evaluated with qualitative and quantitative metrics.
    Abstract This paper focuses on increasing the resolution of depth maps obtained from 3D cameras using structured light technology. Two deep learning models FDSR and DKN are modified to work with high-resolution data, and data pre-processing techniques are implemented for stable training. The models are trained on our custom dataset of 1200 3D scans. The resulting high-resolution depth maps are evaluated using qualitative and quantitative metrics. The approach for depth map upsampling offers benefits such as reducing the processing time of a pipeline by first downsampling a high-resolution depth map, performing various processing steps at the lower resolution and upsampling the resulting depth map or increasing the resolution of a point cloud captured in lower resolution by a cheaper device. The experiments demonstrate that the FDSR model excels in terms of faster processing time, making it a suitable choice for applications where speed is crucial. On the other hand, the DKN model provides results with higher precision, making it more suitable for applications that prioritize accuracy.
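
The pipeline benefit mentioned in the abstract, i.e. running expensive steps at low resolution and restoring detail afterwards, can be sketched as follows; the learned upsampler (FDSR or DKN) is treated as an abstract callable, and bilinear interpolation stands in when none is given.

```python
import torch
import torch.nn.functional as F

def process_at_low_res(depth: torch.Tensor, heavy_op, scale: int = 4, upsampler=None):
    """Run an expensive operation at reduced resolution, then restore full resolution.

    depth:     (1, 1, H, W) high-resolution depth map
    heavy_op:  any function operating on a depth tensor (filtering, completion, ...)
    upsampler: learned super-resolution model (e.g. FDSR/DKN); bilinear upsampling is the fallback
    """
    low = F.interpolate(depth, scale_factor=1 / scale, mode="bilinear", align_corners=False)
    low = heavy_op(low)                                      # all costly steps happen at low resolution
    if upsampler is not None:
        return upsampler(low)
    return F.interpolate(low, size=depth.shape[-2:], mode="bilinear", align_corners=False)

depth = torch.rand(1, 1, 1024, 1024)
smoothed = process_at_low_res(depth, lambda d: F.avg_pool2d(d, 3, stride=1, padding=1))
```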

Optimising Human-AI Collaboration by Learning Convincing Explanations

  • paper_url: http://arxiv.org/abs/2311.07426
  • repo_url: None
  • paper_authors: Alex J. Chan, Alihan Huyuk, Mihaela van der Schaar
  • for: To develop a collaborative system that assists humans in decision-making while ensuring safety and performance.
  • methods: Develops an algorithm called Ardent that efficiently learns individual preferences over explanations through interaction and provides the explanations most likely to convince and assist each user.
  • results: Extensive simulations and a user study on a challenging image classification task show that Ardent consistently improves decision-making over competing systems while addressing transparency and accountability concerns.
    Abstract Machine learning models are being increasingly deployed to take, or assist in taking, complicated and high-impact decisions, from quasi-autonomous vehicles to clinical decision support systems. This poses challenges, particularly when models have hard-to-detect failure modes and are able to take actions without oversight. In order to handle this challenge, we propose a method for a collaborative system that remains safe by having a human ultimately making decisions, while giving the model the best opportunity to convince and debate them with interpretable explanations. However, the most helpful explanation varies among individuals and may be inconsistent across stated preferences. To this end we develop an algorithm, Ardent, to efficiently learn a ranking through interaction and best assist humans complete a task. By utilising a collaborative approach, we can ensure safety and improve performance while addressing transparency and accountability concerns. Ardent enables efficient and effective decision-making by adapting to individual preferences for explanations, which we validate through extensive simulations alongside a user study involving a challenging image classification task, demonstrating consistent improvement over competing systems.

Robust semi-supervised segmentation with timestep ensembling diffusion models

  • paper_url: http://arxiv.org/abs/2311.07421
  • repo_url: None
  • paper_authors: Margherita Rosnati, Melanie Roschewitz, Ben Glocker
  • for: This paper addresses semi-supervised medical image segmentation, with a focus on domain generalisation.
  • methods: Uses denoising diffusion probabilistic models (DDPMs) to model the distribution of natural images and perform segmentation, and proposes an improved ensembling scheme that leverages information-dense small diffusion steps together with the regularising effect of larger steps to generate predictions.
  • results: The proposed method significantly outperforms other methods in domain-shifted settings while retaining competitive in-domain performance, highlighting the potential of DDPMs for semi-supervised medical image segmentation.
    Abstract Medical image segmentation is a challenging task, made more difficult by many datasets' limited size and annotations. Denoising diffusion probabilistic models (DDPM) have recently shown promise in modelling the distribution of natural images and were successfully applied to various medical imaging tasks. This work focuses on semi-supervised image segmentation using diffusion models, particularly addressing domain generalisation. Firstly, we demonstrate that smaller diffusion steps generate latent representations that are more robust for downstream tasks than larger steps. Secondly, we use this insight to propose an improved esembling scheme that leverages information-dense small steps and the regularising effect of larger steps to generate predictions. Our model shows significantly better performance in domain-shifted settings while retaining competitive performance in-domain. Overall, this work highlights the potential of DDPMs for semi-supervised medical image segmentation and provides insights into optimising their performance under domain shift.
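
The ensembling idea amounts to averaging segmentation predictions derived from diffusion features at several timesteps, giving the information-dense small steps more weight than the regularizing large steps. A toy sketch under those assumptions (feature shapes, heads, and the 0.7 weighting are arbitrary):

```python
import torch
import torch.nn as nn

def timestep_ensemble(seg_heads, features_by_t, small_steps, large_steps, w_small=0.7):
    """Ensemble per-timestep segmentation predictions from a diffusion model's features.

    features_by_t: dict mapping timestep t -> feature tensor for the input image
    seg_heads:     dict mapping timestep t -> head producing (B, C, H, W) logits
    """
    def avg_probs(steps):
        probs = [torch.softmax(seg_heads[t](features_by_t[t]), dim=1) for t in steps]
        return torch.stack(probs).mean(0)

    mixed = w_small * avg_probs(small_steps) + (1 - w_small) * avg_probs(large_steps)
    return mixed.argmax(dim=1)                               # final label map

# Toy usage with random features and 1x1-conv heads for two timesteps per group
feats = {t: torch.randn(1, 64, 32, 32) for t in (50, 100, 400, 600)}
heads = {t: nn.Conv2d(64, 2, 1) for t in feats}
labels = timestep_ensemble(heads, feats, small_steps=[50, 100], large_steps=[400, 600])
```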

Mitigating Backdoors within Deep Neural Networks in Data-limited Configuration

  • paper_url: http://arxiv.org/abs/2311.07417
  • repo_url: None
  • paper_authors: Soroush Hashemifar, Saeed Parsa, Morteza Zakeri-Nasrabadi
  • for: To defend deep neural networks (DNNs) against backdoor attacks and improve their security.
  • methods: Formulates characteristics of poisoned neurons and derives a backdoor suspiciousness score that ranks neurons according to their activation values, weights, and relationships with other neurons in the same layer.
  • results: Experiments on CIFAR-10 show that the method reduces the chance of a successful attack by more than 50% using only a tiny clean dataset (ten clean samples), without significantly degrading model performance, and runs three times as fast as baselines.
    Abstract As the capacity of deep neural networks (DNNs) increases, their need for huge amounts of data significantly grows. A common practice is to outsource the training process or collect more data over the Internet, which introduces the risks of a backdoored DNN. A backdoored DNN shows normal behavior on clean data while behaving maliciously once a trigger is injected into a sample at the test time. In such cases, the defender faces multiple difficulties. First, the available clean dataset may not be sufficient for fine-tuning and recovering the backdoored DNN. Second, it is impossible to recover the trigger in many real-world applications without information about it. In this paper, we formulate some characteristics of poisoned neurons. This backdoor suspiciousness score can rank network neurons according to their activation values, weights, and their relationship with other neurons in the same layer. Our experiments indicate the proposed method decreases the chance of attacks being successful by more than 50% with a tiny clean dataset, i.e., ten clean samples for the CIFAR-10 dataset, without significantly deteriorating the model's performance. Moreover, the proposed method runs three times as fast as baselines.
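
A rough sketch of scoring neurons by how unusual they look on a tiny clean set, combining activation statistics with weight magnitudes; the score used in the paper also accounts for relationships between neurons in the same layer and differs in detail.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def suspiciousness_ranking(layer: nn.Linear, clean_inputs: torch.Tensor):
    """Rank neurons of a linear layer: rarely-activated neurons with large incoming weights look suspicious."""
    acts = torch.relu(layer(clean_inputs))                   # (n_samples, n_neurons)
    mean_act = acts.mean(dim=0)
    weight_norm = layer.weight.norm(dim=1)                   # incoming weight norm per neuron
    score = weight_norm / (mean_act + 1e-6)                  # low clean activation + large weights -> high score
    return torch.argsort(score, descending=True)             # most suspicious neurons first

layer = nn.Linear(128, 64)
clean = torch.randn(10, 128)                                 # a tiny clean set, as in the paper's setting
ranked = suspiciousness_ranking(layer, clean)
```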

FIRST: A Million-Entry Dataset for Text-Driven Fashion Synthesis and Design

  • paper_url: http://arxiv.org/abs/2311.07414
  • repo_url: None
  • paper_authors: Zhen Huang, Yihao Li, Dong Pei, Jiapeng Zhou, Xuliang Ning, Jianlin Han, Xiaoguang Han, Xuejun Chen
  • for: To advance fashion synthesis and design within AI-generated content (AIGC) and help propel a revolution in the traditional fashion industry.
  • methods: Introduces FIRST, a new dataset of one million high-resolution fashion images with rich, structured textual descriptions organized at multiple hierarchical levels.
  • results: Experiments with prevalent generative models trained on FIRST demonstrate the necessity of the dataset for building more creative and imaginative fashion synthesis and design systems.
    Abstract Text-driven fashion synthesis and design is an extremely valuable part of artificial intelligence generative content(AIGC), which has the potential to propel a tremendous revolution in the traditional fashion industry. To advance the research on text-driven fashion synthesis and design, we introduce a new dataset comprising a million high-resolution fashion images with rich structured textual(FIRST) descriptions. In the FIRST, there is a wide range of attire categories and each image-paired textual description is organized at multiple hierarchical levels. Experiments on prevalent generative models trained over FISRT show the necessity of FIRST. We invite the community to further develop more intelligent fashion synthesis and design systems that make fashion design more creative and imaginative based on our dataset. The dataset will be released soon.

Towards Automatic Honey Bee Flower-Patch Assays with Paint Marking Re-Identification

  • paper_url: http://arxiv.org/abs/2311.07407
  • repo_url: None
  • paper_authors: Luke Meyers, Josué Rodríguez Cordero, Carlos Corrada Bravo, Fanfan Noel, José Agosto-Rivera, Tugrul Giray, Rémi Mégret
  • for: This paper works toward automating the analysis of behavioral assays involving honey bees in the field, where marking has to be as lightweight as possible.
  • methods: The paper uses paint markings together with contrastive learning, a ResNet backbone, and a triplet loss to re-identify individual bees.
  • results: The paper contributes a novel dataset for bee re-identification with paint markings (4392 images, 27 identities) and achieves almost perfect recognition in the closed setting where identities are known in advance; it also shows the potential to fully automate visit detection and reports preliminary compute-time results for future real-time deployment in the field on an edge device.
    Abstract In this paper, we show that paint markings are a feasible approach to automatize the analysis of behavioral assays involving honey bees in the field where marking has to be as lightweight as possible. We contribute a novel dataset for bees re-identification with paint-markings with 4392 images and 27 identities. Contrastive learning with a ResNet backbone and triplet loss led to identity representation features with almost perfect recognition in closed setting where identities are known in advance. Diverse experiments evaluate the capability to generalize to separate IDs, and show the impact of using different body parts for identification, such as using the unmarked abdomen only. In addition, we show the potential to fully automate the visit detection and provide preliminary results of compute time for future real-time deployment in the field on an edge device.
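
A compact sketch of the embedding setup implied above: a ResNet backbone producing L2-normalized embeddings trained with a triplet margin loss, where anchor and positive show the same paint-marked bee. Backbone choice, embedding size, and margin are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class BeeEmbedder(nn.Module):
    """ResNet backbone mapping a bee crop to a unit-length embedding for re-identification."""
    def __init__(self, emb_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, emb_dim)
        self.backbone = backbone

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=1)

model = BeeEmbedder()
triplet = nn.TripletMarginLoss(margin=0.2)

# anchor/positive share an identity (same paint marking); negative is a different bee
anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = triplet(model(anchor), model(positive), model(negative))
loss.backward()
```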

Processing and Segmentation of Human Teeth from 2D Images using Weakly Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.07398
  • repo_url: None
  • paper_authors: Tomáš Kunzo, Viktor Kocur, Lukáš Gajdošech, Martin Madaras
  • for: This work proposes a weakly supervised teeth segmentation method that reduces the need for manual annotation.
  • methods: The output heatmaps and intermediate feature maps of a teeth keypoint detection network are used to guide the segmentation process.
  • results: On the TriDental dataset, the approach achieves higher accuracy and robustness than previous methods.
    Abstract Teeth segmentation is an essential task in dental image analysis for accurate diagnosis and treatment planning. While supervised deep learning methods can be utilized for teeth segmentation, they often require extensive manual annotation of segmentation masks, which is time-consuming and costly. In this research, we propose a weakly supervised approach for teeth segmentation that reduces the need for manual annotation. Our method utilizes the output heatmaps and intermediate feature maps from a keypoint detection network to guide the segmentation process. We introduce the TriDental dataset, consisting of 3000 oral cavity images annotated with teeth keypoints, to train a teeth keypoint detection network. We combine feature maps from different layers of the keypoint detection network, enabling accurate teeth segmentation without explicit segmentation annotations. The detected keypoints are also used for further refinement of the segmentation masks. Experimental results on the TriDental dataset demonstrate the superiority of our approach in terms of accuracy and robustness compared to state-of-the-art segmentation methods. Our method offers a cost-effective and efficient solution for teeth segmentation in real-world dental applications, eliminating the need for extensive manual annotation efforts.
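
One way to picture how keypoint outputs can stand in for segmentation labels is to fuse the network's intermediate feature maps into an objectness-like map and gate it with each tooth's keypoint heatmap; this is a rough stand-in for the paper's scheme, with shapes and the threshold chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

def masks_from_keypoints(feature_maps, heatmaps, out_size=(256, 256), thresh=0.5):
    """Form per-tooth pseudo-masks from multi-layer features gated by keypoint heatmaps.

    feature_maps: list of (1, C_i, h_i, w_i) intermediate maps from the keypoint network
    heatmaps:     (1, K, h, w) one heatmap per detected tooth keypoint
    Returns a (K, H, W) boolean pseudo-mask per tooth.
    """
    fused = sum(F.interpolate(f, size=out_size, mode="bilinear", align_corners=False).mean(1, keepdim=True)
                for f in feature_maps)
    fused = torch.sigmoid(fused)                             # (1, 1, H, W) objectness-like map
    hm = F.interpolate(heatmaps, size=out_size, mode="bilinear", align_corners=False)
    return (fused * hm)[0] > thresh                          # gate the map by each keypoint's heatmap

feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32)]
hms = torch.rand(1, 32, 64, 64)                              # 32 tooth keypoints
masks = masks_from_keypoints(feats, hms)
```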

Evaluating the Significance of Outdoor Advertising from Driver’s Perspective Using Computer Vision

  • paper_url: http://arxiv.org/abs/2311.07390
  • repo_url: None
  • paper_authors: Zuzana Černeková, Zuzana Berger Haladová, Ján Špirka, Viktor Kocur
  • for: To evaluate the significance of roadside billboards from the driver's perspective and help address driver distraction.
  • methods: Combines a YOLOv8 detector with several object tracking methods, and trains a random forest classifier that assigns billboards to three classes based on the length of driver fixations, saliency, and size.
  • results: The best tracking approach reaches 38.5 HOTA on BillboardLamac, and the fixation classifier achieves 75.8% test accuracy.
    Abstract Outdoor advertising, such as roadside billboards, plays a significant role in marketing campaigns but can also be a distraction for drivers, potentially leading to accidents. In this study, we propose a pipeline for evaluating the significance of roadside billboards in videos captured from a driver's perspective. We have collected and annotated a new BillboardLamac dataset, comprising eight videos captured by drivers driving through a predefined path wearing eye-tracking devices. The dataset includes annotations of billboards, including 154 unique IDs and 155 thousand bounding boxes, as well as eye fixation data. We evaluate various object tracking methods in combination with a YOLOv8 detector to identify billboard advertisements with the best approach achieving 38.5 HOTA on BillboardLamac. Additionally, we train a random forest classifier to classify billboards into three classes based on the length of driver fixations achieving 75.8% test accuracy. An analysis of the trained classifier reveals that the duration of billboard visibility, its saliency, and size are the most influential features when assessing billboard significance.
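
The final classification stage is a standard random forest over per-billboard features; the snippet below reproduces that setup on synthetic data using the three features the paper identifies as most influential (visibility duration, saliency, size).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0.5, 10.0, 500),       # visibility duration (s)
    rng.uniform(0.0, 1.0, 500),        # saliency
    rng.uniform(0.001, 0.2, 500),      # on-screen size (fraction of the frame)
])
y = rng.integers(0, 3, 500)            # placeholder significance classes based on fixation length

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("feature importances:", clf.feature_importances_)
```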

Classification of developmental and brain disorders via graph convolutional aggregation

  • paper_url: http://arxiv.org/abs/2311.07370
  • repo_url: None
  • paper_authors: Ibrahim Salim, A. Ben Hamza
  • for: To improve prediction performance for brain disorders, particularly neurodevelopmental and neurodegenerative conditions.
  • methods: Proposes an aggregator normalization graph convolutional network that incorporates imaging features into graph nodes and non-imaging features into edges, and uses skip connections and identity mapping to learn discriminative node representations.
  • results: On two large datasets (ABIDE and ADNI), the method achieves relative improvements of 50% and 13.56% in classification accuracy over graph convolutional networks for autism spectrum disorder and Alzheimer's disease prediction, respectively.
    Abstract While graph convolution based methods have become the de-facto standard for graph representation learning, their applications to disease prediction tasks remain quite limited, particularly in the classification of neurodevelopmental and neurodegenerative brain disorders. In this paper, we introduce an aggregator normalization graph convolutional network by leveraging aggregation in graph sampling, as well as skip connections and identity mapping. The proposed model learns discriminative graph node representations by incorporating both imaging and non-imaging features into the graph nodes and edges, respectively, with the aim of augmenting predictive capabilities and providing a holistic perspective on the underlying mechanisms of brain disorders. Skip connections enable the direct flow of information from the input features to later layers of the network, while identity mapping helps maintain the structural information of the graph during feature learning. We benchmark our model against several recent baseline methods on two large datasets, Autism Brain Imaging Data Exchange (ABIDE) and Alzheimer's Disease Neuroimaging Initiative (ADNI), for the prediction of autism spectrum disorder and Alzheimer's disease, respectively. Experimental results demonstrate the competitive performance of our approach in comparison with recent baselines in terms of several evaluation metrics, achieving relative improvements of 50% and 13.56% in classification accuracy over graph convolutional networks on ABIDE and ADNI, respectively.
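
The skip-connection and identity-mapping idea can be illustrated with a single graph-convolution layer over a population graph (one node per subject); the normalization and sizes below are generic GCN choices, not the paper's exact aggregator.

```python
import torch
import torch.nn as nn

class GCNLayerWithSkip(nn.Module):
    """One graph-convolution layer with an identity-mapping skip connection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.skip = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x, adj):
        a = adj + torch.eye(adj.size(0))                          # add self-loops
        d_inv_sqrt = a.sum(1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]    # symmetric normalization
        return torch.relu(a_norm @ self.lin(x) + self.skip(x))    # aggregation + identity/skip path

# Population graph: imaging features on nodes, edges derived from non-imaging similarity
x = torch.randn(100, 32)                                          # 100 subjects, 32 imaging features
adj = (torch.rand(100, 100) > 0.9).float()
adj = ((adj + adj.t()) > 0).float()                               # symmetric adjacency
h = GCNLayerWithSkip(32, 64)(x, adj)                              # (100, 64) node representations
```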

ActiveDC: Distribution Calibration for Active Finetuning

  • paper_url: http://arxiv.org/abs/2311.07634
  • repo_url: None
  • paper_authors: Wenshuai Xu, Zhenhui Hu, Yu Lu, Jinzhou Meng, Qingjie Liu, Yunhong Wang
  • for: This paper studies active finetuning, i.e., how to select the subset of data to annotate for finetuning while avoiding a biased distribution and model overfitting.
  • methods: Proposes ActiveDC, which first selects samples by optimizing the distribution similarity between the chosen subset and the entire unlabeled pool in continuous space, and then calibrates the distribution of the selected samples using implicit category information in the pool.
  • results: Across three image classification tasks with different sampling ratios, ActiveDC consistently outperforms the baseline, with gains of up to 10% when the sampling ratio is low.
    Abstract The pretraining-finetuning paradigm has gained popularity in various computer vision tasks. In this paradigm, the emergence of active finetuning arises due to the abundance of large-scale data and costly annotation requirements. Active finetuning involves selecting a subset of data from an unlabeled pool for annotation, facilitating subsequent finetuning. However, the use of a limited number of training samples can lead to a biased distribution, potentially resulting in model overfitting. In this paper, we propose a new method called ActiveDC for the active finetuning tasks. Firstly, we select samples for annotation by optimizing the distribution similarity between the subset to be selected and the entire unlabeled pool in continuous space. Secondly, we calibrate the distribution of the selected samples by exploiting implicit category information in the unlabeled pool. The feature visualization provides an intuitive sense of the effectiveness of our approach to distribution calibration. We conducted extensive experiments on three image classification datasets with different sampling ratios. The results indicate that ActiveDC consistently outperforms the baseline performance in all image classification tasks. The improvement is particularly significant when the sampling ratio is low, with performance gains of up to 10%. Our code will be released.
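
A greatly simplified sketch of the selection step: greedily add the sample that keeps the selected subset's mean feature closest to the unlabeled pool's mean feature. ActiveDC's actual objective and calibration step are more involved; this only illustrates distribution-matching selection.

```python
import numpy as np

def select_by_distribution(features: np.ndarray, budget: int):
    """Greedy distribution-matching selection over precomputed (e.g. pretrained) features."""
    pool_mean = features.mean(axis=0)
    selected, current_sum = [], np.zeros_like(pool_mean)
    remaining = set(range(len(features)))
    for k in range(1, budget + 1):
        cand = np.array(sorted(remaining))
        means = (current_sum + features[cand]) / k               # subset mean if each candidate were added
        best = int(cand[np.argmin(np.linalg.norm(means - pool_mean, axis=1))])
        selected.append(best)
        current_sum += features[best]
        remaining.remove(best)
    return selected

feats = np.random.default_rng(0).normal(size=(1000, 64))         # stand-in for pretrained-model embeddings
subset = select_by_distribution(feats, budget=50)                # indices to send for annotation
```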

Registered and Segmented Deformable Object Reconstruction from a Single View Point Cloud

  • paper_url: http://arxiv.org/abs/2311.07357
  • repo_url: None
  • paper_authors: Pit Henrich, Balázs Gyenes, Paul Maria Scheikl, Gerhard Neumann, Franziska Mathis-Ullrich
  • for: In deformable object manipulation we often want to interact with specific segments of an object that are only defined on its non-deformed model, so a system is needed that can recognize and locate these segments in sensor data of deformed real-world objects.
  • methods: Proposes registering deformable objects with neural occupancy functions while additionally learning segmentation of the reconstructed object; since the output already contains the segment information, the registration step can be skipped.
  • results: Tested on a variety of deformable objects in simulation and the real world, the method robustly finds these segments; a simple sampling algorithm is also introduced to generate better training data for occupancy learning.
    Abstract In deformable object manipulation, we often want to interact with specific segments of an object that are only defined in non-deformed models of the object. We thus require a system that can recognize and locate these segments in sensor data of deformed real world objects. This is normally done using deformable object registration, which is problem specific and complex to tune. Recent methods utilize neural occupancy functions to improve deformable object registration by registering to an object reconstruction. Going one step further, we propose a system that in addition to reconstruction learns segmentation of the reconstructed object. As the resulting output already contains the information about the segments, we can skip the registration process. Tested on a variety of deformable objects in simulation and the real world, we demonstrate that our method learns to robustly find these segments. We also introduce a simple sampling algorithm to generate better training data for occupancy learning.
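
A minimal picture of "reconstruction that already carries segmentation" is an occupancy network with an extra per-point segment head; the architecture below is a generic sketch under that assumption, not the authors' network.

```python
import torch
import torch.nn as nn

class OccupancySegNet(nn.Module):
    """MLP mapping a 3D query point (plus a shape code) to occupancy and a segment label."""
    def __init__(self, code_dim=128, n_segments=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + code_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.occupancy = nn.Linear(256, 1)           # inside/outside the reconstructed object
        self.segment = nn.Linear(256, n_segments)    # which predefined part the point belongs to

    def forward(self, points, code):
        h = self.trunk(torch.cat([points, code.expand(points.size(0), -1)], dim=1))
        return torch.sigmoid(self.occupancy(h)), self.segment(h)

net = OccupancySegNet()
pts = torch.rand(1024, 3)                            # query points around the observed point cloud
code = torch.randn(1, 128)                           # latent code inferred from the single-view observation
occ, seg_logits = net(pts, code)
```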

Deformable Groupwise Registration Using a Locally Low-Rank Dissimilarity Metric for Myocardial Strain Estimation from Cardiac Cine MRI Images

  • paper_url: http://arxiv.org/abs/2311.07348
  • repo_url: None
  • paper_authors: Haiyang Chen, Juan Gao, Chenxi Hu
  • for: The paper is written for cardiac function assessment using cardiac cine MRI images.
  • methods: The proposed method uses a deformable groupwise registration-based two-step strategy with a locally low-rank (LLR) dissimilarity metric for CMR-FT.
  • results: The proposed method achieved more accurate tracking and strain estimation compared to other methods, especially in late diastole, and may facilitate more accurate assessment of cardiac dysfunction.
    Abstract Objective: Cardiovascular magnetic resonance-feature tracking (CMR-FT) represents a group of methods for myocardial strain estimation from cardiac cine MRI images. Established CMR-FT methods are mainly based on optical flow or pairwise registration. However, these methods suffer from either inaccurate estimation of large motion or drift effect caused by accumulative tracking errors. In this work, we propose a deformable groupwise registration method using a locally low-rank (LLR) dissimilarity metric for CMR-FT. Methods: The proposed method (Groupwise-LLR) tracks the feature points by a groupwise registration-based two-step strategy. Unlike the globally low-rank (GLR) dissimilarity metric, the proposed LLR metric imposes low-rankness on local image patches rather than the whole image. We quantitatively compared Groupwise-LLR with the Farneback optical flow, a pairwise registration method, and a GLR-based groupwise registration method on simulated and in vivo datasets. Results: Results from the simulated dataset showed that Groupwise-LLR achieved more accurate tracking and strain estimation compared with the other methods. Results from the in vivo dataset showed that Groupwise-LLR achieved more accurate tracking and elimination of the drift effect in late-diastole. Inter-observer reproducibility of strain estimates was similar between all studied methods. Conclusion: The proposed method estimates myocardial strains more accurately due to the application of a groupwise registration-based tracking strategy and an LLR-based dissimilarity metric. Significance: The proposed CMR-FT method may facilitate more accurate estimation of myocardial strains, especially in diastole, for clinical assessments of cardiac dysfunction.
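
The LLR dissimilarity can be illustrated as a sum of nuclear norms of local patch stacks across the registered frame group: when the frames are well aligned, each patch matrix is close to rank one and the metric is small. The patch size below is arbitrary.

```python
import numpy as np

def llr_dissimilarity(image_group: np.ndarray, patch: int = 8) -> float:
    """Locally low-rank (LLR) dissimilarity: sum of nuclear norms of local patch stacks.

    image_group: (n_frames, H, W) cine frames warped by the current deformations.
    Each spatial patch is vectorized across frames; good alignment makes these patch
    matrices nearly rank one, so a smaller total nuclear norm means better registration.
    """
    n, H, W = image_group.shape
    total = 0.0
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            block = image_group[:, i:i + patch, j:j + patch].reshape(n, -1)
            total += np.linalg.norm(block, ord="nuc")    # nuclear norm = sum of singular values
    return total

frames = np.random.rand(20, 64, 64)                      # 20 cine frames
print(llr_dissimilarity(frames))
```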

Connecting the Dots: Graph Neural Network Powered Ensemble and Classification of Medical Images

  • paper_url: http://arxiv.org/abs/2311.07321
  • repo_url: https://github.com/aryan-at-ul/aics_2023_submission
  • paper_authors: Aryan Singh, Pepijn Van de Ven, Ciarán Eising, Patrick Denny
  • for: To propose a robust, cost-effective, and scalable approach to medical image classification.
  • methods: Uses the Image Foresting Transform to optimally segment images into superpixels, converts the segmented images into graph-structured data, and applies Graph Neural Networks (GNNs) for feature extraction and relationship modeling; an ensemble of three distinct GNN architectures boosts robustness.
  • results: On pneumonia classification, the approach outperforms prevailing Deep Neural Networks (DNNs) while drastically cutting the parameter count, which reduces data costs, accelerates training, and minimizes bias.
    Abstract Deep learning models have demonstrated remarkable results for various computer vision tasks, including the realm of medical imaging. However, their application in the medical domain is limited due to the requirement for large amounts of training data, which can be both challenging and expensive to obtain. To mitigate this, pre-trained models have been fine-tuned on domain-specific data, but such an approach can suffer from inductive biases. Furthermore, deep learning models struggle to learn the relationship between spatially distant features and their importance, as convolution operations treat all pixels equally. Pioneering a novel solution to this challenge, we employ the Image Foresting Transform to optimally segment images into superpixels. These superpixels are subsequently transformed into graph-structured data, enabling the proficient extraction of features and modeling of relationships using Graph Neural Networks (GNNs). Our method harnesses an ensemble of three distinct GNN architectures to boost its robustness. In our evaluations targeting pneumonia classification, our methodology surpassed prevailing Deep Neural Networks (DNNs) in performance, all while drastically cutting down on the parameter count. This not only trims down the expenses tied to data but also accelerates training and minimizes bias. Consequently, our proposition offers a sturdy, economically viable, and scalable strategy for medical image classification, significantly diminishing dependency on extensive training data sets.
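
The graph construction can be sketched as: segment the image into superpixels, take one node per superpixel with simple appearance features, and connect superpixels that touch. SLIC is used below purely as a stand-in superpixel method; the paper derives its superpixels with the Image Foresting Transform.

```python
import numpy as np
from skimage.segmentation import slic

def image_to_graph(image: np.ndarray, n_segments: int = 100):
    """Turn a grayscale image into (node features, edge list) with one node per superpixel."""
    labels = slic(image, n_segments=n_segments, channel_axis=None, start_label=0)
    n = labels.max() + 1
    feats = np.array([image[labels == k].mean() for k in range(n)])[:, None]   # mean intensity per node

    # Connect superpixels that touch (adjacent pixels carrying different labels)
    edges = set()
    h_pairs = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v_pairs = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.vstack([h_pairs, v_pairs]):
        if a != b:
            edges.add((int(min(a, b)), int(max(a, b))))
    return feats, np.array(sorted(edges))

img = np.random.rand(128, 128)                     # stand-in for a grayscale chest X-ray
node_feats, edge_index = image_to_graph(img)       # inputs for a GNN (and for an ensemble of GNNs)
```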

What Large Language Models Bring to Text-rich VQA?

  • paper_url: http://arxiv.org/abs/2311.07306
  • repo_url: None
  • paper_authors: Xuejing Liu, Wei Tang, Xinzhe Ni, Jinghui Lu, Rui Zhao, Zechao Li, Fei Tan
  • for: This work investigates the advantages and bottlenecks of LLM-based approaches to text-rich VQA.
  • methods: The vision and language modules are separated: an external OCR model recognizes text in the image and an LLM answers the question given that text. The whole framework is training-free, relying on the in-context ability of LLMs.
  • results: The LLM-based pipeline achieves superior performance on four text-rich VQA datasets. Ablation studies show that the LLM brings stronger comprehension ability and may introduce helpful knowledge for the VQA problem, while the bottleneck lies mainly in the visual part. Combining the OCR module with MLLMs also works well.
    Abstract Text-rich VQA, namely Visual Question Answering based on text recognition in the images, is a cross-modal task that requires both image comprehension and text recognition. In this work, we focus on investigating the advantages and bottlenecks of LLM-based approaches in addressing this problem. To address the above concern, we separate the vision and language modules, where we leverage external OCR models to recognize texts in the image and Large Language Models (LLMs) to answer the question given texts. The whole framework is training-free benefiting from the in-context ability of LLMs. This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLM) on four text-rich VQA datasets. Besides, based on the ablation study, we find that LLM brings stronger comprehension ability and may introduce helpful knowledge for the VQA problem. The bottleneck for LLM to address text-rich VQA problems may primarily lie in visual part. We also combine the OCR module with MLLMs and pleasantly find that the combination of OCR module with MLLM also works. It's worth noting that not all MLLMs can comprehend the OCR information, which provides insights into how to train an MLLM that preserves the abilities of LLM.

Dynamically Weighted Factor-Graph for Feature-based Geo-localization

  • paper_url: http://arxiv.org/abs/2311.07301
  • repo_url: None
  • paper_authors: Miguel Ángel Muñoz-Bañón, Alejandro Olivas, Edison Velasco-Sánchez, Francisco A. Candelas, Fernando Torres
  • for: This work aims to improve the accuracy of feature-based geo-localization so that it delivers reliable results in ambiguous environments.
  • methods: A dynamically weighted factor-graph model optimizes the vehicle's trajectory estimation; the weights are adjusted based on information quantification of LiDAR detections, and a GNSS-based prior error estimation is included in the model.
  • results: Compared with state-of-the-art geo-localization methods in a challenging ambiguous environment, the approach shows higher accuracy and fewer deviations, and it successfully mitigates outliers and deviations when detections are lost.
    Abstract Feature-based geo-localization relies on associating features extracted from aerial imagery with those detected by the vehicle's sensors. This requires that the landmarks be observable from both sources. This lack of variety in feature types produces poor representations that lead to outliers (caused by ambiguities) and deviations (caused by missed detections). To mitigate these drawbacks, in this paper, we present a dynamically weighted factor graph model for the vehicle's trajectory estimation. The weight adjustment in this implementation depends on information quantification in the detections performed using a LiDAR sensor. Also, a prior (GNSS-based) error estimation is included in the model. Then, when the representation becomes ambiguous or sparse, the weights are dynamically adjusted to rely on the corrected prior trajectory, mitigating in this way outliers and deviations. We compare our method against state-of-the-art geo-localization ones in a challenging ambiguous environment, where we also cause detection losses. We demonstrate mitigation of the mentioned drawbacks where the other methods fail.
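As a toy illustration of the dynamic-weighting idea in the abstract, the sketch below blends a GNSS-corrected prior pose with a landmark-based estimate, inflating the detection uncertainty when the LiDAR detection carries little information. The scalar blend and the `info_score` heuristic are assumptions made for illustration only; the paper formulates this as weighted factors in a full factor graph.

```python
import numpy as np

def fuse_pose(prior_pose, prior_var, detection_pose, info_score, base_var=1.0):
    """Blend a (GNSS-corrected) prior pose with a landmark-detection pose.
    info_score in (0, 1]: low information -> inflated detection variance,
    so the fused estimate falls back on the corrected prior trajectory."""
    det_var = base_var / max(info_score, 1e-3)
    w_prior = det_var / (prior_var + det_var)      # Kalman-style scalar weight
    return w_prior * np.asarray(prior_pose) + (1.0 - w_prior) * np.asarray(detection_pose)

if __name__ == "__main__":
    # Ambiguous detection (info_score = 0.1): result stays close to the prior.
    print(fuse_pose([10.0, 5.0], 0.5, [12.0, 6.0], info_score=0.1))
    # Informative detection (info_score = 0.9): result moves toward the detection.
    print(fuse_pose([10.0, 5.0], 0.5, [12.0, 6.0], info_score=0.9))
```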

Multi Sentence Description of Complex Manipulation Action Videos

  • paper_url: http://arxiv.org/abs/2311.07285
  • repo_url: None
  • paper_authors: Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter
  • for: automatic generation of natural language descriptions for videos
  • methods: combines a hybrid statistical framework and an end-to-end framework, using LSTM stacks to generate video descriptions at different levels of detail
  • results: produces more realistic descriptions than competing approaches
    Abstract Automatic video description requires the generation of natural language statements about the actions, events, and objects in the video. An important human trait, when we describe a video, is that we are able to do this with variable levels of detail. Different from this, existing approaches for automatic video descriptions are mostly focused on single sentence generation at a fixed level of detail. Instead, here we address video description of manipulation actions where different levels of detail are required for being able to convey information about the hierarchical structure of these actions relevant also for modern approaches of robot learning. We propose one hybrid statistical and one end-to-end framework to address this problem. The hybrid method needs much less data for training, because it models statistically uncertainties within the video clips, while in the end-to-end method, which is more data-heavy, we are directly connecting the visual encoder to the language decoder without any intermediate (statistical) processing step. Both frameworks use LSTM stacks to allow for different levels of description granularity and videos can be described by simple single-sentences or complex multiple-sentence descriptions. In addition, quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.

LT-ViT: A Vision Transformer for multi-label Chest X-ray classification

  • paper_url: http://arxiv.org/abs/2311.07263
  • repo_url: None
  • paper_authors: Umar Marikkar, Sara Atito, Muhammad Awais, Adam Mahdi
  • for: This work aims to improve the performance of Vision Transformers (ViTs) on multi-label chest X-ray (CXR) classification.
  • methods: LT-ViT is a transformer that uses combined attention between image tokens and randomly initialized auxiliary tokens that represent labels.
  • results: LT-ViT (1) surpasses state-of-the-art performance using pure ViTs on two publicly available CXR datasets, (2) generalizes to other pre-training methods and is therefore agnostic to model initialization, and (3) enables model interpretability without grad-cam and its variants.
    Abstract Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for Chest X-rays (CXRs). However, we envision that there still exists a potential for improvement in vision-only training for CXRs using ViTs, by aggregating information from multiple scales, which has been proven beneficial for non-transformer networks. Hence, we have developed LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT (1) surpasses the state-of-the-art performance using pure ViTs on two publicly available CXR datasets, (2) is generalizable to other pre-training methods and therefore is agnostic to model initialization, and (3) enables model interpretability without grad-cam and its variants.
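A minimal PyTorch sketch of the core LT-ViT idea: append randomly initialized, learnable label tokens to the image tokens so both are processed with joint attention, then read one logit off each label token. The dimensions, depth, and the use of pre-extracted patch features are assumptions for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LabelTokenViT(nn.Module):
    """Transformer encoder over [label tokens ; image tokens] for multi-label CXR."""
    def __init__(self, num_labels=14, dim=256, depth=4, heads=8, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)            # assume pre-extracted patch features
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.label_tokens = nn.Parameter(torch.randn(1, num_labels, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, patch_feats):                       # (B, num_patches, 768)
        x = self.patch_embed(patch_feats) + self.pos
        lbl = self.label_tokens.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([lbl, x], dim=1))      # joint attention over all tokens
        return self.head(x[:, :lbl.size(1)]).squeeze(-1)  # one logit per label token

if __name__ == "__main__":
    print(LabelTokenViT()(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 14])
```

Reading predictions directly off per-label tokens is presumably what allows the interpretability without grad-cam that the abstract mentions, since each label token has its own attention map.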

Sketch-based Video Object Segmentation: Benchmark and Analysis

  • paper_url: http://arxiv.org/abs/2311.07261
  • repo_url: None
  • paper_authors: Ruolin Yang, Da Li, Conghui Hu, Timothy Hospedales, Honggang Zhang, Yi-Zhe Song
  • for: This work introduces a sketch-based video object segmentation task together with a new benchmark (Sketch-DAVIS16, Sketch-DAVIS17, and Sketch-YouTube-VOS) for evaluating video object segmentation algorithms.
  • methods: A strong baseline is built on STCN, a popular semi-supervised VOS method, and different types of references (language expressions, scribbles, and sketches) are compared to find the most effective design for incorporating a sketch reference.
  • results: Experiments show that sketches are a more effective yet annotation-efficient reference than language expressions and scribbles.
    Abstract Reference-based video object segmentation is an emerging topic which aims to segment the corresponding target object in each video frame referred by a given reference, such as a language expression or a photo mask. However, language expressions can sometimes be vague in conveying an intended concept and ambiguous when similar objects in one frame are hard to distinguish by language. Meanwhile, photo masks are costly to annotate and less practical to provide in a real application. This paper introduces a new task of sketch-based video object segmentation, an associated benchmark, and a strong baseline. Our benchmark includes three datasets, Sketch-DAVIS16, Sketch-DAVIS17 and Sketch-YouTube-VOS, which exploit human-drawn sketches as an informative yet low-cost reference for video object segmentation. We take advantage of STCN, a popular baseline of semi-supervised VOS task, and evaluate what the most effective design for incorporating a sketch reference is. Experimental results show sketch is more effective yet annotation-efficient than other references, such as photo masks, language and scribble.

Simultaneous Clutter Detection and Semantic Segmentation of Moving Objects for Automotive Radar Data

  • paper_url: http://arxiv.org/abs/2311.07247
  • repo_url: None
  • paper_authors: Johannes Kopp, Dominik Kellner, Aldi Piroli, Vinzenz Dallabetta, Klaus Dietmayer
  • for: This work aims to solve clutter detection and semantic segmentation of moving objects in automotive radar data simultaneously, rather than treating them as two separate tasks.
  • methods: A new augmented multi-head architecture is proposed, together with a method for representing the network's predictions for both tasks with only one output value.
  • results: An extensive evaluation shows the setup is highly effective and outperforms every existing network for semantic segmentation on the RadarScenes dataset.
    Abstract The unique properties of radar sensors, such as their robustness to adverse weather conditions, make them an important part of the environment perception system of autonomous vehicles. One of the first steps during the processing of radar point clouds is often the detection of clutter, i.e. erroneous points that do not correspond to real objects. Another common objective is the semantic segmentation of moving road users. These two problems are handled strictly separate from each other in literature. The employed neural networks are always focused entirely on only one of the tasks. In contrast to this, we examine ways to solve both tasks at the same time with a single jointly used model. In addition to a new augmented multi-head architecture, we also devise a method to represent a network's predictions for the two tasks with only one output value. This novel approach allows us to solve the tasks simultaneously with the same inference time as a conventional task-specific model. In an extensive evaluation, we show that our setup is highly effective and outperforms every existing network for semantic segmentation on the RadarScenes dataset.

DeepMetricEye: Metric Depth Estimation in Periocular VR Imagery

  • paper_url: http://arxiv.org/abs/2311.07235
  • repo_url: None
  • paper_authors: Yitong Sun, Zijian Zhou, Cyriel Diels, Ali Asadipour
  • for: To provide a lightweight framework that computes metric depth maps of the periocular region, enabling quantifiable periocular measurements on VR headsets.
  • methods: A lightweight framework derived from a re-optimized U-Net 3+ deep learning backbone converts relative measurements into measurable periocular depth maps.
  • results: Evaluated on 36 participants, the method showed notable efficacy in the periocular global precision evaluation experiment and in pupil diameter measurement.
    Abstract Despite the enhanced realism and immersion provided by VR headsets, users frequently encounter adverse effects such as digital eye strain (DES), dry eye, and potential long-term visual impairment due to excessive eye stimulation from VR displays and pressure from the mask. Recent VR headsets are increasingly equipped with eye-oriented monocular cameras to segment ocular feature maps. Yet, to compute the incident light stimulus and observe periocular condition alterations, it is imperative to transform these relative measurements into metric dimensions. To bridge this gap, we propose a lightweight framework derived from the U-Net 3+ deep learning backbone that we re-optimised, to estimate measurable periocular depth maps. Compatible with any VR headset equipped with an eye-oriented monocular camera, our method reconstructs three-dimensional periocular regions, providing a metric basis for related light stimulus calculation protocols and medical guidelines. Navigating the complexities of data collection, we introduce a Dynamic Periocular Data Generation (DPDG) environment based on UE MetaHuman, which synthesises thousands of training images from a small quantity of human facial scan data. Evaluated on a sample of 36 participants, our method exhibited notable efficacy in the periocular global precision evaluation experiment, and the pupil diameter measurement.

Multi-task learning for joint weakly-supervised segmentation and aortic arch anomaly classification in fetal cardiac MRI

  • paper_url: http://arxiv.org/abs/2311.07234
  • repo_url: https://github.com/svrtk/masc-multi-task-segmentation-and-classification
  • paper_authors: Paula Ramirez, Alena Uus, Milou P. M. van Poppel, Irina Grigorescu, Johannes K. Steinweg, David F. A. Lloyd, Kuberan Pushparajah, Andrew P. King, Maria Deprez
  • for: This study aims to aid automated multi-class segmentation and anomaly classification of 3D fetal vessels, with the goal of enhancing diagnostic confidence.
  • methods: The framework combines deep learning label propagation, attention 3D U-Net segmentation, and DenseNet121 anomaly classification.
  • results: The proposed training strategy outperforms label propagation and a network trained exclusively on propagated labels; the classifier outperforms one trained exclusively on T2w volume images, reaching an average balanced accuracy of 0.99 (0.01) after joint training.
    Abstract Congenital Heart Disease (CHD) is a group of cardiac malformations present already during fetal life, representing the prevailing category of birth defects globally. Our aim in this study is to aid 3D fetal vessel topology visualisation in aortic arch anomalies, a group which encompasses a range of conditions with significant anatomical heterogeneity. We present a multi-task framework for automated multi-class fetal vessel segmentation from 3D black blood T2w MRI and anomaly classification. Our training data consists of binary manual segmentation masks of the cardiac vessels' region in individual subjects and fully-labelled anomaly-specific population atlases. Our framework combines deep learning label propagation using VoxelMorph with 3D Attention U-Net segmentation and DenseNet121 anomaly classification. We target 11 cardiac vessels and three distinct aortic arch anomalies, including double aortic arch, right aortic arch, and suspected coarctation of the aorta. We incorporate an anomaly classifier into our segmentation pipeline, delivering a multi-task framework with the primary motivation of correcting topological inaccuracies of the segmentation. The hypothesis is that the multi-task approach will encourage the segmenter network to learn anomaly-specific features. As a secondary motivation, an automated diagnosis tool may have the potential to enhance diagnostic confidence in a decision support setting. Our results showcase that our proposed training strategy significantly outperforms label propagation and a network trained exclusively on propagated labels. Our classifier outperforms a classifier trained exclusively on T2w volume images, with an average balanced accuracy of 0.99 (0.01) after joint training. Adding a classifier improves the anatomical and topological accuracy of all correctly classified double aortic arch subjects.

Few Shot Learning for the Classification of Confocal Laser Endomicroscopy Images of Head and Neck Tumors

  • paper_url: http://arxiv.org/abs/2311.07216
  • repo_url: None
  • paper_authors: Marc Aubreville, Zhaoya Pan, Matti Sievert, Jonas Ammeling, Jonathan Ganz, Nicolai Oetter, Florian Stelzle, Ann-Kathrin Frenken, Katharina Breininger, Miguel Goncalves
  • for: This work develops automated analysis of confocal laser endomicroscopy (CLE) images to help surgeons ensure safe margins when removing head and neck tumors.
  • methods: Four popular few-shot learning (FSL) methods are evaluated for their ability to generalize to unseen anatomical domains in CLE images.
  • results: FSL on CLE images is viable but strongly affected by the number of patients and the diversity of anatomical patterns: the best method reached a median accuracy of 79.6% on vocal fold (VF) images but only 61.6% on the highly diverse sinunasal tumor (SNT) images.
    Abstract The surgical removal of head and neck tumors requires safe margins, which are usually confirmed intraoperatively by means of frozen sections. This method is, in itself, an oversampling procedure, which has a relatively low sensitivity compared to the definitive tissue analysis on paraffin-embedded sections. Confocal laser endomicroscopy (CLE) is an in-vivo imaging technique that has shown its potential in the live optical biopsy of tissue. An automated analysis of this notoriously difficult to interpret modality would help surgeons. However, the images of CLE show a wide variability of patterns, caused both by individual factors but also, and most strongly, by the anatomical structures of the imaged tissue, making it a challenging pattern recognition task. In this work, we evaluate four popular few shot learning (FSL) methods towards their capability of generalizing to unseen anatomical domains in CLE images. We evaluate this on images of sinunasal tumors (SNT) from five patients and on images of the vocal folds (VF) from 11 patients using a cross-validation scheme. The best respective approach reached a median accuracy of 79.6% on the rather homogeneous VF dataset, but only of 61.6% for the highly diverse SNT dataset. Our results indicate that FSL on CLE images is viable, but strongly affected by the number of patients, as well as the diversity of anatomical patterns.
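For orientation, one of the most common FSL baselines in such comparisons is a prototypical-network classifier: embed the few labelled support images, average them into class prototypes, and assign each query to the nearest prototype. Whether this specific method is among the four the paper evaluates is an assumption; the sketch below only illustrates the generic idea.

```python
import torch

def prototypical_predict(support_feats, support_labels, query_feats):
    """Nearest-class-mean few-shot classification in an embedding space.
    support_feats: (S, D), support_labels: (S,), query_feats: (Q, D)."""
    classes = support_labels.unique()
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in classes])
    dists = torch.cdist(query_feats, protos)        # (Q, num_classes)
    return classes[dists.argmin(dim=1)]

if __name__ == "__main__":
    sf = torch.randn(10, 32)                        # support embeddings
    sl = torch.randint(0, 2, (10,))                 # support labels
    print(prototypical_predict(sf, sl, torch.randn(4, 32)))
```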

A method for quantifying sectoral optic disc pallor in fundus photographs and its association with peripapillary RNFL thickness

  • paper_url: http://arxiv.org/abs/2311.07213
  • repo_url: None
  • paper_authors: Samuel Gibbon, Graciela Muniz-Terrera, Fabian SL Yii, Charlene Hamid, Simon Cox, Ian JC Maccormick, Andrew J Tatham, Craig Ritchie, Emanuele Trucco, Baljean Dhillon, Thomas J MacGillivray
  • for: To develop an automatic method of quantifying optic disc pallor in fundus photographs and to determine its association with peripapillary retinal nerve fibre layer (pRNFL) thickness.
  • methods: Deep learning is used to segment the optic disc, fovea, and vessels in fundus photographs and to measure pallor. The relationship between pallor and pRNFL thickness derived from optical coherence tomography scans is assessed in 118 participants; in addition, images clinically diagnosed as pale (N=45) are compared with healthy controls (N=46).
  • results: The developed software automatically quantifies optic disc pallor, and pallor is associated with pRNFL thickness globally, in the temporal inferior zone, in the nasal/temporal ratio, and across the whole disc. Pallor is also significantly higher in the patient group, and the analysis is shown to be robust to camera type, image format, and resolution.
    Abstract Purpose: To develop an automatic method of quantifying optic disc pallor in fundus photographs and determine associations with peripapillary retinal nerve fibre layer (pRNFL) thickness. Methods: We used deep learning to segment the optic disc, fovea, and vessels in fundus photographs, and measured pallor. We assessed the relationship between pallor and pRNFL thickness derived from optical coherence tomography scans in 118 participants. Separately, we used images diagnosed by clinical inspection as pale (N=45) and assessed how measurements compared to healthy controls (N=46). We also developed automatic rejection thresholds, and tested the software for robustness to camera type, image format, and resolution. Results: We developed software that automatically quantified disc pallor across several zones in fundus photographs. Pallor was associated with pRNFL thickness globally (β = -9.81 (SE = 3.16), p < 0.05), in the temporal inferior zone (β = -29.78 (SE = 8.32), p < 0.01), with the nasal/temporal ratio (β = 0.88 (SE = 0.34), p < 0.05), and in the whole disc (β = -8.22 (SE = 2.92), p < 0.05). Furthermore, pallor was significantly higher in the patient group. Lastly, we demonstrate the analysis to be robust to camera type, image format, and resolution. Conclusions: We developed software that automatically locates and quantifies disc pallor in fundus photographs and found associations between pallor measurements and pRNFL thickness. Translational relevance: We think our method will be useful for the identification, monitoring and progression of diseases characterized by disc pallor/optic atrophy, including glaucoma, compression, and potentially in neurodegenerative disorders.

Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation

  • paper_url: http://arxiv.org/abs/2311.07630
  • repo_url: None
  • paper_authors: Zhaojian Li, Bin Zhao, Yuan Yuan
  • for: This work proposes a visually guided generative adversarial approach for generating binaural stereo audio from mono audio, providing a more immersive listening experience.
  • methods: A Stereo Audio Generation Model (SAGM) uses shared spatio-temporal visual information to guide the generator and the discriminator separately; the shared visual information is updated alternately during adversarial training, allowing the generator and discriminator to exchange their respective guided knowledge.
  • results: The method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics and can deliver spatially realistic binaural stereo audio in practice.
    Abstract Binaural stereo audio is recorded by imitating the way the human ear receives sound, which provides people with an immersive listening experience. Existing approaches leverage autoencoders and directly exploit visual spatial information to synthesize binaural stereo, resulting in a limited representation of visual guidance. For the first time, we propose a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which utilizes shared spatio-temporal visual information to guide the generator and the discriminator to work separately. The shared visual information is updated alternately in the generative adversarial stage, allowing the generator and discriminator to deliver their respective guided knowledge while visually sharing. The proposed method learns bidirectional complementary visual information, which facilitates the expression of visual guidance in generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, and thus the evaluation of stereo spatial perception is essential. However, previous metrics failed to measure the spatial perception of audio. To this end, a metric to measure the spatial perception of audio is proposed for the first time. The proposed metric is capable of measuring the magnitude and direction of spatial perception in the temporal dimension. Further, considering its function, it is feasible to utilize it instead of demanding user studies to some extent. The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics. Qualitative experiments and user studies demonstrate that the method generates space-realistic stereo audio.

MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.07198
  • repo_url: https://github.com/shuweishao/monodiffusion
  • paper_authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Dingchi Sun, Peter C. Y. Chen, Zhengguo Li
  • for: This work proposes a novel self-supervised monocular depth estimation framework, MonoDiffusion, formulated as an iterative denoising process.
  • methods: Because depth ground truth is unavailable during training, a pseudo ground-truth diffusion process is developed to assist the diffusion in MonoDiffusion; a pre-trained teacher model provides a distillation loss, and a masked visual condition mechanism enhances the denoising ability of the model.
  • results: Extensive experiments on the KITTI and Make3D datasets show that MonoDiffusion outperforms prior state-of-the-art self-supervised methods. The source code will be released at https://github.com/ShuweiShao/MonoDiffusion.
    Abstract Over the past few years, self-supervised monocular depth estimation that does not depend on ground-truth during the training phase has received widespread attention. Most efforts focus on designing different types of network architectures and loss functions or handling edge cases, e.g., occlusion and dynamic objects. In this work, we introduce a novel self-supervised depth estimation framework, dubbed MonoDiffusion, by formulating it as an iterative denoising process. Because the depth ground-truth is unavailable in the training phase, we develop a pseudo ground-truth diffusion process to assist the diffusion in MonoDiffusion. The pseudo ground-truth diffusion gradually adds noise to the depth map generated by a pre-trained teacher model. Moreover,the teacher model allows applying a distillation loss to guide the denoised depth. Further, we develop a masked visual condition mechanism to enhance the denoising ability of model. Extensive experiments are conducted on the KITTI and Make3D datasets and the proposed MonoDiffusion outperforms prior state-of-the-art competitors. The source code will be available at https://github.com/ShuweiShao/MonoDiffusion.
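The pseudo ground-truth diffusion can be pictured as standard DDPM-style forward noising applied to the teacher model's depth map, with the denoiser trained to reverse it under the image condition. The schedule below is a generic placeholder, not the paper's settings.

```python
import torch

def diffuse_pseudo_depth(teacher_depth, t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Forward-diffuse a teacher depth map to step t (DDPM-style noising).
    Returns the noisy depth and the noise target used to train the denoiser."""
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(teacher_depth)
    noisy = alpha_bar.sqrt() * teacher_depth + (1.0 - alpha_bar).sqrt() * noise
    return noisy, noise

if __name__ == "__main__":
    depth = torch.rand(1, 1, 96, 320)          # teacher prediction (placeholder shape)
    noisy, target = diffuse_pseudo_depth(depth, t=500)
    print(noisy.shape, target.shape)
```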

Fitting tree model with CNN and geodesics to track vesselsand application to Ultrasound Localization Microscopy data

  • paper_url: http://arxiv.org/abs/2311.07188
  • repo_url: None
  • paper_authors: Théo Bertrand, Laurent D. Cohen
  • for: To detect important landmarks of the vascular network (via a CNN performing both localization and classification of the points of interest) and to represent vessels as edges of a minimal-distance tree graph.
  • methods: Geodesic methods relevant to vessel detection and geometry are used in the space of positions and orientations, so that 2D vessels can be accurately represented as trees.
  • results: The scarcity of well-annotated ULM data is an obstacle to localizing vascular landmarks, but the orientation score built from ULM data yields good geodesics for tracking blood vessels.
    Abstract Segmentation of tubular structures in vascular imaging is a well studied task, although it is rare that we try to infuse knowledge of the tree-like structure of the regions to be detected. Our work focuses on detecting the important landmarks in the vascular network (via CNN performing both localization and classification of the points of interest) and representing vessels as the edges in some minimal distance tree graph. We leverage geodesic methods relevant to the detection of vessels and their geometry, making use of the space of positions and orientations so that 2D vessels can be accurately represented as trees. We build our model to carry tracking on Ultrasound Localization Microscopy (ULM) data, proposing to build a good cost function for tracking on this type of data. We also test our framework on synthetic and eye fundus data. Results show that scarcity of well annotated ULM data is an obstacle to localization of vascular landmarks but the Orientation Score built from ULM data yields good geodesics for tracking blood vessels.

Regenerating Arbitrary Video Sequences with Distillation Path-Finding

  • paper_url: http://arxiv.org/abs/2311.07170
  • repo_url: None
  • paper_authors: Thi-Ngoc-Hanh Le, Sheng-Yi Yao, Chun-Te Wu, Tong-Yee Lee
  • for: This paper provides an interactive framework that lets users regenerate video animation sequences according to their preferred starting frame.
  • methods: A network called RSFNet learns feature correlations over the frame set of the given video, and a novel path-finding algorithm (SDPF) uses the motion directions of the source video to generate new, plausible animation sequences.
  • results: The framework produces new animation sequences for cartoon and natural scenes that are more accurate and natural, advancing prior works and existing commercial applications.
    Abstract While video has long been a widespread form of visualization, the animation sequences within it serve as a form of storytelling. Producing an animation requires intensive human labor from skilled professional artists to obtain plausible animation in both content and motion direction, especially for animations with complex content, multiple moving objects, and dense movement. This paper presents an interactive framework to generate new sequences according to the users' preference on the starting frame. The critical contrast of our approach versus prior work and existing commercial applications is that novel sequences with arbitrary starting frame are produced by our system with a consistent degree in both content and motion direction. To achieve this effectively, we first learn the feature correlation on the frameset of the given video through a proposed network called RSFNet. Then, we develop a novel path-finding algorithm, SDPF, which formulates the knowledge of motion directions of the source video to estimate the smooth and plausible sequences. The extensive experiments show that our framework can produce new animations on the cartoon and natural scenes and advance prior works and commercial applications to enable users to obtain more predictable results.

NDDepth: Normal-Distance Assisted Monocular Depth Estimation and Completion

  • paper_url: http://arxiv.org/abs/2311.07166
  • repo_url: https://github.com/ShuweiShao/NDDepth
  • paper_authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Peter C. Y. Chen, Zhengguo Li
  • for: This work proposes physics (geometry)-driven deep learning frameworks for monocular depth estimation and completion.
  • methods: Instead of directly estimating the depth map, the method estimates surface normal and plane-to-origin distance maps and converts them into depth; an additional depth head is integrated to strengthen robustness.
  • results: Extensive experiments on the NYU-Depth-v2, KITTI, and SUN RGB-D datasets show the method outperforms prior state-of-the-art monocular depth estimation and completion competitors.
    Abstract Over the past few years, monocular depth estimation and completion have been paid more and more attention from the computer vision community because of their widespread applications. In this paper, we introduce novel physics (geometry)-driven deep learning frameworks for these two tasks by assuming that 3D scenes are constituted with piece-wise planes. Instead of directly estimating the depth map or completing the sparse depth map, we propose to estimate the surface normal and plane-to-origin distance maps or complete the sparse surface normal and distance maps as intermediate outputs. To this end, we develop a normal-distance head that outputs pixel-level surface normal and distance. Meanwhile, the surface normal and distance maps are regularized by a developed plane-aware consistency constraint, which are then transformed into depth maps. Furthermore, we integrate an additional depth head to strengthen the robustness of the proposed frameworks. Extensive experiments on the NYU-Depth-v2, KITTI and SUN RGB-D datasets demonstrate that our method exceeds in performance prior state-of-the-art monocular depth estimation and completion competitors. The source code will be available at https://github.com/ShuweiShao/NDDepth.
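The normal-distance-to-depth conversion follows from plane geometry: a pixel back-projected along the ray r = K^{-1}[u, v, 1]^T that lies on the plane n·P = d has depth z = d / (n·r). A NumPy sketch of that conversion (not the authors' code):

```python
import numpy as np

def depth_from_normal_distance(normal, distance, K):
    """Convert per-pixel surface normal and plane-to-origin distance into depth.
    normal: (H, W, 3), distance: (H, W), K: (3, 3) camera intrinsics."""
    h, w = distance.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                   # r = K^{-1} p per pixel
    denom = np.sum(normal * rays, axis=-1)                            # n . r
    return distance / np.clip(np.abs(denom), 1e-6, None)

if __name__ == "__main__":
    K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
    n = np.tile([0.0, 0.0, 1.0], (240, 320, 1))                       # fronto-parallel plane
    d = np.full((240, 320), 3.0)
    print(depth_from_normal_distance(n, d, K).mean())                 # ~3.0
```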

CycleGANAS: Differentiable Neural Architecture Search for CycleGAN

  • paper_url: http://arxiv.org/abs/2311.07162
  • repo_url: None
  • paper_authors: Taegun An, Changhee Joo
  • for: This paper searches neural network architectures for CycleGAN, which performs unpaired image-to-image translation.
  • methods: The framework designs architectures consisting of a stack of simple ResNet-based cells and develops a search method that effectively explores the large search space.
  • results: The framework not only effectively discovers high-performance architectures, but also successfully addresses the data imbalance problem.
    Abstract We develop a Neural Architecture Search (NAS) framework for CycleGAN that carries out unpaired image-to-image translation task. Extending previous NAS techniques for Generative Adversarial Networks (GANs) to CycleGAN is not straightforward due to the task difference and greater search space. We design architectures that consist of a stack of simple ResNet-based cells and develop a search method that effectively explore the large search space. We show that our framework, called CycleGANAS, not only effectively discovers high-performance architectures that either match or surpass the performance of the original CycleGAN, but also successfully address the data imbalance by individual architecture search for each translation direction. To our best knowledge, it is the first NAS result for CycleGAN and shed light on NAS for more complex structures.

Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection

  • paper_url: http://arxiv.org/abs/2311.07152
  • repo_url: https://github.com/HuangJunJie2017/BEVDet
  • paper_authors: Junjie Huang, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du
  • for: This paper proposes a new 3D object detection method that addresses the overfitting problem in LiDAR-camera fusion pipelines.
  • methods: Following the 'Detecting As Labeling' (DAL) perspective, a simple prediction pipeline is constructed by imitating the data annotation process, and it is trained in the simplest way to minimize dependency and improve portability.
  • results: Compared with existing methods, the proposed DAL offers clear advantages in both performance and portability, making it an ideal baseline for future development and practical deployment.
    Abstract 3D object Detection with LiDAR-camera encounters overfitting in algorithm development which is derived from the violation of some fundamental rules. We refer to the data annotation in dataset construction for theory complementing and argue that the regression task prediction should not involve the feature from the camera branch. By following the cutting-edge perspective of 'Detecting As Labeling', we propose a novel paradigm dubbed DAL. With the most classical elementary algorithms, a simple predicting pipeline is constructed by imitating the data annotation process. Then we train it in the simplest way to minimize its dependency and strengthen its portability. Though simple in construction and training, the proposed DAL paradigm not only substantially pushes the performance boundary but also provides a superior trade-off between speed and accuracy among all existing methods. With comprehensive superiority, DAL is an ideal baseline for both future work development and practical deployment. The code has been released to facilitate future work on https://github.com/HuangJunJie2017/BEVDet.

PadChannel: Improving CNN Performance through Explicit Padding Encoding

  • paper_url: http://arxiv.org/abs/2311.07623
  • repo_url: https://github.com/aussieseaweed/pad-channel
  • paper_authors: Juho Kim
  • for: To improve CNN feature extraction by encoding padding status as an additional input channel, so that the network can easily distinguish genuine pixels from padded regions.
  • methods: The proposed PadChannel padding method encodes the padding status as an extra input channel.
  • results: Incorporating PadChannel into several prominent CNN architectures yields small performance improvements and notable reductions in variance on the ImageNet-1K image classification task, at a marginal increase in computational cost.
    Abstract In convolutional neural networks (CNNs), padding plays a pivotal role in preserving spatial dimensions throughout the layers. Traditional padding techniques do not explicitly distinguish between the actual image content and the padded regions, potentially causing CNNs to incorrectly interpret the boundary pixels or regions that resemble boundaries. This ambiguity can lead to suboptimal feature extraction. To address this, we propose PadChannel, a novel padding method that encodes padding statuses as an additional input channel, enabling CNNs to easily distinguish genuine pixels from padded ones. By incorporating PadChannel into several prominent CNN architectures, we observed small performance improvements and notable reductions in the variances on the ImageNet-1K image classification task at marginal increases in the computational cost. The source code is available at https://github.com/AussieSeaweed/pad-channel
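A minimal PyTorch sketch of the idea: zero-pad the input as usual, append a binary channel that marks which pixels are padding, and convolve over the concatenated tensor. The class name and defaults are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PadChannelConv2d(nn.Module):
    """Convolution whose input gets an extra channel encoding padding status
    (0 = genuine pixel, 1 = padded pixel)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.padding = padding
        self.conv = nn.Conv2d(in_ch + 1, out_ch, kernel_size, padding=0)

    def forward(self, x):
        n, _, h, w = x.shape
        p = self.padding
        x_pad = F.pad(x, [p, p, p, p], value=0.0)
        status = torch.ones(n, 1, h + 2 * p, w + 2 * p, device=x.device, dtype=x.dtype)
        status[:, :, p:p + h, p:p + w] = 0.0          # interior pixels are genuine
        return self.conv(torch.cat([x_pad, status], dim=1))

if __name__ == "__main__":
    layer = PadChannelConv2d(3, 16)
    print(layer(torch.randn(2, 3, 32, 32)).shape)     # torch.Size([2, 16, 32, 32])
```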

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2311.07125
  • repo_url: https://github.com/dazhangyu123/acmil
  • paper_authors: Yunlong Zhang, Honglin Li, Yuxuan Sun, Sunyi Zheng, Chenglu Zhu, Lin Yang
  • for: This work addresses the overfitting problem that multiple instance learning (MIL) methods encounter in whole slide image (WSI) analysis.
  • methods: Attention-Challenging MIL (ACMIL) forces the attention mechanism to capture more challenging predictive instances, using two techniques: Multiple Branch Attention (MBA) and Stochastic Top-K Instance Masking (STKIM).
  • results: Evaluated on three WSI datasets, ACMIL outperforms state-of-the-art methods; heatmap visualization, UMAP visualization, and attention-value statistics further illustrate its effectiveness in overcoming overfitting. The code is available at https://github.com/dazhangyu123/ACMIL.
    Abstract Overfitting remains a significant challenge in the application of Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) analysis. Visualizing heatmaps reveals that current MIL methods focus on a subset of predictive instances, hindering effective model generalization. To tackle this, we propose Attention-Challenging MIL (ACMIL), aimed at forcing the attention mechanism to capture more challenging predictive instances. ACMIL incorporates two techniques, Multiple Branch Attention (MBA) to capture richer predictive instances and Stochastic Top-K Instance Masking (STKIM) to suppress simple predictive instances. Evaluation on three WSI datasets outperforms state-of-the-art methods. Additionally, through heatmap visualization, UMAP visualization, and attention value statistics, this paper comprehensively illustrates ACMIL's effectiveness in overcoming the overfitting challenge. The source code is available at \url{https://github.com/dazhangyu123/ACMIL}.
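The STKIM component can be pictured as randomly suppressing the most-attended instances during training so that attention is forced to spread to harder predictive instances. The sketch below assumes per-bag attention logits and masking before the softmax; it is a simplified reading of the idea, not the released code.

```python
import torch

def stochastic_topk_instance_masking(attn_logits, k=10, p=0.5):
    """Randomly drop (mask to -inf) each of the top-k attended instances with prob p,
    then renormalize the attention over the remaining instances of the bag."""
    masked = attn_logits.clone()
    topk = torch.topk(attn_logits, k=min(k, attn_logits.numel())).indices
    drop = torch.rand(topk.shape, device=attn_logits.device) < p
    masked[topk[drop]] = float("-inf")
    return torch.softmax(masked, dim=0)

if __name__ == "__main__":
    logits = torch.randn(200)                # one attention logit per instance (patch)
    print(stochastic_topk_instance_masking(logits, k=5).sum())  # tensor(1.)
```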

SpectralGPT: Spectral Foundation Model

  • paper_url: http://arxiv.org/abs/2311.07113
  • repo_url: None
  • paper_authors: Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Xiuping Jia, Antonio Plaza, Gamba Paolo, Jon Atli Benediktsson, Jocelyn Chanussot
  • for: This work develops a universal foundation model for spectral remote sensing (RS) imagery to advance applications of spectral RS data.
  • methods: SpectralGPT is a novel 3D generative pretrained transformer (GPT) that handles spectral RS images of varying sizes, resolutions, time series, and regions in a progressive training fashion, leverages 3D token generation for spatial-spectral coupling, and captures spectrally sequential patterns via multi-target reconstruction.
  • results: Pretrained on one million spectral RS images, the resulting models have over 600 million parameters and deliver significant performance improvements on four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection.
    Abstract The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which offers valuable information for scene understanding, especially in remote sensing (RS) applications. To fill this gap, we created for the first time a universal RS foundation model, named SpectralGPT, which is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT). Compared to existing foundation models, SpectralGPT 1) accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data; 2) leverages 3D token generation for spatial-spectral coupling; 3) captures spectrally sequential patterns via multi-target reconstruction; 4) trains on one million spectral RS images, yielding models with over 600 million parameters. Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience across four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection.

  • paper_url: http://arxiv.org/abs/2311.07090
  • repo_url: None
  • paper_authors: Yachun Mi, Yu Li, Yan Shu, Chen Hui, Puchao Zhou, Shaohui Liu
  • for: This paper focuses on video quality assessment (VQA), which aims to simulate how the human visual system (HVS) perceives video quality.
  • methods: The proposed CLiF-VQA considers features related to human feelings as well as spatial features of videos. To extract feeling-related features, the consistency between CLIP and human feelings in video perception is explored for the first time: multiple objective and subjective descriptions closely related to human feelings are designed as prompts, and a CLIP-based semantic feature extractor (SFE) slides over multiple regions of each video frame.
  • results: Extensive experiments show that the proposed CLiF-VQA performs excellently on several VQA datasets.
    Abstract Video Quality Assessment (VQA) aims to simulate the process of perceiving video quality by the human visual system (HVS). The judgments made by HVS are always influenced by human subjective feelings. However, most of the current VQA research focuses on capturing various distortions in the spatial and temporal domains of videos, while ignoring the impact of human feelings. In this paper, we propose CLiF-VQA, which considers both features related to human feelings and spatial features of videos. In order to effectively extract features related to human feelings from videos, we explore the consistency between CLIP and human feelings in video perception for the first time. Specifically, we design multiple objective and subjective descriptions closely related to human feelings as prompts. Further we propose a novel CLIP-based semantic feature extractor (SFE) which extracts features related to human feelings by sliding over multiple regions of the video frame. In addition, we further capture the low-level-aware features of the video through a spatial feature extraction module. The two different features are then aggregated thereby obtaining the quality score of the video. Extensive experiments show that the proposed CLiF-VQA exhibits excellent performance on several VQA datasets.
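The prompt-similarity part can be sketched with the public OpenAI CLIP package: encode a frame and a set of feeling-related text prompts, then score them by cosine similarity. The prompts below are hypothetical stand-ins for the paper's objective and subjective descriptions, and the sketch omits the sliding-region semantic feature extractor (SFE).

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Hypothetical feeling-related prompts (the paper designs its own descriptions).
PROMPTS = [
    "a video frame that feels pleasant and comfortable to watch",
    "a video frame that feels annoying to watch",
    "a blurry, distorted video frame that causes discomfort",
]

def feeling_scores(frame: Image.Image, device: str = "cpu") -> torch.Tensor:
    """Cosine similarity between one frame and each feeling-related prompt."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize(PROMPTS).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(0)      # one score per prompt
```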

GazeForensics: DeepFake Detection via Gaze-guided Spatial Inconsistency Learning

  • paper_url: http://arxiv.org/abs/2311.07075
  • repo_url: None
  • paper_authors: Qinlin He, Chunlei Peng, Dechuang Liu, Nannan Wang, Xinbo Gao
  • for: This work aims to improve the accuracy of DeepFake detection, which is pivotal for personal privacy and public safety.
  • methods: Gaze representations obtained from a 3D gaze estimation model are used to regularize the corresponding representation within the DeepFake detection model, while general features are integrated to further enhance performance.
  • results: Experiments show that the proposed GazeForensics outperforms current state-of-the-art methods.
    Abstract DeepFake detection is pivotal in personal privacy and public safety. With the iterative advancement of DeepFake techniques, high-quality forged videos and images are becoming increasingly deceptive. Prior research has seen numerous attempts by scholars to incorporate biometric features into the field of DeepFake detection. However, traditional biometric-based approaches tend to segregate biometric features from general ones and freeze the biometric feature extractor. These approaches resulted in the exclusion of valuable general features, potentially leading to a performance decline and, consequently, a failure to fully exploit the potential of biometric information in assisting DeepFake detection. Moreover, insufficient attention has been dedicated to scrutinizing gaze authenticity within the realm of DeepFake detection in recent years. In this paper, we introduce GazeForensics, an innovative DeepFake detection method that utilizes gaze representation obtained from a 3D gaze estimation model to regularize the corresponding representation within our DeepFake detection model, while concurrently integrating general features to further enhance the performance of our model. Experiment results reveal that our proposed GazeForensics outperforms the current state-of-the-art methods.
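One way to picture the gaze-guided learning is a standard classification loss plus a regularization term that pulls part of the detector's representation toward the frozen 3D gaze estimator's features. The cosine form and the weighting below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gaze_regularized_loss(logits, labels, det_gaze_feat, gaze_feat, lam=0.1):
    """DeepFake classification loss + gaze-representation consistency term.
    det_gaze_feat: gaze-related slice of the detector's features, (B, D)
    gaze_feat: features from a frozen 3D gaze estimation model, (B, D)"""
    cls = F.cross_entropy(logits, labels)
    reg = 1.0 - F.cosine_similarity(det_gaze_feat, gaze_feat.detach(), dim=-1).mean()
    return cls + lam * reg

if __name__ == "__main__":
    loss = gaze_regularized_loss(torch.randn(4, 2), torch.tensor([0, 1, 1, 0]),
                                 torch.randn(4, 128), torch.randn(4, 128))
    print(loss.item())
```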

$L_0$-Sampler: An $L_{0}$ Model Guided Volume Sampling for NeRF

  • paper_url: http://arxiv.org/abs/2311.07044
  • repo_url: None
  • paper_authors: Liangchen Li, Juyong Zhang
  • for: This paper aims to improve Neural Radiance Fields (NeRF) by proposing a new sampling strategy called $L_0$-Sampler.
  • methods: The $L_0$-Sampler incorporates the $L_0$ model into the weight function $w(t)$ to guide the sampling process, using piecewise exponential functions for interpolation.
  • results: The proposed $L_0$-Sampler achieves stable performance improvements in NeRF and related tasks such as 3D reconstruction, and can be implemented with only a few lines of code.
    Abstract Since being proposed, Neural Radiance Fields (NeRF) have achieved great success in related tasks, mainly adopting the hierarchical volume sampling (HVS) strategy for volume rendering. However, the HVS of NeRF approximates distributions using piecewise constant functions, which provides a relatively rough estimation. Based on the observation that a well-trained weight function $w(t)$ and the $L_0$ distance between points and the surface have very high similarity, we propose $L_0$-Sampler by incorporating the $L_0$ model into $w(t)$ to guide the sampling process. Specifically, we propose to use piecewise exponential functions rather than piecewise constant functions for interpolation, which can not only approximate quasi-$L_0$ weight distributions along rays quite well but also can be easily implemented with few lines of code without additional computational burden. Stable performance improvements can be achieved by applying $L_0$-Sampler to NeRF and its related tasks like 3D reconstruction. Code is available at https://ustc3dv.github.io/L0-Sampler/ .
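The piecewise exponential interpolation amounts to log-linear interpolation of the weights between sample knots: within [t_i, t_{i+1}], w(t) = w_i (w_{i+1}/w_i)^u with u = (t - t_i)/(t_{i+1} - t_i). A small NumPy sketch of that evaluation (the inverse-CDF sampling step that follows is omitted):

```python
import numpy as np

def piecewise_exponential_weights(t_knots, w_knots, t_query, eps=1e-8):
    """Log-linear (piecewise exponential) interpolation of weights w(t) at t_query."""
    t = np.asarray(t_knots, dtype=float)
    w = np.maximum(np.asarray(w_knots, dtype=float), eps)   # keep weights positive
    idx = np.clip(np.searchsorted(t, t_query, side="right") - 1, 0, len(t) - 2)
    u = (np.asarray(t_query, dtype=float) - t[idx]) / (t[idx + 1] - t[idx])
    return w[idx] * (w[idx + 1] / w[idx]) ** u

if __name__ == "__main__":
    t_knots = np.linspace(0.0, 1.0, 6)
    w_knots = np.array([0.05, 0.10, 0.80, 0.30, 0.10, 0.05])
    print(piecewise_exponential_weights(t_knots, w_knots, np.linspace(0.0, 1.0, 11)))
```

Compared with the piecewise constant interpolation used by NeRF's hierarchical volume sampling, this gives a smoother, sharper approximation of the quasi-$L_0$ weight distribution along each ray.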

Open-Vocabulary Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.07042
  • repo_url: None
  • paper_authors: Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, Yanning Zhang
  • for: This paper addresses open-vocabulary video anomaly detection (OVVAD), leveraging pre-trained large models to detect and categorize both seen and unseen anomalies.
  • methods: OVVAD is decoupled into two mutually complementary tasks, class-agnostic detection and class-specific classification, which are jointly optimized. A semantic knowledge injection module introduces semantic knowledge from large language models for the detection task, and an anomaly synthesis module generates pseudo unseen anomaly videos with large vision generation models for the classification task.
  • results: Extensive experiments on three widely-used benchmarks show that the model achieves state-of-the-art performance on the OVVAD task.
    Abstract Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal. However, current approaches are inherently limited to a closed-set setting and may struggle in open-world applications where there can be anomaly categories in the test data unseen during training. A few recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos. However, such a setting focuses on predicting frame anomaly scores, having no ability to recognize the specific categories of anomalies, despite the fact that this ability is essential for building more informed video surveillance systems. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies. To this end, we propose a model that decouples OVVAD into two mutually complementary tasks -- class-agnostic detection and class-specific classification -- and jointly optimizes both tasks. Particularly, we devise a semantic knowledge injection module to introduce semantic knowledge from large language models for the detection task, and design a novel anomaly synthesis module to generate pseudo unseen anomaly videos with the help of large vision generation models for the classification task. These semantic knowledge and synthesis anomalies substantially extend our model's capability in detecting and categorizing a variety of seen and unseen anomalies. Extensive experiments on three widely-used benchmarks demonstrate our model achieves state-of-the-art performance on OVVAD task.

Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2311.07622
  • repo_url: None
  • paper_authors: Junyang Chen, Hanjiang Lai
  • for: This paper focuses on zero-shot composed image retrieval (ZS-CIR), which aims to retrieve a target image based on textual modifications to a reference image without triplet labeling.
  • methods: The paper introduces a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task. The approach uses the text and the masked image to learn the modifications of the original image.
  • results: The approach significantly outperforms the baseline models on three ZS-CIR datasets, including FashionIQ, CIRR, and CIRCO.
    Abstract Zero-shot composed image retrieval (ZS-CIR), which aims to retrieve a target image based on textual modifications to a reference image without triplet labeling, has gained more and more attention. Current ZS-CIR research mainly relies on two unlabeled pre-trained models: the vision-language model, e.g., CLIP, and the Pic2Word/textual inversion model. However, the pre-trained models and CIR tasks have substantial discrepancies, where the pre-trained models learn the similarities between vision and language but CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task. We first reformulate the pre-trained vision-language contrastive learning as the CIR task, where we randomly mask input image patches to generate $\langle$masked image, text, image$\rangle$ triple from an image-text pair. Then, we propose a masked tuning, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, it can learn to capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on three ZS-CIR datasets, including FashionIQ, CIRR, and CIRCO.
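A minimal sketch of the masked-tuning objective described in the abstract (not the paper's implementation): patches of the reference image are randomly zeroed out, and a contrastive, InfoNCE-style loss pulls the fused (masked image, text) embedding toward the embedding of the original image. The encoders and the additive fusion below are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mask_patches(images, patch=16, mask_ratio=0.5):
    """Zero out a random subset of non-overlapping patches."""
    b, c, h, w = images.shape
    ph, pw = h // patch, w // patch
    keep = (torch.rand(b, 1, ph, pw, device=images.device) > mask_ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * keep


def masked_tuning_loss(image_encoder, fuse, images, text_feats, tau=0.07):
    masked = mask_patches(images)
    query = F.normalize(fuse(image_encoder(masked), text_feats), dim=-1)
    target = F.normalize(image_encoder(images), dim=-1)
    logits = query @ target.t() / tau            # (B, B) similarity matrix
    labels = torch.arange(images.size(0), device=images.device)
    return F.cross_entropy(logits, labels)       # retrieve the original image


# Tiny stand-ins just to show the call pattern (real use would plug in a
# CLIP-style image encoder and text features from a text encoder).
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
fuse = lambda img, txt: img + txt                # naive additive fusion placeholder
loss = masked_tuning_loss(image_encoder, fuse,
                          torch.randn(4, 3, 64, 64), torch.randn(4, 128))
print(loss.item())
```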

TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction

  • paper_url: http://arxiv.org/abs/2311.07033
  • repo_url: None
  • paper_authors: Ruiquan Ge, Xiangyang Hu, Rungen Huang, Gangyong Jia, Yaqi Wang, Renshu Gu, Changmiao Wang, Elazab Ahmed, Linyan Wang, Juan Ye, Ye Li
  • for: Predicting the survival time of cancer patients to assist clinicians in developing treatment protocols.
  • methods: Proposes a deep-learning-based Two-stream Transformer-based Multimodal Fusion Network (TTMFN) that fuses pathological images and gene expression data, combining a two-stream multimodal co-attention transformer module with a multi-head attention pooling approach to improve prediction performance (a rough sketch follows after this entry).
  • results: On four datasets from The Cancer Genome Atlas, TTMFN achieves the best or competitive results compared to state-of-the-art methods in predicting patients' overall survival.
    Abstract Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer disease and improve survival prediction. Currently, deep learning-based approaches have experienced increasing success in survival prediction by integrating pathological images and gene expression data. However, most existing approaches overlook the intra-modality latent information and the complex inter-modality correlations. Furthermore, existing modalities do not fully exploit the immense representational capabilities of neural networks for feature aggregation and disregard the importance of relationships between features. Therefore, it is highly recommended to address these issues in order to enhance the prediction performance by proposing a novel deep learning-based method. We propose a novel framework named Two-stream Transformer-based Multimodal Fusion Network for survival prediction (TTMFN), which integrates pathological images and gene expression data. In TTMFN, we present a two-stream multimodal co-attention transformer module to take full advantage of the complex relationships between different modalities and the potential connections within the modalities. Additionally, we develop a multi-head attention pooling approach to effectively aggregate the feature representations of the two modalities. The experiment results on four datasets from The Cancer Genome Atlas demonstrate that TTMFN can achieve the best performance or competitive results compared to the state-of-the-art methods in predicting the overall survival of patients.
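A rough sketch of the two-stream co-attention fusion and multi-head attention pooling described above, with illustrative dimensions and a placeholder linear risk head (an assumption-laden sketch, not the authors' code):

```python
import torch
import torch.nn as nn


class TwoStreamFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-attention in both directions between the two modalities.
        self.path_to_gene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gene_to_path = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Attention pooling: a learned query summarizes each token sequence.
        self.pool_query = nn.Parameter(torch.randn(1, 1, dim))
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.risk_head = nn.Linear(2 * dim, 1)

    def _attn_pool(self, tokens):
        q = self.pool_query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.pool(q, tokens, tokens)
        return pooled.squeeze(1)                       # (B, dim)

    def forward(self, path_tokens, gene_tokens):
        # path_tokens: (B, Np, dim) pathology patch features
        # gene_tokens: (B, Ng, dim) gene-expression features
        p, _ = self.path_to_gene(path_tokens, gene_tokens, gene_tokens)
        g, _ = self.gene_to_path(gene_tokens, path_tokens, path_tokens)
        fused = torch.cat([self._attn_pool(p), self._attn_pool(g)], dim=-1)
        return self.risk_head(fused).squeeze(-1)       # (B,) predicted risk


model = TwoStreamFusion()
risk = model(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(risk.shape)  # torch.Size([2])
```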

PICS in Pics: Physics Informed Contour Selection for Rapid Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.07002
  • repo_url: None
  • paper_authors: Vikas Dwivedi, Balaji Srinivasan, Ganapathy Krishnamurthi
  • for: To facilitate image annotation, this paper introduces an interpretable, physics-informed algorithm for rapid image segmentation that does not rely on abundant, high-quality labeled data.
  • methods: The paper proposes Physics Informed Contour Selection (PICS), an interpretable segmentation algorithm inspired by physics-informed neural networks (PINNs) and the active contour (snake) model. PICS uses cubic splines instead of a deep neural network as its basis function, making it fast and computationally lightweight; its training parameters directly represent the control knots of the segmentation curve, and it minimizes a region-based loss functional directly rather than deriving and solving the Euler-Lagrange equations (see the sketch after this entry).
  • results: Experiments on 3D left-ventricle segmentation from a publicly available cardiac dataset show that PICS segments quickly and efficiently, and that transfer learning can further accelerate segmentation. A new convexity-preserving loss term encoding the shape of the left ventricle further improves segmentation quality. Overall, PICS introduces novelties in network architecture, transfer learning, and physics-inspired losses, showing promising results and potential for further refinement.
    Abstract Effective training of deep image segmentation models is challenging due to the need for abundant, high-quality annotations. Generating annotations is laborious and time-consuming for human experts, especially in medical image segmentation. To facilitate image annotation, we introduce Physics Informed Contour Selection (PICS) - an interpretable, physics-informed algorithm for rapid image segmentation without relying on labeled data. PICS draws inspiration from physics-informed neural networks (PINNs) and an active contour model called snake. It is fast and computationally lightweight because it employs cubic splines instead of a deep neural network as a basis function. Its training parameters are physically interpretable because they directly represent control knots of the segmentation curve. Traditional snakes involve minimization of the edge-based loss functionals by deriving the Euler-Lagrange equation followed by its numerical solution. However, PICS directly minimizes the loss functional, bypassing the Euler Lagrange equations. It is the first snake variant to minimize a region-based loss function instead of traditional edge-based loss functions. PICS uniquely models the three-dimensional (3D) segmentation process with an unsteady partial differential equation (PDE), which allows accelerated segmentation via transfer learning. To demonstrate its effectiveness, we apply PICS for 3D segmentation of the left ventricle on a publicly available cardiac dataset. While doing so, we also introduce a new convexity-preserving loss term that encodes the shape information of the left ventricle to enhance PICS's segmentation quality. Overall, PICS presents several novelties in network architecture, transfer learning, and physics-inspired losses for image segmentation, thereby showing promising outcomes and potential for further refinement.
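A minimal sketch of the optimization pattern PICS relies on: contour control knots are trainable parameters and the loss functional is minimized directly by gradient descent, with no Euler-Lagrange derivation. The linear knot interpolation and the simple intensity-based loss below are stand-ins; PICS itself uses cubic splines, a region-based loss, and a convexity-preserving term.

```python
import math
import torch
import torch.nn.functional as F

H = W = 128
image = torch.rand(1, 1, H, W)       # stand-in image with values in [0, 1]

# 8 control knots initialized on a circle, in normalized [-1, 1] coordinates.
theta = torch.linspace(0, 2 * math.pi, 9)[:-1]
knots = torch.nn.Parameter(0.5 * torch.stack([torch.cos(theta), torch.sin(theta)], dim=1))


def contour_points(knots, samples_per_segment=10):
    """Densify the closed contour by linear interpolation between knots
    (a cubic-spline basis would be used in practice)."""
    nxt = torch.roll(knots, shifts=-1, dims=0)
    t = torch.linspace(0, 1, samples_per_segment).view(-1, 1, 1)
    pts = (1 - t) * knots.unsqueeze(0) + t * nxt.unsqueeze(0)
    return pts.reshape(-1, 2)          # (num_knots * samples_per_segment, 2)


optimizer = torch.optim.Adam([knots], lr=1e-2)
for _ in range(100):
    pts = contour_points(knots)
    grid = pts.reshape(1, -1, 1, 2)    # grid_sample expects (N, Hout, Wout, 2)
    vals = F.grid_sample(image, grid, align_corners=True)
    loss = vals.mean()                 # placeholder: pull the contour toward dark pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```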