cs.CV - 2023-07-04

Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana

  • paper_url: http://arxiv.org/abs/2307.01767
  • repo_url: None
  • paper_authors: Darlington Akogo, Issah Samori, Cyril Akafia, Harriet Fiagbor, Andrews Kangah, Donald Kwame Asiedu, Kwabena Fuachie, Luis Oala
  • for: The goal of this work is to improve agricultural productivity and food security.
  • methods: Drone-collected data and machine learning are used to determine crop stressors.
  • results: The data, model, and application were developed jointly and made available to local farmers via a desktop application.
    Abstract The Ghana Cashew Disease Identification with Artificial Intelligence (CADI AI) project demonstrates the importance of sound data work as a precondition for the delivery of useful, localized datacentric solutions for public good tasks such as agricultural productivity and food security. Drone collected data and machine learning are utilized to determine crop stressors. Data, model and the final app are developed jointly and made available to local farmers via a desktop application.

Pretraining is All You Need: A Multi-Atlas Enhanced Transformer Framework for Autism Spectrum Disorder Classification

  • paper_url: http://arxiv.org/abs/2307.01759
  • repo_url: https://github.com/lugges991/metaformer
  • paper_authors: Lucas Mahler, Qi Wang, Julius Steiglechner, Florian Birk, Samuel Heczko, Klaus Scheffler, Gabriele Lohmann
  • for: This work proposes a novel Multi-Atlas Enhanced Transformer framework (METAFormer) for Autism Spectrum Disorder (ASD) classification.
  • methods: The framework uses resting-state functional MRI data from the ABIDE I dataset and a multi-atlas approach (AAL, CC200, and DOS160 atlases), with self-supervised pretraining that reconstructs masked input values (a hedged sketch of this pretraining follows below).
  • results: METAFormer surpasses state-of-the-art performance on the ABIDE I dataset, with an average accuracy of 83.7% and an AUC of 0.832.
    Abstract Autism spectrum disorder (ASD) is a prevalent psychiatric condition characterized by atypical cognitive, emotional, and social patterns. Timely and accurate diagnosis is crucial for effective interventions and improved outcomes in individuals with ASD. In this study, we propose a novel Multi-Atlas Enhanced Transformer framework, METAFormer, for ASD classification. Our framework utilizes resting-state functional magnetic resonance imaging data from the ABIDE I dataset, comprising 406 ASD and 476 typical control (TC) subjects. METAFormer employs a multi-atlas approach, where flattened connectivity matrices from the AAL, CC200, and DOS160 atlases serve as input to the transformer encoder. Notably, we demonstrate that self-supervised pretraining, involving the reconstruction of masked values from the input, significantly enhances classification performance without the need for additional or separate training data. Through stratified cross-validation, we evaluate the proposed framework and show that it surpasses state-of-the-art performance on the ABIDE I dataset, with an average accuracy of 83.7% and an AUC-score of 0.832. The code for our framework is available at https://github.com/Lugges991/METAFormer
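A minimal, hypothetical PyTorch sketch of the masked-value reconstruction pretraining described above. Treating each connectivity-matrix row as a token, the layer sizes, and the masking ratio are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConnectivityTransformer(nn.Module):
    """Toy encoder over one connectivity matrix per subject (rows serve as tokens)."""
    def __init__(self, n_rois=116, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_rois, d_model)               # one ROI row -> one token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(d_model, n_rois)          # head used only for pretraining

    def forward(self, conn):                                   # conn: (batch, n_rois, n_rois)
        return self.reconstruct(self.encoder(self.embed(conn)))

def masked_pretrain_step(model, conn, mask_ratio=0.15):
    """Self-supervised step: zero out random connectivity values and reconstruct them."""
    mask = torch.rand_like(conn) < mask_ratio
    pred = model(conn.masked_fill(mask, 0.0))
    return ((pred - conn)[mask] ** 2).mean()                   # loss only on the masked entries
```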

K-complex Detection Using Fourier Spectrum Analysis In EEG

  • paper_url: http://arxiv.org/abs/2307.01754
  • repo_url: None
  • paper_authors: Alexey Protopopov
  • for: automatic K-complex detection in EEG records
  • methods: based on the fast Fourier transform, not using neural networks (a toy FFT-based detection sketch follows below)
  • results: comparable or superior quality to previous methods, including those using neural networks, with less computational power required.
    Abstract K-complexes are an important marker of brain activity and are used both in clinical practice to perform sleep scoring, and in research. However, due to the size of electroencephalography (EEG) records, as well as the subjective nature of K-complex detection performed by somnologists, it is reasonable to automate K-complex detection. Previous works in this field of research have relied on the values of true positive rate and false positive rate to quantify the effectiveness of proposed methods, however this set of metrics may be misleading. The objective of the present research is to find a more accurate set of metrics and use them to develop a new method of K-complex detection, which would not rely on neural networks. Thus, the present article proposes two new methods for K-complex detection based on the fast Fourier transform. The results achieved demonstrated that the proposed methods offered a quality of K-complex detection that is either similar or superior to the quality of the methods demonstrated in previous works, including the methods employing neural networks, while requiring less computational power, meaning that K-complex detection does not require the use of neural networks. The proposed methods were evaluated using a new set of metrics, which is more representative of the quality of K-complex detection.
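As a rough illustration of the spectrum-based idea (not the paper's actual algorithm), the sketch below slides a window over one EEG channel, takes the FFT, and flags windows dominated by low-frequency power. The band limits, window length, and threshold are assumptions chosen for illustration.

```python
import numpy as np

def detect_k_complex_candidates(eeg, fs, win_sec=1.0, step_sec=0.25,
                                band=(0.5, 2.0), ratio_thresh=0.6):
    """eeg: 1-D signal (one channel); fs: sampling rate in Hz. Returns start samples."""
    win, step = int(win_sec * fs), int(step_sec * fs)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    detections = []
    for start in range(0, len(eeg) - win + 1, step):
        seg = eeg[start:start + win]
        power = np.abs(np.fft.rfft(seg - seg.mean())) ** 2
        if power[in_band].sum() / (power.sum() + 1e-12) > ratio_thresh:
            detections.append(start)          # candidate K-complex onset (sample index)
    return detections
```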

SRCD: Semantic Reasoning with Compound Domains for Single-Domain Generalized Object Detection

  • paper_url: http://arxiv.org/abs/2307.01750
  • repo_url: None
  • paper_authors: Zhijie Rao, Jingcai Guo, Luyao Tang, Yue Huang, Xinghao Ding, Song Guo
  • for: This paper proposes a single-domain generalized object detection (Single-DGOD) framework that learns and maintains the semantic structures of self-augmented compound cross-domain samples to enhance the model's generalization ability.
  • methods: The proposed SRCD consists of two main components: a texture-based self-augmentation (TBSA) module and a local-global semantic reasoning (LGSR) module. TBSA uses a light-yet-efficient self-augmentation to remove label-irrelevant attributes such as light, shadow, and color at the image level, while LGSR models semantic relationships among instance features to uncover and maintain intrinsic semantic structures.
  • results: Extensive experiments on multiple benchmarks demonstrate the effectiveness of the proposed SRCD.
    Abstract This paper provides a novel framework for single-domain generalized object detection (i.e., Single-DGOD), where we are interested in learning and maintaining the semantic structures of self-augmented compound cross-domain samples to enhance the model's generalization ability. Different from DGOD trained on multiple source domains, Single-DGOD is far more challenging to generalize well to multiple target domains with only one single source domain. Existing methods mostly adopt a similar treatment from DGOD to learn domain-invariant features by decoupling or compressing the semantic space. However, there may have two potential limitations: 1) pseudo attribute-label correlation, due to extremely scarce single-domain data; and 2) the semantic structural information is usually ignored, i.e., we found the affinities of instance-level semantic relations in samples are crucial to model generalization. In this paper, we introduce Semantic Reasoning with Compound Domains (SRCD) for Single-DGOD. Specifically, our SRCD contains two main components, namely, the texture-based self-augmentation (TBSA) module, and the local-global semantic reasoning (LGSR) module. TBSA aims to eliminate the effects of irrelevant attributes associated with labels, such as light, shadow, color, etc., at the image level by a light-yet-efficient self-augmentation. Moreover, LGSR is used to further model the semantic relationships on instance features to uncover and maintain the intrinsic semantic structures. Extensive experiments on multiple benchmarks demonstrate the effectiveness of the proposed SRCD.

Ben-ge: Extending BigEarthNet with Geographical and Environmental Data

  • paper_url: http://arxiv.org/abs/2307.01741
  • repo_url: https://github.com/hsg-aiml/ben-ge
  • paper_authors: Michael Mommert, Nicolas Kesseli, Joëlle Hanna, Linus Scheibenreif, Damian Borth, Begüm Demir
  • for: This work explores the analysis of multi-modal Earth observation data and shows that combining different data modalities improves accuracy on downstream tasks.
  • methods: The ben-ge dataset supplements BigEarthNet-MM with freely and globally available geographical and environmental data; the different modalities are combined for the downstream tasks of patch-based land-use/land-cover classification and land-use/land-cover segmentation.
  • results: Combining data modalities improves downstream accuracy, and ben-ge is expected to serve as a test bed for fully supervised and self-supervised Earth observation applications.
    Abstract Deep learning methods have proven to be a powerful tool in the analysis of large amounts of complex Earth observation data. However, while Earth observation data are multi-modal in most cases, only single or few modalities are typically considered. In this work, we present the ben-ge dataset, which supplements the BigEarthNet-MM dataset by compiling freely and globally available geographical and environmental data. Based on this dataset, we showcase the value of combining different data modalities for the downstream tasks of patch-based land-use/land-cover classification and land-use/land-cover segmentation. ben-ge is freely available and expected to serve as a test bed for fully supervised and self-supervised Earth observation applications.

Synchronous Image-Label Diffusion Probability Model with Application to Stroke Lesion Segmentation on Non-contrast CT

  • paper_url: http://arxiv.org/abs/2307.01740
  • repo_url: None
  • paper_authors: Jianhai Zhang, Tonghua Wan, Ethan MacDonald, Bijoy Menon, Aravind Ganesh, Qiu Wu
  • for: This paper proposes a Synchronous image-label Diffusion Probability Model (SDPM), based on a Markov diffusion process, for stroke lesion segmentation on non-contrast CT (NCCT) scans.
  • methods: The model is fully based on a latent variable model (LVM), and an additional network stream, parallel to the noise prediction stream, provides initial noisy label estimates for efficiently inferring the final labels. By optimizing the specified variational bounds, the trained model can produce multiple label estimates for a given noisy input image.
  • results: Evaluated on three stroke lesion datasets (one public and two private) against several U-net- and transformer-based segmentation methods, the proposed SDPM achieves state-of-the-art performance. The code is publicly available.
    Abstract Stroke lesion volume is a key radiologic measurement for assessing the prognosis of Acute Ischemic Stroke (AIS) patients, which is challenging to be automatically measured on Non-Contrast CT (NCCT) scans. Recent diffusion probabilistic models have shown potentials of being used for image segmentation. In this paper, a novel Synchronous image-label Diffusion Probability Model (SDPM) is proposed for stroke lesion segmentation on NCCT using Markov diffusion process. The proposed SDPM is fully based on a Latent Variable Model (LVM), offering a complete probabilistic elaboration. An additional net-stream, parallel with a noise prediction stream, is introduced to obtain initial noisy label estimates for efficiently inferring the final labels. By optimizing the specified variational boundaries, the trained model can infer multiple label estimates for reference given the input images with noises. The proposed model was assessed on three stroke lesion datasets including one public and two private datasets. Compared to several U-net and transformer-based segmentation methods, our proposed SDPM model is able to achieve state-of-the-art performance. The code is publicly available.

Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis

  • paper_url: http://arxiv.org/abs/2307.01738
  • repo_url: None
  • paper_authors: Changjian Shui, Justin Szeto, Raghav Mehta, Douglas L. Arnold, Tal Arbel
  • for: This work aims to improve the trustworthiness of deep learning medical imaging models deployed in real-world clinical practice by mitigating calibration bias.
  • methods: A novel two-stage method, Cluster-Focal, first identifies poorly calibrated samples, clusters them into groups, and then applies a group-wise focal loss to reduce calibration bias, without requiring subgroup attributes during training (a hypothetical sketch follows below).
  • results: The method controls calibration error in the worst-performing subgroups while preserving prediction performance, outperforming recent baselines.
    Abstract Trustworthy deployment of deep learning medical imaging models into real-world clinical practice requires that they be calibrated. However, models that are well calibrated overall can still be poorly calibrated for a sub-population, potentially resulting in a clinician unwittingly making poor decisions for this group based on the recommendations of the model. Although methods have been shown to successfully mitigate biases across subgroups in terms of model accuracy, this work focuses on the open problem of mitigating calibration biases in the context of medical image analysis. Our method does not require subgroup attributes during training, permitting the flexibility to mitigate biases for different choices of sensitive attributes without re-training. To this end, we propose a novel two-stage method: Cluster-Focal to first identify poorly calibrated samples, cluster them into groups, and then introduce group-wise focal loss to improve calibration bias. We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients. In addition to considering traditional sensitive attributes (e.g. age, sex) with demographic subgroups, we also consider biases among groups with different image-derived attributes, such as lesion load, which are required in medical image analysis. Our results demonstrate that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, and outperforming recent baselines.
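A hypothetical sketch of the two-stage idea: flag poorly calibrated samples with a simple confidence-versus-correctness gap, cluster them on their features, and then apply a focal loss whose focusing strength depends on the sample's group. The calibration proxy, number of clusters, and per-group gammas are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_poorly_calibrated(probs, labels, features, n_clusters=4, gap_thresh=0.3):
    """Stage 1: cluster samples whose confidence poorly matches their correctness."""
    conf, pred = probs.max(dim=1)
    gap = (conf - (pred == labels).float()).abs()          # crude per-sample calibration proxy
    idx = torch.nonzero(gap > gap_thresh).squeeze(1)
    groups = torch.zeros(len(labels), dtype=torch.long)    # group 0 = well calibrated
    if len(idx) >= n_clusters:
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(features[idx].cpu().numpy())
        groups[idx] = torch.as_tensor(km.labels_, dtype=torch.long) + 1
    return groups

def group_focal_loss(logits, labels, groups, group_gamma):
    """Stage 2: focal loss with a group-dependent focusing parameter gamma."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    p_t = torch.exp(-ce)                                    # probability of the true class
    gamma = group_gamma[groups]                             # e.g. larger gamma for worse groups
    return ((1.0 - p_t) ** gamma * ce).mean()
```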

Interpretable Computer Vision Models through Adversarial Training: Unveiling the Robustness-Interpretability Connection

  • paper_url: http://arxiv.org/abs/2307.02500
  • repo_url: https://github.com/delyan-boychev/pytorch_trainers_interpretability
  • paper_authors: Delyan Boychev
  • for: This work evaluates the effects of adversarial training, which produces robust models that are less vulnerable to adversarial attacks, and its connection to interpretability.
  • methods: Models are examined extensively with local feature-importance methods (SHAP, Integrated Gradients) and feature visualization techniques (Representation Inversion, Class Specific Image Generation); a small Integrated Gradients sketch is given below.
  • results: Adversarial training makes computer vision models more interpretable: their learned features are closer to human-meaningful ones, the models are more robust to adversarial attacks, and they focus on distinctive image regions that support their predictions.
    Abstract With the perpetual increase of complexity of the state-of-the-art deep neural networks, it becomes a more and more challenging task to maintain their interpretability. Our work aims to evaluate the effects of adversarial training utilized to produce robust models - less vulnerable to adversarial attacks. It has been shown to make computer vision models more interpretable. Interpretability is as essential as robustness when we deploy the models to the real world. To prove the correlation between these two problems, we extensively examine the models using local feature-importance methods (SHAP, Integrated Gradients) and feature visualization techniques (Representation Inversion, Class Specific Image Generation). Standard models, compared to robust are more susceptible to adversarial attacks, and their learned representations are less meaningful to humans. Conversely, these models focus on distinctive regions of the images which support their predictions. Moreover, the features learned by the robust model are closer to the real ones.
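For reference, a minimal sketch of Integrated Gradients, one of the attribution methods named above, approximating the path integral with a Riemann sum; `model` is any differentiable classifier and `target` the class index to explain.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Attribution of model(x)[target] to the input features, same shape as x."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target]
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps
```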

Graph-Ensemble Learning Model for Multi-label Skin Lesion Classification using Dermoscopy and Clinical Images

  • paper_url: http://arxiv.org/abs/2307.01704
  • repo_url: None
  • paper_authors: Peng Tang, Yang Nan, Tobias Lasser
  • for: This work develops a multi-label classification method based on multi-modal data (dermoscopy and clinical images) to improve the accuracy of skin lesion diagnosis.
  • methods: A graph convolutional network (GCN) exploits the prior co-occurrence between label categories, and the Graph-Ensemble Learning Model (GELN) adaptively fuses the GCN predictions with those of the multi-modal fusion model by weighted averaging (sketched below) to obtain higher classification accuracy.
  • results: Experiments show that the proposed GELN consistently improves classification performance across datasets and achieves state-of-the-art performance in SPC and diagnosis classification.
    Abstract Many skin lesion analysis (SLA) methods recently focused on developing a multi-modal-based multi-label classification method due to two factors. The first is multi-modal data, i.e., clinical and dermoscopy images, which can provide complementary information to obtain more accurate results than single-modal data. The second one is that multi-label classification, i.e., seven-point checklist (SPC) criteria as an auxiliary classification task can not only boost the diagnostic accuracy of melanoma in the deep learning (DL) pipeline but also provide more useful functions to the clinical doctor as it is commonly used in clinical dermatologist's diagnosis. However, most methods only focus on designing a better module for multi-modal data fusion; few methods explore utilizing the label correlation between SPC and skin disease for performance improvement. This study fills the gap that introduces a Graph Convolution Network (GCN) to exploit prior co-occurrence between each category as a correlation matrix into the DL model for the multi-label classification. However, directly applying GCN degraded the performances in our experiments; we attribute this to the weak generalization ability of GCN in the scenario of insufficient statistical samples of medical data. We tackle this issue by proposing a Graph-Ensemble Learning Model (GELN) that views the prediction from GCN as complementary information of the predictions from the fusion model and adaptively fuses them by a weighted averaging scheme, which can utilize the valuable information from GCN while avoiding its negative influences as much as possible. To evaluate our method, we conduct experiments on public datasets. The results illustrate that our GELN can consistently improve the classification performance on different datasets and that the proposed method can achieve state-of-the-art performance in SPC and diagnosis classification.
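A hypothetical sketch of the adaptive weighted-averaging fusion described above: the GCN predictions are treated as complementary information and blended with the fusion model's predictions. The per-label sigmoid weighting is an illustrative assumption.

```python
import torch
import torch.nn as nn

class LogitEnsemble(nn.Module):
    def __init__(self, n_labels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_labels))    # learned per-label mixing weight

    def forward(self, fusion_logits, gcn_logits):           # both: (batch, n_labels)
        w = torch.sigmoid(self.alpha)                        # keep each weight in (0, 1)
        return (1 - w) * fusion_logits + w * gcn_logits      # weighted average per label
```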

Augment Features Beyond Color for Domain Generalized Segmentation

  • paper_url: http://arxiv.org/abs/2307.01703
  • repo_url: None
  • paper_authors: Qiyu Sun, Pavlo Melnyk, Michael Felsberg, Yang Tang
  • for: This paper proposes a domain generalized semantic segmentation (DGSS) method that generalizes to unseen target domains without any target data during training.
  • methods: The method consists of two modules: random image color augmentation (RICA) and random feature distribution augmentation (RFDA). RICA converts images from RGB to the CIELAB color model and randomizes the color maps in a perception-based way (a rough sketch follows below); RFDA extends the augmentation beyond color into feature space with a CycleGAN-based generative network to further enrich the data.
  • results: Extensive experiments generalizing from the synthetic GTAV and SYNTHIA datasets to the real Cityscapes, BDDS, and Mapillary datasets show state-of-the-art DGSS performance.
    Abstract Domain generalized semantic segmentation (DGSS) is an essential but highly challenging task, in which the model is trained only on source data and any target data is not available. Previous DGSS methods can be partitioned into augmentation-based and normalization-based ones. The former either introduces extra biased data or only conducts channel-wise adjustments for data augmentation, and the latter may discard beneficial visual information, both of which lead to limited performance in DGSS. Contrarily, our method performs inter-channel transformation and meanwhile evades domain-specific biases, thus diversifying data and enhancing model generalization performance. Specifically, our method consists of two modules: random image color augmentation (RICA) and random feature distribution augmentation (RFDA). RICA converts images from RGB to the CIELAB color model and randomizes color maps in a perception-based way for image enhancement purposes. We further this augmentation by extending it beyond color to feature space using a CycleGAN-based generative network, which complements RICA and further boosts generalization capability. We conduct extensive experiments, and the generalization results from the synthetic GTAV and SYNTHIA to the real Cityscapes, BDDS, and Mapillary datasets show that our method achieves state-of-the-art performance in DGSS.
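A rough, hypothetical sketch of CIELAB-space color augmentation in the spirit of RICA: convert RGB to LAB, randomly perturb the lightness and chroma channels, and convert back. The perturbation ranges are illustrative assumptions, not the paper's perception-based scheme.

```python
import numpy as np
from skimage import color

def random_lab_augment(rgb, rng=np.random):
    """rgb: float image in [0, 1] with shape (H, W, 3)."""
    lab = color.rgb2lab(rgb)
    lab[..., 0] = np.clip(lab[..., 0] * rng.uniform(0.8, 1.2)
                          + rng.uniform(-10, 10), 0, 100)        # jitter lightness L
    for c in (1, 2):                                             # jitter chroma channels a, b
        lab[..., c] = np.clip(lab[..., c] * rng.uniform(0.8, 1.2)
                              + rng.uniform(-10, 10), -128, 127)
    return np.clip(color.lab2rgb(lab), 0.0, 1.0)
```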

Spike-driven Transformer

  • paper_url: http://arxiv.org/abs/2307.01694
  • repo_url: https://github.com/biclab/spike-driven-transformer
  • paper_authors: Man Yao, Jiakui Hu, Zhaokun Zhou, Li Yuan, Yonghong Tian, Bo Xu, Guoqi Li
  • for: The paper proposes a new deep learning model, the Spike-driven Transformer, which incorporates the spike-driven paradigm into the Transformer architecture.
  • methods: The Spike-driven Transformer has four unique properties: event-driven computation, binary spike communication, self-attention with linear complexity at both token and channel dimensions, and Query/Key/Value operations reduced to mask and addition (a loose sketch of the spike-driven self-attention follows below).
  • results: The Spike-driven Transformer achieves 77.1% top-1 accuracy on ImageNet-1K, a state-of-the-art result in the SNN field.
    Abstract Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties: 1) Event-driven, no calculation is triggered when the input of Transformer is zero; 2) Binary spike communication, all matrix multiplications associated with the spike matrix can be transformed into sparse additions; 3) Self-attention with linear complexity at both token and channel dimensions; 4) The operations between spike-form Query, Key, and Value are mask and addition. Together, there are only sparse addition operations in the Spike-driven Transformer. To this end, we design a novel Spike-Driven Self-Attention (SDSA), which exploits only mask and addition operations without any multiplication, and thus having up to $87.2\times$ lower computation energy than vanilla self-attention. Especially in SDSA, the matrix multiplication between Query, Key, and Value is designed as the mask operation. In addition, we rearrange all residual connections in the vanilla Transformer before the activation functions to ensure that all neurons transmit binary spike signals. It is shown that the Spike-driven Transformer can achieve 77.1\% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field. The source code is available at https://github.com/BICLab/Spike-Driven-Transformer.
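A loose, hypothetical sketch of spike-driven self-attention using only masking (element-wise products with binary spikes) and addition. Q, K, V are binary spike tensors of shape (tokens, channels); the fixed threshold stands in for a spiking-neuron firing rule and is an illustrative assumption.

```python
import torch

def spike_driven_self_attention(q, k, v, threshold=1.0):
    qk = q * k                                     # Hadamard "mask": stays binary
    channel_score = qk.sum(dim=0, keepdim=True)    # pure addition over the token dimension
    channel_mask = (channel_score >= threshold).to(v.dtype)   # binary firing decision
    return v * channel_mask                        # mask V channel-wise

# usage sketch with random binary spikes
q = (torch.rand(8, 16) > 0.5).float()
k = (torch.rand(8, 16) > 0.5).float()
v = (torch.rand(8, 16) > 0.5).float()
out = spike_driven_self_attention(q, k, v)         # (8, 16) binary output
```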

Training Energy-Based Models with Diffusion Contrastive Divergences

  • paper_url: http://arxiv.org/abs/2307.01668
  • repo_url: None
  • paper_authors: Weijian Luo, Hao Jiang, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Zhihua Zhang
  • for: This paper focuses on improving the efficiency and accuracy of Energy-Based Models (EBMs) for generative modeling, specifically addressing the trade-off between computational burden and validity in Contrastive Divergence (CD) training.
  • methods: The authors propose a family of Diffusion Contrastive Divergence (DCD) methods that replace the Langevin dynamic used in CD with other EBM-parameter-free diffusion processes, leading to more efficient and accurate training of EBMs.
  • results: The proposed DCD methods outperform CD in computational efficiency and accuracy across synthetic data modeling, high-dimensional image denoising, and image generation; notably, DCD can train an EBM to generate the CelebA 32x32 dataset, comparable to existing EBMs.
    Abstract Energy-Based Models (EBMs) have been widely used for generative modeling. Contrastive Divergence (CD), a prevailing training objective for EBMs, requires sampling from the EBM with Markov Chain Monte Carlo methods (MCMCs), which leads to an irreconcilable trade-off between the computational burden and the validity of the CD. Running MCMCs till convergence is computationally intensive. On the other hand, short-run MCMC brings in an extra non-negligible parameter gradient term that is difficult to handle. In this paper, we provide a general interpretation of CD, viewing it as a special instance of our proposed Diffusion Contrastive Divergence (DCD) family. By replacing the Langevin dynamic used in CD with other EBM-parameter-free diffusion processes, we propose a more efficient divergence. We show that the proposed DCDs are both more computationally efficient than the CD and are not limited to a non-negligible gradient term. We conduct intensive experiments, including both synthesis data modeling and high-dimensional image denoising and generation, to show the advantages of the proposed DCDs. On the synthetic data learning and image denoising experiments, our proposed DCD outperforms CD by a large margin. In image generation experiments, the proposed DCD is capable of training an energy-based model for generating the Celab-A $32\times 32$ dataset, which is comparable to existing EBMs.

Sensors and Systems for Monitoring Mental Fatigue: A systematic review

  • paper_url: http://arxiv.org/abs/2307.01666
  • repo_url: None
  • paper_authors: Prabin Sharma, Joanna C. Justus, Govinda R. Poudel
  • for: This review provides a critical summary of theoretical models of mental fatigue, describes key enabling sensor technologies, and systematically reviews recent studies using biosensor-based systems to track mental fatigue in humans.
  • methods: A systematic search and review of recent literature yielded 57 studies (N=1082), the majority of which used electroencephalography (EEG)-based sensors to track mental fatigue.
  • results: EEG-based sensors provide moderate to good sensitivity for fatigue detection, and high-density EEG sensors offer no incremental benefit. The review closes with a critical discussion of integrating wearable EEG and ambient sensors for real-world monitoring, and of the future work needed to adapt these technologies for widespread fatigue monitoring in semi-autonomous and autonomous industries.
    Abstract Mental fatigue is a leading cause of motor vehicle accidents, medical errors, loss of workplace productivity, and student disengagements in e-learning environment. Development of sensors and systems that can reliably track mental fatigue can prevent accidents, reduce errors, and help increase workplace productivity. This review provides a critical summary of theoretical models of mental fatigue, a description of key enabling sensor technologies, and a systematic review of recent studies using biosensor-based systems for tracking mental fatigue in humans. We conducted a systematic search and review of recent literature which focused on detection and tracking of mental fatigue in humans. The search yielded 57 studies (N=1082), majority of which used electroencephalography (EEG) based sensors for tracking mental fatigue. We found that EEG-based sensors can provide a moderate to good sensitivity for fatigue detection. Notably, we found no incremental benefit of using high-density EEG sensors for application in mental fatigue detection. Given the findings, we provide a critical discussion on the integration of wearable EEG and ambient sensors in the context of achieving real-world monitoring. Future work required to advance and adapt the technologies toward widespread deployment of wearable sensors and systems for fatigue monitoring in semi-autonomous and autonomous industries is examined.

Exploring Transformers for On-Line Handwritten Signature Verification

  • paper_url: http://arxiv.org/abs/2307.01663
  • repo_url: None
  • paper_authors: Pietro Melzi, Ruben Tolosana, Ruben Vera-Rodriguez, Paula Delgado-Santos, Giuseppe Stragapede, Julian Fierrez, Javier Ortega-Garcia
  • for: This study evaluates the suitability of architectures based on recent Transformers for on-line handwritten signature verification.
  • methods: Four configurations are studied: two rely on the Vanilla Transformer encoder, and the other two have previously been applied successfully to gait and activity recognition. All are evaluated under the experimental protocol of the SVC-onGoing competition.
  • results: The experimental results are promising and support the use of Transformers for on-line signature verification.
    Abstract The application of mobile biometrics as a user-friendly authentication method has increased in the last years. Recent studies have proposed novel behavioral biometric recognition systems based on Transformers, which currently outperform the state of the art in several application scenarios. On-line handwritten signature verification aims to verify the identity of subjects, based on their biometric signatures acquired using electronic devices such as tablets or smartphones. This paper investigates the suitability of architectures based on recent Transformers for on-line signature verification. In particular, four different configurations are studied, two of them rely on the Vanilla Transformer encoder, and the two others have been successfully applied to the tasks of gait and activity recognition. We evaluate the four proposed configurations according to the experimental protocol proposed in the SVC-onGoing competition. The results obtained in our experiments are promising, and promote the use of Transformers for on-line signature verification.

Task Planning Support for Arborists and Foresters: Comparing Deep Learning Approaches for Tree Inventory and Tree Vitality Assessment Based on UAV-Data

  • paper_url: http://arxiv.org/abs/2307.01651
  • repo_url: None
  • paper_authors: Jonas-Dario Troles, Richard Nieding, Sonia Simons, Ute Schmid
  • for: This paper aims to optimize workflows and increase productivity for arborists and foresters who care for trees in urban areas and forests.
  • methods: The approach uses RGB and multispectral UAV data, as well as multispectral satellite data and soil moisture sensors, to create tree inventories and vitality assessments.
  • results: The approach generates helpful information and improves task planning for arborists and foresters, allowing them to better care for trees in urban areas and forests.
    Abstract Climate crisis and correlating prolonged, more intense periods of drought threaten tree health in cities and forests. In consequence, arborists and foresters suffer from increasing workloads and, in the best case, a consistent but often declining workforce. To optimise workflows and increase productivity, we propose a novel open-source end-to-end approach that generates helpful information and improves task planning of those who care for trees in and around cities. Our approach is based on RGB and multispectral UAV data, which is used to create tree inventories of city parks and forests and to deduce tree vitality assessments through statistical indices and Deep Learning. Due to EU restrictions regarding flying drones in urban areas, we will also use multispectral satellite data and fifteen soil moisture sensors to extend our tree vitality-related basis of data. Furthermore, Bamberg already has a georeferenced tree cadastre of around 15,000 solitary trees in the city area, which is also used to generate helpful information. All mentioned data is then joined and visualised in an interactive web application allowing arborists and foresters to generate individual and flexible evaluations, thereby improving daily task planning.

In-Domain Self-Supervised Learning Can Lead to Improvements in Remote Sensing Image Classification

  • paper_url: http://arxiv.org/abs/2307.01645
  • repo_url: None
  • paper_authors: Ivica Dimitrovski, Ivan Kitanovski, Nikola Simidjievski, Dragi Kocev
  • for: This study investigates self-supervised learning (SSL) for remote sensing image classification, leveraging large amounts of unlabeled data to learn image representations.
  • methods: Models are pre-trained on the Million AID dataset by formulating auxiliary tasks that create pseudo-labels for the unlabeled data, using the iBOT framework with Vision Transformers (ViT), and then fine-tuned on downstream scene classification tasks.
  • results: Experiments across 14 diverse remote sensing image scene classification datasets show that in-domain SSL pre-training (iBOT with ViT) yields better predictive performance than supervised pre-training of ViT on ImageNet.
    Abstract Self-supervised learning (SSL) has emerged as a promising approach for remote sensing image classification due to its ability to leverage large amounts of unlabeled data. In contrast to traditional supervised learning, SSL aims to learn representations of data without the need for explicit labels. This is achieved by formulating auxiliary tasks that can be used to create pseudo-labels for the unlabeled data and learn pre-trained models. The pre-trained models can then be fine-tuned on downstream tasks such as remote sensing image scene classification. The paper analyzes the effectiveness of SSL pre-training using Million AID - a large unlabeled remote sensing dataset on various remote sensing image scene classification datasets as downstream tasks. More specifically, we evaluate the effectiveness of SSL pre-training using the iBOT framework coupled with Vision transformers (ViT) in contrast to supervised pre-training of ViT using the ImageNet dataset. The comprehensive experimental work across 14 datasets with diverse properties reveals that in-domain SSL leads to improved predictive performance of models compared to the supervised counterparts.

ChildPlay: A New Benchmark for Understanding Children’s Gaze Behaviour

  • paper_url: http://arxiv.org/abs/2307.01630
  • repo_url: None
  • paper_authors: Samy Tafasca, Anshul Gupta, Jean-Marc Odobez
  • for: This work studies predicting the gaze targets of children and interacting adults, a step towards better diagnosis of developmental disorders.
  • methods: A new gaze target prediction model is proposed that is geometrically grounded by explicitly identifying scene parts within a person's 3D field of view (3DFoV), leveraging recent geometry-preserving depth inference methods, together with the new ChildPlay dataset of video clips with rich gaze annotations.
  • results: The model achieves state-of-the-art results on benchmark datasets and on ChildPlay; looking-at-faces prediction is much worse on children than on adults and can be significantly improved by fine-tuning with child gaze annotations.
    Abstract Gaze behaviors such as eye-contact or shared attention are important markers for diagnosing developmental disorders in children. While previous studies have looked at some of these elements, the analysis is usually performed on private datasets and is restricted to lab settings. Furthermore, all publicly available gaze target prediction benchmarks mostly contain instances of adults, which makes models trained on them less applicable to scenarios with young children. In this paper, we propose the first study for predicting the gaze target of children and interacting adults. To this end, we introduce the ChildPlay dataset: a curated collection of short video clips featuring children playing and interacting with adults in uncontrolled environments (e.g. kindergarten, therapy centers, preschools etc.), which we annotate with rich gaze information. We further propose a new model for gaze target prediction that is geometrically grounded by explicitly identifying the scene parts in the 3D field of view (3DFoV) of the person, leveraging recent geometry preserving depth inference methods. Our model achieves state of the art results on benchmark datasets and ChildPlay. Furthermore, results show that looking at faces prediction performance on children is much worse than on adults, and can be significantly improved by fine-tuning models using child gaze annotations. Our dataset and models will be made publicly available.

Learning Lie Group Symmetry Transformations with Neural Networks

  • paper_url: http://arxiv.org/abs/2307.01583
  • repo_url: https://github.com/victoria-klein/learning-lie-group-symmetries
  • paper_authors: Alex Gabel, Victoria Klein, Riccardo Valperga, Jeroen S. W. Lamb, Kevin Webster, Rick Quax, Efstratios Gavves
  • for: This work addresses detecting and quantifying symmetries present in datasets, which is useful for model selection, generative modeling, and data analysis.
  • methods: The approach discovers and characterizes unknown symmetries in a dataset, namely Lie group symmetry transformations beyond the usual rotation, scaling, and translation; the setting assumes each data point has been transformed by a one-parameter subgroup with a different parameter value per point (a small illustrative sketch of this setting follows below).
  • results: The results demonstrate the effectiveness of the approach in characterizing both the transformation group and the distribution of parameter values.
    Abstract The problem of detecting and quantifying the presence of symmetries in datasets is useful for model selection, generative modeling, and data analysis, amongst others. While existing methods for hard-coding transformations in neural networks require prior knowledge of the symmetries of the task at hand, this work focuses on discovering and characterizing unknown symmetries present in the dataset, namely, Lie group symmetry transformations beyond the traditional ones usually considered in the field (rotation, scaling, and translation). Specifically, we consider a scenario in which a dataset has been transformed by a one-parameter subgroup of transformations with different parameter values for each data point. Our goal is to characterize the transformation group and the distribution of the parameter values. The results showcase the effectiveness of the approach in both these settings.
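A small illustrative sketch (not the paper's method) of the data-generation setting described above: each 2D point is transformed by exp(t*A) for a one-parameter subgroup with generator A and a point-specific parameter t. The rotation generator and the parameter distribution are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])                        # so(2) generator: exp(t*A) rotates by angle t

def transform_dataset(points, rng=np.random.default_rng(0)):
    t = rng.uniform(0.0, np.pi, size=len(points))  # a different parameter per data point
    transformed = np.stack([expm(ti * A) @ p for ti, p in zip(t, points)])
    return transformed, t

points = np.random.default_rng(1).normal(size=(100, 2))
transformed, params = transform_dataset(points)    # task: recover A and the distribution of t
```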

A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

  • paper_url: http://arxiv.org/abs/2307.03270
  • repo_url: https://github.com/louisbearing/hmo-audio
  • paper_authors: Louis Airale, Dominique Vaufreydaz, Xavier Alameda-Pineda
  • for: This work aims to improve the synchrony between speech and facial dynamics in deep generative talking-head models, producing more natural head motion and audio-visual correlation.
  • methods: A multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN better handle short- and long-term correlation between speech and the dynamics of the head and lips; a stack of syncer models trained on multimodal input pyramids guides a multi-scale generator operating in the facial landmark domain.
  • results: Experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony, both in the landmark domain and in the image domain.
    Abstract Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.

Learning to reconstruct the bubble distribution with conductivity maps using Invertible Neural Networks and Error Diffusion

  • paper_url: http://arxiv.org/abs/2307.02496
  • repo_url: None
  • paper_authors: Nishant Kumar, Lukas Krause, Thomas Wondrak, Sven Eckert, Kerstin Eckert, Stefan Gumhold
  • for: This work supports eco-friendly hydrogen production by electrolysis, where gas bubbles generated during the process hinder reactions, reduce cell efficiency, and increase energy consumption.
  • methods: External magnetic sensors measure the magnetic field fluctuations induced by the gas bubbles, and the inverse problem of the Biot-Savart law is solved to estimate the conductivity inside the cell, and hence bubble size and location; Invertible Neural Networks (INNs) are used to reconstruct the conductivity field.
  • results: Qualitative results and a quantitative evaluation using random error diffusion show that the INN achieves far superior performance compared to Tikhonov regularization for reconstructing high-resolution conductivity maps.
    Abstract Electrolysis is crucial for eco-friendly hydrogen production, but gas bubbles generated during the process hinder reactions, reduce cell efficiency, and increase energy consumption. Additionally, these gas bubbles cause changes in the conductivity inside the cell, resulting in corresponding variations in the induced magnetic field around the cell. Therefore, measuring these gas bubble-induced magnetic field fluctuations using external magnetic sensors and solving the inverse problem of Biot-Savart Law allows for estimating the conductivity in the cell and, thus, bubble size and location. However, determining high-resolution conductivity maps from only a few induced magnetic field measurements is an ill-posed inverse problem. To overcome this, we exploit Invertible Neural Networks (INNs) to reconstruct the conductivity field. Our qualitative results and quantitative evaluation using random error diffusion show that INN achieves far superior performance compared to Tikhonov regularization.

EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

  • paper_url: http://arxiv.org/abs/2307.01545
  • repo_url: None
  • paper_authors: Cédric Picron, Tinne Tuytelaars
  • for: Improve the accuracy and efficiency of fine-grained instance segmentation.
  • methods: A Structure-Preserving Sparsity (SPS) method separately stores the active features, the passive features, and a dense 2D index map of feature indices, preserving the 2D spatial structure so that any 2D operation can still be performed (a toy sketch follows below).
  • results: EffSeg achieves performance comparable to RefineMask on COCO while reducing FLOPs by 71% and increasing FPS by 29%.
    Abstract Many two-stage instance segmentation heads predict a coarse 28x28 mask per instance, which is insufficient to capture the fine-grained details of many objects. To address this issue, PointRend and RefineMask predict a 112x112 segmentation mask resulting in higher quality segmentations. Both methods however have limitations by either not having access to neighboring features (PointRend) or by performing computation at all spatial locations instead of sparsely (RefineMask). In this work, we propose EffSeg performing fine-grained instance segmentation in an efficient way by using our Structure-Preserving Sparsity (SPS) method based on separately storing the active features, the passive features and a dense 2D index map containing the feature indices. The goal of the index map is to preserve the 2D spatial configuration or structure between the features such that any 2D operation can still be performed. EffSeg achieves similar performance on COCO compared to RefineMask, while reducing the number of FLOPs by 71% and increasing the FPS by 29%. Code will be released.
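A toy, hypothetical sketch of the Structure-Preserving Sparsity layout: only active features are stored densely packed, while a dense 2D index map records where each feature lives so that 2D neighborhoods can still be looked up. The exact data layout is an illustrative assumption.

```python
import torch

def to_sps(feature_map, active_mask):
    """feature_map: (C, H, W); active_mask: (H, W) bool. Returns packed features + index map."""
    c, h, w = feature_map.shape
    ys, xs = active_mask.nonzero(as_tuple=True)
    active = feature_map[:, ys, xs].T                         # (n_active, C), packed storage
    index_map = torch.full((h, w), -1, dtype=torch.long)
    index_map[ys, xs] = torch.arange(len(ys))                 # dense 2D map into `active`
    return active, index_map

def neighbor_feature(active, index_map, y, x, dy, dx):
    """Fetch the feature of the (dy, dx) neighbor of pixel (y, x); zeros if passive."""
    idx = index_map[y + dy, x + dx].item()
    return active[idx] if idx >= 0 else torch.zeros(active.shape[1])
```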

Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations

  • paper_url: http://arxiv.org/abs/2307.01533
  • repo_url: https://github.com/anilosmantur/conditioned_video_anomaly_diffusion
  • paper_authors: Anil Osman Tur, Nicola Dall’Asen, Cigdem Beyan, Elisa Ricci
  • for: This work addresses unsupervised video anomaly detection (VAD): classifying each frame in a video as normal or abnormal without access to labels.
  • methods: A conditional diffusion model takes as input spatiotemporal features extracted by a pre-trained network, conditioned on compact motion representations that summarize a video segment's motion and appearance.
  • results: A data-driven threshold treats high reconstruction error as an indicator of anomalous events (see the sketch below). The method improves anomaly detection on two large-scale VAD benchmarks and generalizes better across datasets, outperforming both the state of the art and baseline methods.
    Abstract This paper aims to address the unsupervised video anomaly detection (VAD) problem, which involves classifying each frame in a video as normal or abnormal, without any access to labels. To accomplish this, the proposed method employs conditional diffusion models, where the input data is the spatiotemporal features extracted from a pre-trained network, and the condition is the features extracted from compact motion representations that summarize a given video segment in terms of its motion and appearance. Our method utilizes a data-driven threshold and considers a high reconstruction error as an indicator of anomalous events. This study is the first to utilize compact motion representations for VAD and the experiments conducted on two large-scale VAD benchmarks demonstrate that they supply relevant information to the diffusion model, and consequently improve VAD performances w.r.t the prior art. Importantly, our method exhibits better generalization performance across different datasets, notably outperforming both the state-of-the-art and baseline methods. The code of our method is available at https://github.com/AnilOsmanTur/conditioned_video_anomaly_diffusion
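A minimal sketch of the data-driven anomaly threshold mentioned above: fit the threshold on reconstruction errors from training clips (presumed mostly normal) and flag test clips whose error exceeds it. The mean + k*std rule is an assumption for illustration, not necessarily the paper's exact statistic.

```python
import numpy as np

def fit_threshold(train_errors, k=2.0):
    return float(np.mean(train_errors) + k * np.std(train_errors))

def flag_anomalies(test_errors, threshold):
    return np.asarray(test_errors) > threshold         # True where a frame/clip looks anomalous

# usage sketch with placeholder reconstruction errors
thr = fit_threshold(np.random.rand(1000) * 0.1)
print(flag_anomalies([0.05, 0.40, 0.02], thr))         # e.g. [False  True False]
```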

Exploiting Richness of Learned Compressed Representation of Images for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.01524
  • repo_url: https://github.com/DL4Compression/Semantic_Segmentation_of_Driving_Videos_on_Learning_based_Image_Compression
  • paper_authors: Ravi Kakaiya, Rakshith Sathish, Ramanathan Sethuraman, Debdoot Sheet
  • for: This paper proposes a learning-based compression codec to reduce the network latency incurred when transmitting data from autonomous vehicles and Advanced Driving Assistance Systems (ADAS) to cloud servers.
  • methods: A learning-based codec compresses the data before it is sent to the cloud, and the learned compressed representation is used directly for tasks such as semantic segmentation in addition to decompression, avoiding the decompression overhead of the standard pipeline.
  • results: On the Cityscapes dataset, the pipeline achieves a compression factor of up to 66x while preserving the information needed for segmentation (dice coefficient of 0.84 versus 0.88 on decompressed images) and reducing overall compute by 11%.
    Abstract Autonomous vehicles and Advanced Driving Assistance Systems (ADAS) have the potential to radically change the way we travel. Many such vehicles currently rely on segmentation and object detection algorithms to detect and track objects around its surrounding. The data collected from the vehicles are often sent to cloud servers to facilitate continual/life-long learning of these algorithms. Considering the bandwidth constraints, the data is compressed before sending it to servers, where it is typically decompressed for training and analysis. In this work, we propose the use of a learning-based compression Codec to reduce the overhead in latency incurred for the decompression operation in the standard pipeline. We demonstrate that the learned compressed representation can also be used to perform tasks like semantic segmentation in addition to decompression to obtain the images. We experimentally validate the proposed pipeline on the Cityscapes dataset, where we achieve a compression factor up to $66 \times$ while preserving the information required to perform segmentation with a dice coefficient of $0.84$ as compared to $0.88$ achieved using decompressed images while reducing the overall compute by $11\%$.

LPN: Language-guided Prototypical Network for few-shot classification

  • paper_url: http://arxiv.org/abs/2307.01515
  • repo_url: None
  • paper_authors: Kaihui Cheng, Chule Yang
  • for: This paper addresses few-shot classification, adapting to new tasks with limited labeled examples while making full use of the available data.
  • methods: The Language-guided Prototypical Network (LPN) exploits the complementarity of vision and language through two parallel branches: a pre-trained text encoder extracts class-level text features directly from class names, a language-guided decoder aligns them with visual features, a refined prototypical head builds robust prototypes in the text branch, and the visual and text logits are aggregated to calibrate single-modality deviations (a rough sketch follows below).
  • results: Extensive experiments show that LPN is competitive with state-of-the-art methods on benchmark datasets.
    Abstract Few-shot classification aims to adapt to new tasks with limited labeled examples. To fully use the accessible data, recent methods explore suitable measures for the similarity between the query and support images and better high-dimensional features with meta-training and pre-training strategies. However, the potential of multi-modality information has barely been explored, which may bring promising improvement for few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification, which leverages the complementarity of vision and language modalities via two parallel branches. Concretely, to introduce language modality with limited samples in the visual task, we leverage a pre-trained text encoder to extract class-level text features directly from class names while processing images with a conventional image encoder. Then, a language-guided decoder is introduced to obtain text features corresponding to each image by aligning class-level features with visual features. In addition, to take advantage of class-level features and prototypes, we build a refined prototypical head that generates robust prototypes in the text branch for follow-up measurement. Finally, we aggregate the visual and text logits to calibrate the deviation of a single modality. Extensive experiments demonstrate the competitiveness of LPN against state-of-the-art methods on benchmark datasets.
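A rough, hypothetical sketch of the prototype idea behind LPN: class prototypes come from both support-image features and class-name text features, and the two sets of logits are aggregated. The encoders are placeholders and the simple mean aggregation is an illustrative assumption, not the paper's refined prototypical head.

```python
import torch

def few_shot_logits(query_feats, support_feats, support_labels, class_text_feats, n_way):
    """query_feats: (Q, D); support_feats: (N*K, D); class_text_feats: (n_way, D)."""
    visual_protos = torch.stack([support_feats[support_labels == c].mean(0)
                                 for c in range(n_way)])           # (n_way, D)
    vis_logits = -torch.cdist(query_feats, visual_protos)          # negative distance as logit
    txt_logits = -torch.cdist(query_feats, class_text_feats)
    return (vis_logits + txt_logits) / 2                           # aggregate the two modalities

# usage sketch: 5-way 5-shot with random placeholder features
q, s = torch.randn(10, 64), torch.randn(25, 64)
y, t = torch.arange(5).repeat_interleave(5), torch.randn(5, 64)
logits = few_shot_logits(q, s, y, t, n_way=5)                      # (10, 5)
```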

SelfFed: Self-supervised Federated Learning for Data Heterogeneity and Label Scarcity in IoMT

  • paper_url: http://arxiv.org/abs/2307.01514
  • repo_url: None
  • paper_authors: Sunder Ali Khowaja, Kapal Dev, Syed Muhammad Anwar, Marius George Linguraru
  • for: This study proposes a self-supervised federated learning framework to address data heterogeneity and label scarcity in the Internet of Medical Things (IoMT).
  • methods: A two-phase approach: a pre-training phase performs augmentive modeling with a Swin Transformer based encoder in a decentralized manner, and a fine-tuning phase introduces a contrastive network and a novel aggregation strategy trained on limited labeled data for the target task, also in a decentralized manner.
  • results: On publicly available medical imaging datasets, SelfFed outperforms existing baselines under non-IID data and label scarcity, with maximum improvements of 8.8% and 4.1% on the Retina and COVID-FL non-IID datasets, and it outperforms baselines even when trained on only 10% of the labeled instances.
    Abstract Self-supervised learning in federated learning paradigm has been gaining a lot of interest both in industry and research due to the collaborative learning capability on unlabeled yet isolated data. However, self-supervised based federated learning strategies suffer from performance degradation due to label scarcity and diverse data distributions, i.e., data heterogeneity. In this paper, we propose the SelfFed framework for Internet of Medical Things (IoMT). Our proposed SelfFed framework works in two phases. The first phase is the pre-training paradigm that performs augmentive modeling using Swin Transformer based encoder in a decentralized manner. The first phase of SelfFed framework helps to overcome the data heterogeneity issue. The second phase is the fine-tuning paradigm that introduces contrastive network and a novel aggregation strategy that is trained on limited labeled data for a target task in a decentralized manner. This fine-tuning stage overcomes the label scarcity problem. We perform our experimental analysis on publicly available medical imaging datasets and show that our proposed SelfFed framework performs better when compared to existing baselines concerning non-independent and identically distributed (IID) data and label scarcity. Our method achieves a maximum improvement of 8.8% and 4.1% on Retina and COVID-FL datasets on non-IID dataset. Further, our proposed method outperforms existing baselines even when trained on a few (10%) labeled instances.

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

  • paper_url: http://arxiv.org/abs/2307.01492
  • repo_url: https://github.com/nvlabs/fb-bev
  • paper_authors: Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, Jose M. Alvarez
  • for: This technical report presents the winning solution to the 3D Occupancy Prediction Challenge, held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 23 Workshop on Vision-Centric Autonomous Driving.
  • methods: The solution, FB-OCC, builds on FB-BEV, a camera-based bird's-eye-view perception design using forward-backward projection. On top of FB-BEV, the authors study designs and optimizations tailored to 3D occupancy prediction, including joint depth-semantic pre-training, a joint voxel-BEV representation, model scaling-up, and effective post-processing strategies.
  • results: These designs yield a state-of-the-art mIoU of 54.19% on the nuScenes dataset, ranking first on the challenge track. Code and models will be released at https://github.com/NVlabs/FB-BEV.
    Abstract This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and CVPR 23 Workshop on Vision-Centric Autonomous Driving Workshop. Our proposed solution FB-OCC builds upon FB-BEV, a cutting-edge camera-based bird's-eye view perception design using forward-backward projection. On top of FB-BEV, we further study novel designs and optimization tailored to the 3D occupancy prediction task, including joint depth-semantic pre-training, joint voxel-BEV representation, model scaling up, and effective post-processing strategies. These designs and optimization result in a state-of-the-art mIoU score of 54.19% on the nuScenes dataset, ranking the 1st place in the challenge track. Code and models will be released at: https://github.com/NVlabs/FB-BEV.

Semantic Segmentation on 3D Point Clouds with High Density Variations

  • paper_url: http://arxiv.org/abs/2307.01489
  • repo_url: None
  • paper_authors: Ryan Faulkner, Luke Haub, Simon Ratcliffe, Ian Reid, Tat-Jun Chin
  • for: This paper tackles semantic segmentation of large-scale 3D point clouds from LiDAR surveying, which cover wide areas and long ranges and therefore exhibit large local density variations that existing downsampling/upsampling strategies handle poorly.
  • methods: It proposes HDVNet, an architecture with a nested set of encoder-decoder pathways, each handling a specific point-density range. Limiting the interconnections between feature maps lets HDVNet gauge the reliability of each feature based on a point's density, e.g., down-weighting high-density features that do not exist in low-density objects (a density-band sketch follows the abstract below).
  • results: On real point clouds with inconsistent density, HDVNet outperforms state-of-the-art models in segmentation accuracy while using just over half the weights.
    Abstract LiDAR scanning for surveying applications acquire measurements over wide areas and long distances, which produces large-scale 3D point clouds with significant local density variations. While existing 3D semantic segmentation models conduct downsampling and upsampling to build robustness against varying point densities, they are less effective under the large local density variations characteristic of point clouds from surveying applications. To alleviate this weakness, we propose a novel architecture called HDVNet that contains a nested set of encoder-decoder pathways, each handling a specific point density range. Limiting the interconnections between the feature maps enables HDVNet to gauge the reliability of each feature based on the density of a point, e.g., downweighting high density features not existing in low density objects. By effectively handling input density variations, HDVNet outperforms state-of-the-art models in segmentation accuracy on real point clouds with inconsistent density, using just over half the weights.
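A rough illustration of the density-aware grouping such an architecture relies on: estimate each point's local density from its k-NN radius and split the cloud into density bands that could feed separate pathways. The thresholds and band count below are invented for the example, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): bucket points of a LiDAR cloud into
# density bands, the kind of grouping a density-aware network like HDVNet could
# route through separate encoder-decoder pathways.
import numpy as np
from scipy.spatial import cKDTree

def density_bands(points, k=16, thresholds=(2.0, 0.5)):
    """Estimate local density from the k-NN radius and split points into
    high/medium/low density index sets."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)     # first neighbour is the point itself
    radius = dists[:, -1]                      # distance to the k-th neighbour
    high = np.where(radius < thresholds[1])[0]
    mid = np.where((radius >= thresholds[1]) & (radius < thresholds[0]))[0]
    low = np.where(radius >= thresholds[0])[0]
    return {"high": high, "mid": mid, "low": low}

if __name__ == "__main__":
    pts = np.random.rand(10000, 3) * np.array([100.0, 100.0, 10.0])  # toy survey scan
    bands = density_bands(pts)
    print({name: idx.size for name, idx in bands.items()})
```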

H-DenseFormer: An Efficient Hybrid Densely Connected Transformer for Multimodal Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.01486
  • repo_url: https://github.com/shijun18/h-denseformer
  • paper_authors: Jun Shi, Hongyu Kan, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Liang Qiao, Zhaohui Wang, Hong An, Xudong Xue
  • for: This paper proposes a hybrid densely connected network, H-DenseFormer, for tumor segmentation from multimodal medical images, addressing the limited representational ability, fixed modality counts, and high computational cost of existing methods.
  • methods: H-DenseFormer combines CNN and Transformer structures: a Transformer-based Multi-path Parallel Embedding (MPE) module takes an arbitrary number of modalities as input and extracts fused multimodal features, which are delivered to different levels of the encoder to strengthen multimodal representation learning. A lightweight Densely Connected Transformer (DCT) block replaces the standard Transformer block, significantly reducing computational complexity.
  • results: Extensive experiments on two public multimodal datasets, HECKTOR21 and PI-CAI22, show that the method outperforms existing state-of-the-art approaches at lower computational cost. The source code is available at https://github.com/shijun18/H-DenseFormer.
    Abstract Recently, deep learning methods have been widely used for tumor segmentation of multimodal medical images with promising results. However, most existing methods are limited by insufficient representational ability, specific modality number and high computational complexity. In this paper, we propose a hybrid densely connected network for tumor segmentation, named H-DenseFormer, which combines the representational power of the Convolutional Neural Network (CNN) and the Transformer structures. Specifically, H-DenseFormer integrates a Transformer-based Multi-path Parallel Embedding (MPE) module that can take an arbitrary number of modalities as input to extract the fusion features from different modalities. Then, the multimodal fusion features are delivered to different levels of the encoder to enhance multimodal learning representation. Besides, we design a lightweight Densely Connected Transformer (DCT) block to replace the standard Transformer block, thus significantly reducing computational complexity. We conduct extensive experiments on two public multimodal datasets, HECKTOR21 and PI-CAI22. The experimental results show that our proposed method outperforms the existing state-of-the-art methods while having lower computational complexity. The source code is available at https://github.com/shijun18/H-DenseFormer.

A Review of Driver Gaze Estimation and Application in Gaze Behavior Understanding

  • paper_url: http://arxiv.org/abs/2307.01470
  • repo_url: None
  • paper_authors: Pavan Kumar Sharma, Pranamesh Chakraborty
  • for: The main objective of this review is to give a comprehensive summary of driver-gaze fundamentals, methods for estimating driver gaze, and its applications in real-world driving scenarios.
  • methods: The review covers head-mounted and remote-setup gaze estimation and the terminology used for each data-collection method, lists existing benchmark driver-gaze datasets together with their collection methodology and equipment, and surveys estimation algorithms based on traditional machine learning and deep learning.
  • results: Estimated driver gaze is used to understand gaze behavior while maneuvering through intersections, on-ramps, off-ramps, and lane changes, and to determine the effect of roadside advertising structures. The paper also discusses limitations of the existing literature, open challenges, and future directions.
    Abstract Driver gaze plays an important role in different gaze-based applications such as driver attentiveness detection, visual distraction detection, gaze behavior understanding, and building driver assistance system. The main objective of this study is to perform a comprehensive summary of driver gaze fundamentals, methods to estimate driver gaze, and it's applications in real world driving scenarios. We first discuss the fundamentals related to driver gaze, involving head-mounted and remote setup based gaze estimation and the terminologies used for each of these data collection methods. Next, we list out the existing benchmark driver gaze datasets, highlighting the collection methodology and the equipment used for such data collection. This is followed by a discussion of the algorithms used for driver gaze estimation, which primarily involves traditional machine learning and deep learning based techniques. The estimated driver gaze is then used for understanding gaze behavior while maneuvering through intersections, on-ramps, off-ramps, lane changing, and determining the effect of roadside advertising structures. Finally, we have discussed the limitations in the existing literature, challenges, and the future scope in driver gaze estimation and gaze-based applications.

Generating Animatable 3D Cartoon Faces from Single Portraits

  • paper_url: http://arxiv.org/abs/2307.01468
  • repo_url: None
  • paper_authors: Chuanyu Pan, Guowei Yang, Taijiang Mu, Yu-Kun Lai
  • for: This work targets the growing demand for customized 3D avatars in virtual reality by generating animatable 3D cartoon faces from a single portrait image.
  • methods: An input real-world portrait is first transferred to a stylized cartoon image with a StyleGAN. A two-stage reconstruction then recovers the 3D cartoon face with detailed texture: a coarse estimate based on template models, refined by non-rigid deformation under landmark supervision. Finally, a semantic-preserving face rigging method based on manually created templates and deformation transfer makes the face animatable.
  • results: Qualitative and quantitative results show better accuracy, aesthetics, and similarity criteria than prior art, and the 3D model supports real-time facial animation.
    Abstract With the booming of virtual reality (VR) technology, there is a growing need for customized 3D avatars. However, traditional methods for 3D avatar modeling are either time-consuming or fail to retain similarity to the person being modeled. We present a novel framework to generate animatable 3D cartoon faces from a single portrait image. We first transfer an input real-world portrait to a stylized cartoon image with a StyleGAN. Then we propose a two-stage reconstruction method to recover the 3D cartoon face with detailed texture, which first makes a coarse estimation based on template models, and then refines the model by non-rigid deformation under landmark supervision. Finally, we propose a semantic preserving face rigging method based on manually created templates and deformation transfer. Compared with prior arts, qualitative and quantitative results show that our method achieves better accuracy, aesthetics, and similarity criteria. Furthermore, we demonstrate the capability of real-time facial animation of our 3D model.

Technical Report for Ego4D Long Term Action Anticipation Challenge 2023

  • paper_url: http://arxiv.org/abs/2307.01467
  • repo_url: None
  • paper_authors: Tatsuya Ishibashi, Kosuke Ono, Noriyuki Kugo, Yuji Sato
  • for: Predicting a sequence of future actions from an input video, for the Ego4D Long-Term Action Anticipation Challenge 2023.
  • methods: Three improvements over the baseline encoder-aggregator-decoder model: a model ensemble of SlowFast and SlowFast-CLIP, label smoothing to relax the order constraints on future actions, and constraining the predicted action class (verb, noun) with word co-occurrence (a label-smoothing sketch follows the abstract below).
  • results: The method outperforms the baseline and placed second on the public leaderboard.
    Abstract In this report, we describe the technical details of our approach for the Ego4D Long-Term Action Anticipation Challenge 2023. The aim of this task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video. To accomplish this task, we introduce three improvements to the baseline model, which consists of an encoder that generates clip-level features from the video, an aggregator that integrates multiple clip-level features, and a decoder that outputs Z future actions. 1) Model ensemble of SlowFast and SlowFast-CLIP; 2) Label smoothing to relax order constraints for future actions; 3) Constraining the prediction of the action class (verb, noun) based on word co-occurrence. Our method outperformed the baseline performance and recorded as second place solution on the public leaderboard.
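Label smoothing is a standard trick; the sketch below shows the form of smoothed cross-entropy that softens hard targets, with an illustrative vocabulary size and smoothing value rather than the challenge entry's actual settings.

```python
# A minimal sketch of label smoothing as used to relax hard targets when supervising
# future-action prediction; the smoothing value and class counts are illustrative.
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing=0.1):
    """Cross-entropy against a target distribution that puts (1 - smoothing) on the
    ground-truth class and spreads the rest uniformly over the remaining classes."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / (num_classes - 1))
        true_dist.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
    return -(true_dist * log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    # Four future steps, each a verb prediction over a toy vocabulary of 20 verbs.
    logits = torch.randn(4, 20, requires_grad=True)
    targets = torch.tensor([3, 7, 7, 12])
    loss = smoothed_cross_entropy(logits, targets, smoothing=0.2)
    loss.backward()
    print(float(loss))
```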

AdAM: Few-Shot Image Generation via Adaptation-Aware Kernel Modulation

  • paper_url: http://arxiv.org/abs/2307.01465
  • repo_url: None
  • paper_authors: Yunqing Zhao, Keshigeyan Chandrasegaran, Abdollahzadeh Milad, Chao Du, Tianyu Pang, Ruoteng Li, Henghui Ding, Ngai-Man Cheung
  • for: Few-shot image generation (FSIG): learning to generate new and diverse images from only a handful (e.g., 10) of training samples.
  • methods: A GAN pre-trained on a large-scale source domain is adapted to the target domain. Central to recent FSIG methods are knowledge-preservation criteria that select and keep a subset of source knowledge in the adapted model; the paper proposes Adaptation-Aware kernel Modulation (AdAM), which also accounts for the target domain/adaptation when selecting source knowledge (a toy kernel-modulation sketch follows the abstract below).
  • results: The paper first shows that once the assumption of close source-target proximity is relaxed, many existing SOTA methods that consider only the source domain perform no better than a baseline. AdAM then consistently achieves SOTA FSIG performance, including challenging setups where source and target domains are farther apart.
    Abstract Few-shot image generation (FSIG) aims to learn to generate new and diverse images given few (e.g., 10) training samples. Recent work has addressed FSIG by leveraging a GAN pre-trained on a large-scale source domain and adapting it to the target domain with few target samples. Central to recent FSIG methods are knowledge preservation criteria, which select and preserve a subset of source knowledge to the adapted model. However, a major limitation of existing methods is that their knowledge preserving criteria consider only source domain/task and fail to consider target domain/adaptation in selecting source knowledge, casting doubt on their suitability for setups of different proximity between source and target domain. Our work makes two contributions. Firstly, we revisit recent FSIG works and their experiments. We reveal that under setups which assumption of close proximity between source and target domains is relaxed, many existing state-of-the-art (SOTA) methods which consider only source domain in knowledge preserving perform no better than a baseline method. As our second contribution, we propose Adaptation-Aware kernel Modulation (AdAM) for general FSIG of different source-target domain proximity. Extensive experiments show that AdAM consistently achieves SOTA performance in FSIG, including challenging setups where source and target domains are more apart.
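The toy sketch below shows only the mechanics of kernel modulation: a frozen source convolution whose filters are rescaled by a small set of learnable per-filter weights. How AdAM actually estimates which kernels matter for the target domain is not reproduced here, and the module names are hypothetical.

```python
# Toy sketch of modulating frozen convolution kernels with learnable per-filter scales,
# in the spirit of kernel modulation for few-shot adaptation. Not AdAM's actual code.
import torch
import torch.nn as nn

class ModulatedConv2d(nn.Module):
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():      # source knowledge stays frozen
            p.requires_grad_(False)
        # one learnable multiplicative scale per output filter
        self.scale = nn.Parameter(torch.ones(conv.out_channels, 1, 1, 1))

    def forward(self, x):
        w = self.conv.weight * self.scale
        return nn.functional.conv2d(x, w, self.conv.bias,
                                    stride=self.conv.stride,
                                    padding=self.conv.padding)

if __name__ == "__main__":
    pretrained = nn.Conv2d(3, 8, 3, padding=1)          # stands in for a source GAN layer
    layer = ModulatedConv2d(pretrained)
    few_shot_batch = torch.randn(10, 3, 32, 32)          # e.g. 10 target images
    out = layer(few_shot_batch)
    # only the modulation scales would be updated during adaptation
    print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```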

Unsupervised Quality Prediction for Improved Single-Frame and Weighted Sequential Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2307.01464
  • repo_url: None
  • paper_authors: Helen Carson, Jason J. Ford, Michael Milford
  • for: Improving the integrity and predictability of localization and Visual Place Recognition (VPR), capabilities that matter as much as raw accuracy for safety- or operationally-critical autonomous systems.
  • methods: A new training-free approach predicts the likely quality of individual localization estimates, and these predictions are used to bias a sequence-matching process, producing gains beyond naive sequence matching; the combined system is lightweight, runs in real time, and is agnostic to the underlying VPR technique (a weighted sequence-matching sketch follows the abstract below).
  • results: Across four datasets and three VPR techniques, the combined system improves precision, especially at the high-precision/low-recall operating point. Ablations quantify the separate contributions of the prediction and weighted sequence-matching components, and the relationship between prediction quality and the benefit of the weighted matcher.
    Abstract While substantial progress has been made in the absolute performance of localization and Visual Place Recognition (VPR) techniques, it is becoming increasingly clear from translating these systems into applications that other capabilities like integrity and predictability are just as important, especially for safety- or operationally-critical autonomous systems. In this research we present a new, training-free approach to predicting the likely quality of localization estimates, and a novel method for using these predictions to bias a sequence-matching process to produce additional performance gains beyond that of a naive sequence matching approach. Our combined system is lightweight, runs in real-time and is agnostic to the underlying VPR technique. On extensive experiments across four datasets and three VPR techniques, we demonstrate our system improves precision performance, especially at the high-precision/low-recall operating point. We also present ablation and analysis identifying the performance contributions of the prediction and weighted sequence matching components in isolation, and the relationship between the quality of the prediction system and the benefits of the weighted sequential matcher.
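A hedged sketch of quality-weighted sequence matching: score candidate reference positions by a weighted average of single-frame distances, so frames with low predicted quality contribute less. It assumes a fixed-slope (constant velocity) alignment, which is a simplification of the paper's matcher, and all names are placeholders.

```python
# Rough sketch (assumed, not the paper's exact formulation): sequence matching over a
# query-to-reference distance matrix, where each query frame's contribution is weighted
# by a predicted quality score, so unreliable single-frame estimates count for less.
import numpy as np

def weighted_sequence_match(dist, quality, seq_len=5):
    """dist: (Q, R) distance matrix, quality: (Q,) weights in [0, 1].
    Returns the best-matching reference index for the last query frame."""
    Q, R = dist.shape
    q_idx = np.arange(Q - seq_len, Q)                 # most recent seq_len query frames
    w = quality[q_idx]
    best_ref, best_score = -1, np.inf
    for r_end in range(seq_len - 1, R):
        r_idx = np.arange(r_end - seq_len + 1, r_end + 1)
        score = np.sum(w * dist[q_idx, r_idx]) / (np.sum(w) + 1e-8)
        if score < best_score:
            best_ref, best_score = r_end, score
    return best_ref, best_score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dist = rng.random((20, 200))
    dist[np.arange(20), np.arange(100, 120)] *= 0.1   # plant a true diagonal match
    quality = rng.uniform(0.2, 1.0, size=20)
    print(weighted_sequence_match(dist, quality, seq_len=5))
```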

Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.01462
  • repo_url: None
  • paper_authors: Minh-Quan Dao, Julie Stephany Berrio, Vincent Frémont, Mao Shan, Elwan Héry, Stewart Worrall
  • for: Mitigating occlusion in LiDAR-based object detection, which becomes safety-critical in urban traffic where the ego vehicle's field of view is severely reduced by large numbers of road users.
  • methods: Collaborative perception via Vehicle-to-Everything (V2X) communication, where connected agents at multiple locations contribute to a complete scene representation; a mid-collaboration design exchanges bird's-eye-view features to balance detection performance against bandwidth, while minimizing changes to single-vehicle detection models and relaxing unrealistic inter-agent synchronization assumptions (a simple BEV-fusion sketch follows the abstract below).
  • results: On the V2X-Sim dataset, the proposed collaboration method reaches 98% of the performance of an early-collaboration method while consuming only the equivalent bandwidth of a late-collaboration method.
    Abstract Occlusion is a major challenge for LiDAR-based object detection methods. This challenge becomes safety-critical in urban traffic where the ego vehicle must have reliable object detection to avoid collision while its field of view is severely reduced due to the obstruction posed by a large number of road users. Collaborative perception via Vehicle-to-Everything (V2X) communication, which leverages the diverse perspective thanks to the presence at multiple locations of connected agents to form a complete scene representation, is an appealing solution. State-of-the-art V2X methods resolve the performance-bandwidth tradeoff using a mid-collaboration approach where the Bird-Eye View images of point clouds are exchanged so that the bandwidth consumption is lower than communicating point clouds as in early collaboration, and the detection performance is higher than late collaboration, which fuses agents' output, thanks to a deeper interaction among connected agents. While achieving strong performance, the real-world deployment of most mid-collaboration approaches is hindered by their overly complicated architectures, involving learnable collaboration graphs and autoencoder-based compressor/ decompressor, and unrealistic assumptions about inter-agent synchronization. In this work, we devise a simple yet effective collaboration method that achieves a better bandwidth-performance tradeoff than prior state-of-the-art methods while minimizing changes made to the single-vehicle detection models and relaxing unrealistic assumptions on inter-agent synchronization. Experiments on the V2X-Sim dataset show that our collaboration method achieves 98\% of the performance of an early-collaboration method, while only consuming the equivalent bandwidth of a late-collaboration method.
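For intuition, a simplified mid-fusion step is sketched below: connected agents share BEV feature maps (assumed here to already be warped into the ego frame) and the ego vehicle fuses them cell-wise. The paper's actual aggregation and asynchrony handling are more involved; this only illustrates the general idea.

```python
# Simplified sketch of mid-level collaborative fusion of bird's-eye-view features.
import torch

def fuse_bev_features(ego_bev, other_bevs, mode="max"):
    """ego_bev: (C, H, W); other_bevs: list of (C, H, W) maps from connected agents,
    assumed already aligned to the ego coordinate frame."""
    stack = torch.stack([ego_bev, *other_bevs], dim=0)   # (A, C, H, W)
    if mode == "max":
        return stack.max(dim=0).values                   # keep the strongest response per cell
    return stack.mean(dim=0)

if __name__ == "__main__":
    ego = torch.zeros(64, 128, 128)
    neighbour = torch.zeros(64, 128, 128)
    neighbour[:, 40:60, 40:60] = 1.0                     # object visible only to the neighbour
    fused = fuse_bev_features(ego, [neighbour])
    print(fused[:, 50, 50].max())                        # the occluded object now appears
```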

Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network

  • paper_url: http://arxiv.org/abs/2307.01447
  • repo_url: None
  • paper_authors: Zizhuo Li, Jiayi Ma
  • for: Improving the accuracy and efficiency of local feature matching, which underpins tasks such as relative camera estimation, fundamental matrix estimation, and visual localization.
  • methods: MaKeGNN, a sparse attention-based GNN that bypasses non-repeatable keypoints: a Bilateral Context-Aware Sampling Module dynamically samples two small sets of well-distributed keypoints with high matchability from the image pair, and a Matchable Keypoint-Assisted Context Aggregation Module treats them as message bottlenecks so that each keypoint retrieves context only from intra- and inter-image matchable keypoints, with matchability-guided attentional aggregation to suppress noisy connections.
  • results: State-of-the-art performance on relative camera estimation, fundamental matrix estimation, and visual localization, with significantly lower computational and memory complexity than typical attentional GNNs.
    Abstract Accurately matching local features between a pair of images is a challenging computer vision task. Previous studies typically use attention based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images for visual and geometric information reasoning. However, in the context of feature matching, considerable keypoints are non-repeatable due to occlusion and failure of the detector, and thus irrelevant for message passing. The connectivity with non-repeatable keypoints not only introduces redundancy, resulting in limited efficiency, but also interferes with the representation aggregation process, leading to limited accuracy. Targeting towards high accuracy and efficiency, we propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide compact and meaningful message passing. More specifically, our Bilateral Context-Aware Sampling Module first dynamically samples two small sets of well-distributed keypoints with high matchability scores from the image pair. Then, our Matchable Keypoint-Assisted Context Aggregation Module regards sampled informative keypoints as message bottlenecks and thus constrains each keypoint only to retrieve favorable contextual information from intra- and inter- matchable keypoints, evading the interference of irrelevant and redundant connectivity with non-repeatable ones. Furthermore, considering the potential noise in initial keypoints and sampled matchable ones, the MKACA module adopts a matchability-guided attentional aggregation operation for purer data-dependent context propagation. By these means, we achieve the state-of-the-art performance on relative camera estimation, fundamental matrix estimation, and visual localization, while significantly reducing computational and memory complexity compared to typical attentional GNNs.

Continual Learning in Open-vocabulary Classification with Complementary Memory Systems

  • paper_url: http://arxiv.org/abs/2307.01430
  • repo_url: None
  • paper_authors: Zhen Zhu, Weijie Lyu, Yao Xiao, Derek Hoiem
  • for: Flexible continual learning for open-vocabulary image classification, drawing inspiration from the complementary learning systems observed in human cognition.
  • methods: A "tree probe" method, an adaptation of lazy-learning principles, enables fast learning from new examples with accuracy competitive with batch-trained linear models. Predictions from a CLIP zero-shot model and the exemplar-based model are combined using the zero-shot estimated probability that a sample's class is within any of the exemplar classes (a prediction-mixing sketch follows the abstract below).
  • results: In data-incremental, class-incremental, and task-incremental settings, and under flexible inference over varying subsets of zero-shot and learned categories, the method achieves a good balance of learning speed, target-task effectiveness, and zero-shot effectiveness.
    Abstract We introduce a method for flexible continual learning in open-vocabulary image classification, drawing inspiration from the complementary learning systems observed in human cognition. We propose a "tree probe" method, an adaption of lazy learning principles, which enables fast learning from new examples with competitive accuracy to batch-trained linear models. Further, we propose a method to combine predictions from a CLIP zero-shot model and the exemplar-based model, using the zero-shot estimated probability that a sample's class is within any of the exemplar classes. We test in data incremental, class incremental, and task incremental settings, as well as ability to perform flexible inference on varying subsets of zero-shot and learned categories. Our proposed method achieves a good balance of learning speed, target task effectiveness, and zero-shot effectiveness.
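One way to read the combination rule in the abstract is sketched below: the zero-shot probability mass assigned to the exemplar classes decides how much of the prediction is delegated to the exemplar-based model. This is an interpretation for illustration, not the authors' code, and the class counts are toy values.

```python
# Hedged sketch of combining a zero-shot classifier with an exemplar-based classifier.
import numpy as np

def combine(p_zeroshot, p_exemplar, exemplar_class_ids):
    """p_zeroshot: (C,) probs over all C classes; p_exemplar: (K,) probs over the K
    exemplar classes listed in exemplar_class_ids. Returns combined (C,) probs."""
    w = p_zeroshot[exemplar_class_ids].sum()          # prob the class is an exemplar class
    combined = (1.0 - w) * p_zeroshot
    combined[exemplar_class_ids] += w * p_exemplar    # redistribute mass via exemplar model
    return combined / combined.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    p_zs = rng.dirichlet(np.ones(10))                  # 10 open-vocabulary classes
    exemplar_ids = np.array([2, 5, 7])                 # classes we have stored exemplars for
    p_ex = rng.dirichlet(np.ones(3))
    print(combine(p_zs, p_ex, exemplar_ids))
```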

DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.01426
  • repo_url: https://github.com/sclbd/deepfakebench
  • paper_authors: Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, Baoyuan Wu
  • for: This paper addresses the lack of a standardized, unified, comprehensive benchmark for deepfake detection, which leads to unfair performance comparisons and potentially misleading results.
  • methods: DeepfakeBench provides a unified data management system to ensure consistent inputs across all detectors, an integrated framework implementing state-of-the-art methods, and standardized evaluation metrics and protocols to promote transparency and reproducibility.
  • results: The extensible, modular codebase contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of evaluation protocols and analysis tools, and comprehensive evaluations, along with new insights from analyses across perspectives such as data augmentation and backbones. All code, evaluations, and analyses are available at https://github.com/SCLBD/DeepfakeBench.
    Abstract A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark. This issue leads to unfair performance comparisons and potentially misleading results. Specifically, there is a lack of uniformity in data processing pipelines, resulting in inconsistent data inputs for detection models. Additionally, there are noticeable differences in experimental settings, and evaluation strategies and metrics lack standardization. To fill this gap, we present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions: 1) a unified data management system to ensure consistent input across all detectors, 2) an integrated framework for state-of-the-art methods implementation, and 3) standardized evaluation metrics and protocols to promote transparency and reproducibility. Featuring an extensible, modular-based codebase, DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations. Moreover, we provide new insights based on extensive analysis of these evaluations from various perspectives (e.g., data augmentations, backbones). We hope that our efforts could facilitate future research and foster innovation in this increasingly critical domain. All codes, evaluations, and analyses of our benchmark are publicly available at https://github.com/SCLBD/DeepfakeBench.

Consistent Multimodal Generation via A Unified GAN Framework

  • paper_url: http://arxiv.org/abs/2307.01425
  • repo_url: None
  • paper_authors: Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem
  • for: Generating multimodal image outputs (RGB, depth, and surface normals) with a single generative model, such that the outputs are both realistic and consistent with each other.
  • methods: The framework builds on the StyleGAN3 architecture with a shared backbone and modality-specific branches in the last layers of the synthesis network, and adds per-modality fidelity discriminators and a cross-modality consistency discriminator.
  • results: Experiments on the Stanford2D3D dataset demonstrate realistic and consistent generation of RGB, depth, and normal images. A training recipe extends the pretrained model to a new domain with only a few pairwise samples, and synthetically generated RGB-depth pairs are evaluated for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN.
    Abstract We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic, and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images. We also show a training recipe to easily extend our pretrained model on a new domain, even with a few pairwise data. We further evaluate the use of synthetically generated RGB and depth pairs for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN.

Direct Superpoints Matching for Fast and Robust Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.01362
  • repo_url: None
  • paper_authors: Aniket Gupta, Yiming Xie, Hanumant Singh, Huaizu Jiang
  • for: A simple yet effective point-cloud registration approach that directly matches superpoints to determine the rigid transformation between source and target point clouds.
  • methods: Superpoints are matched end-to-end with a global softmax layer, and the resulting matches are used to estimate the rigid transformation, without RANSAC-like post-processing refinement (a softmax-matching and weighted-Kabsch sketch follows the abstract below).
  • results: Compared with methods that directly predict corresponding points, leveraging the rich information in superpoint matches yields more accurate transformation estimates and effectively filters outliers without any refinement; the approach is fast and achieves state-of-the-art results on the challenging ModelNet and 3DMatch benchmarks. Code and model weights will be publicly released.
    Abstract Although deep neural networks endow the downsampled superpoints with discriminative feature representations, directly matching them is usually not used alone in state-of-the-art methods, mainly for two reasons. First, the correspondences are inevitably noisy, so RANSAC-like refinement is usually adopted. Such ad hoc postprocessing, however, is slow and not differentiable, which can not be jointly optimized with feature learning. Second, superpoints are sparse and thus more RANSAC iterations are needed. Existing approaches use the coarse-to-fine strategy to propagate the superpoints correspondences to the point level, which are not discriminative enough and further necessitates the postprocessing refinement. In this paper, we present a simple yet effective approach to extract correspondences by directly matching superpoints using a global softmax layer in an end-to-end manner, which are used to determine the rigid transformation between the source and target point cloud. Compared with methods that directly predict corresponding points, by leveraging the rich information from the superpoints matchings, we can obtain more accurate estimation of the transformation and effectively filter out outliers without any postprocessing refinement. As a result, our approach is not only fast, but also achieves state-of-the-art results on the challenging ModelNet and 3DMatch benchmarks. Our code and model weights will be publicly released.
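The overall recipe (soft superpoint matching followed by a closed-form rigid fit) can be illustrated as follows; feature extraction, bidirectional matching, and the paper's exact weighting are omitted, and the weighted Kabsch solve is a standard substitute rather than the authors' implementation.

```python
# Sketch under simplifying assumptions: match superpoints by a softmax over feature
# similarities, then recover the rigid transform with a weighted Kabsch (SVD) solve.
import numpy as np

def soft_matches(feat_src, feat_tgt, temperature=0.1):
    sim = (feat_src @ feat_tgt.T) / temperature                # (N, M) similarity
    P = np.exp(sim - sim.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)                    # row-stochastic matches

def weighted_kabsch(src, tgt_soft, weights):
    """Rigid transform aligning src to the softly matched targets."""
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(0)
    mu_t = (w[:, None] * tgt_soft).sum(0)
    H = (src - mu_s).T @ (w[:, None] * (tgt_soft - mu_t))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(50, 3))
    angle = np.pi / 6
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                       [np.sin(angle),  np.cos(angle), 0],
                       [0, 0, 1]])
    tgt = src @ R_true.T + np.array([0.5, -0.2, 0.1])
    feats = rng.normal(size=(50, 32))
    P = soft_matches(feats, feats)                  # identical features => near-identity matches
    tgt_soft = P @ tgt                              # expected target for every source point
    conf = P.max(axis=1)                            # use match confidence as weight
    R, t = weighted_kabsch(src, tgt_soft, conf)
    print(np.linalg.norm(R - R_true))               # rotation error, should be ~0
```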

Patch-CNN: Training data-efficient deep learning for high-fidelity diffusion tensor estimation from minimal diffusion protocols

  • paper_url: http://arxiv.org/abs/2307.01346
  • repo_url: None
  • paper_authors: Tobias Goodwin-Allcock, Ting Gong, Robert Gray, Parashkev Nachev, Hui Zhang
  • for: Estimating high-fidelity diffusion tensors (DT) from only six-direction diffusion-weighted images (DWI), the minimal protocol feasible in acute clinical settings.
  • methods: A deep-learning approach, Patch-CNN, uses a minimal (non-voxel-wise) 3x3x3 convolutional kernel, allowing the network to exploit local anatomical information (unlike voxel-wise FCNs) while vastly reducing training-data demand compared with image-wise CNNs (a toy network sketch follows the abstract below).
  • results: Trained with a single subject and evaluated against conventional model fitting and a voxel-wise FCN, Patch-CNN improves the estimation of both scalar dMRI parameters and fibre orientations from six-direction DWIs, and the improved orientations produce better tractograms.
    Abstract We propose a new method, Patch-CNN, for diffusion tensor (DT) estimation from only six-direction diffusion weighted images (DWI). Deep learning-based methods have been recently proposed for dMRI parameter estimation, using either voxel-wise fully-connected neural networks (FCN) or image-wise convolutional neural networks (CNN). In the acute clinical context -- where pressure of time limits the number of imaged directions to a minimum -- existing approaches either require an infeasible number of training images volumes (image-wise CNNs), or do not estimate the fibre orientations (voxel-wise FCNs) required for tractogram estimation. To overcome these limitations, we propose Patch-CNN, a neural network with a minimal (non-voxel-wise) convolutional kernel (3$\times$3$\times$3). Compared with voxel-wise FCNs, this has the advantage of allowing the network to leverage local anatomical information. Compared with image-wise CNNs, the minimal kernel vastly reduces training data demand. Evaluated against both conventional model fitting and a voxel-wise FCN, Patch-CNN, trained with a single subject is shown to improve the estimation of both scalar dMRI parameters and fibre orientation from six-direction DWIs. The improved fibre orientation estimation is shown to produce improved tractogram.
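A toy stand-in for a patch-based tensor estimator is sketched below: a small 3D CNN with 3x3x3 kernels mapping a six-direction DWI patch to the six unique tensor coefficients of the central voxel. Layer sizes and the patch size are assumptions, not the published architecture.

```python
# Toy sketch (layer sizes are assumptions): a small 3D CNN with 3x3x3 kernels that maps
# a 6-direction DWI patch to the 6 unique diffusion-tensor coefficients
# (Dxx, Dyy, Dzz, Dxy, Dxz, Dyz) of the central voxel.
import torch
import torch.nn as nn

class PatchDTNet(nn.Module):
    def __init__(self, n_directions=6, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(n_directions, hidden, kernel_size=3), nn.ReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3), nn.ReLU(),
            nn.Conv3d(hidden, 6, kernel_size=3),          # 6 tensor coefficients
        )

    def forward(self, x):
        # x: (B, n_directions, 7, 7, 7) patch -> (B, 6) prediction for the central voxel
        return self.net(x).flatten(1)

if __name__ == "__main__":
    model = PatchDTNet()
    dwi_patch = torch.rand(2, 6, 7, 7, 7)        # two normalized 7^3 DWI patches
    print(model(dwi_patch).shape)                 # torch.Size([2, 6])
```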

Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning

  • paper_url: http://arxiv.org/abs/2307.01200
  • repo_url: None
  • paper_authors: Yuxiang Zhang, Hongwen Zhang, Liangxiao Hu, Hongwei Yi, Shengping Zhang, Yebin Liu
  • for: A learning-based monocular motion-capture system that captures full-body motion in world space in real time while remaining accurate.
  • methods: A sequential proxy-to-motion learning scheme uses a proxy dataset of 2D skeleton sequences and 3D rotational motions in world space, providing accurate full-body supervision while mitigating generalization issues; a contact-aware neural motion descent module accounts for foot-ground contact and motion misalignment with the proxy observations, and body-hand context information is shared for more compatible wrist-pose recovery.
  • results: The result is the first real-time monocular full-body capture system with plausible foot-ground contact in world space. More video results are available at the project page: https://liuyebin.com/proxycap.
    Abstract Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However, due to the challenges in data collection and network designs, it remains challenging for existing solutions to achieve real-time full-body capture while being accurate in world space. In this work, we contribute a sequential proxy-to-motion learning scheme together with a proxy dataset of 2D skeleton sequences and 3D rotational motions in world space. Such proxy data enables us to build a learning-based network with accurate full-body supervision while also mitigating the generalization issues. For more accurate and physically plausible predictions, a contact-aware neural motion descent module is proposed in our network so that it can be aware of foot-ground contact and motion misalignment with the proxy observations. Additionally, we share the body-hand context information in our network for more compatible wrist poses recovery with the full-body model. With the proposed learning-based solution, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space. More video results can be found at our project page: https://liuyebin.com/proxycap.

Segment Anything Meets Point Tracking

  • paper_url: http://arxiv.org/abs/2307.01197
  • repo_url: https://github.com/syscv/sam-pt
  • paper_authors: Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu
  • for: Extending the Segment Anything Model (SAM) to track and segment anything in dynamic videos.
  • methods: SAM-PT drives mask generation with robust, sparse point selection and propagation: query points are initialized with K-Medoids clustering, both positive and negative points are tracked to clearly distinguish the target object, multiple mask-decoding passes refine the mask, and a point re-initialization strategy improves tracking accuracy. Point propagation exploits local structure information that is agnostic to object semantics, in contrast to object-centric mask propagation (a K-Medoids point-selection sketch follows the abstract below).
  • results: Strong zero-shot performance on popular video object segmentation benchmarks including DAVIS, YouTube-VOS, and MOSE, with further evaluation on the zero-shot open-world UVO benchmark. Code: https://github.com/SysCV/sam-pt.
    Abstract The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM's capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further enhance our approach, we utilize K-Medoids clustering for point initialization and track both positive and negative points to clearly distinguish the target object. We also employ multiple mask decoding passes for mask refinement and devise a point re-initialization strategy to improve tracking accuracy. Our code integrates different point trackers and video segmentation benchmarks and will be released at https://github.com/SysCV/sam-pt.
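K-Medoids initialization of query points can be sketched as follows: cluster the mask pixels and keep the medoids as well-spread positive prompts for the tracker. The simple alternating K-Medoids loop below is illustrative, not the repository's implementation.

```python
# Sketch (assumed, simplified): pick well-spread query points inside an object mask with
# a basic K-Medoids loop, the kind of initialization used to seed a point tracker.
import numpy as np

def k_medoids(points, k=4, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            cluster = points[assign == c]
            if len(cluster) == 0:
                continue
            # medoid = cluster member minimizing total distance to the rest of the cluster
            dc = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1).sum(1)
            new_medoids[c] = cluster[dc.argmin()]
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

if __name__ == "__main__":
    mask = np.zeros((120, 160), dtype=bool)
    mask[30:90, 50:130] = True                         # toy object mask
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(float)
    query_points = k_medoids(pixels, k=4)
    print(query_points)                                # (x, y) positive query points
```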

Investigating Data Memorization in 3D Latent Diffusion Models for Medical Image Synthesis

  • paper_url: http://arxiv.org/abs/2307.01148
  • repo_url: None
  • paper_authors: Salman Ul Hassan Dar, Arman Ghanaat, Jannik Kahmann, Isabelle Ayx, Theano Papavassiliu, Stefan O. Schoenberg, Sandy Engelhardt
  • for: Assessing whether generative latent diffusion models, used to synthesize realistic medical images for open data sharing without compromising patient privacy, memorize sensitive patient training data.
  • methods: 3D latent diffusion models are trained on photon-counting coronary CT angiography and knee MRI datasets, and self-supervised models based on contrastive learning are used to detect synthetic samples that closely resemble training samples (a nearest-neighbour similarity sketch follows the abstract below).
  • results: The models are found to memorize training data, indicating a dire need for strategies to mitigate memorization.
    Abstract Generative latent diffusion models have been established as state-of-the-art in data generation. One promising application is generation of realistic synthetic medical imaging data for open data sharing without compromising patient privacy. Despite the promise, the capacity of such models to memorize sensitive patient training data and synthesize samples showing high resemblance to training data samples is relatively unexplored. Here, we assess the memorization capacity of 3D latent diffusion models on photon-counting coronary computed tomography angiography and knee magnetic resonance imaging datasets. To detect potential memorization of training samples, we utilize self-supervised models based on contrastive learning. Our results suggest that such latent diffusion models indeed memorize training data, and there is a dire need for devising strategies to mitigate memorization.
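A minimal sketch of the detection idea, with a placeholder for the contrastive encoder: embed training and synthetic samples, then flag synthetic samples whose nearest-training cosine similarity is suspiciously high. The encoder stand-in and the threshold are illustrative choices, not the paper's settings.

```python
# Illustration only: flag possible memorization by comparing contrastive embeddings of
# synthetic samples against their nearest training samples.
import numpy as np

def nearest_train_similarity(train_emb, synth_emb):
    """Embeddings are assumed L2-normalized; returns max cosine similarity per synthetic sample."""
    return (synth_emb @ train_emb.T).max(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def l2norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
    train = l2norm(rng.normal(size=(500, 128)))       # stand-in contrastive embeddings
    synth = l2norm(rng.normal(size=(100, 128)))
    synth[:5] = train[:5]                              # pretend 5 samples were memorized copies
    sims = nearest_train_similarity(train, synth)
    suspected = np.where(sims > 0.95)[0]               # threshold is an illustrative choice
    print(suspected)                                   # -> [0 1 2 3 4]
```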

AVSegFormer: Audio-Visual Segmentation with Transformer

  • paper_url: http://arxiv.org/abs/2307.01146
  • repo_url: https://github.com/vvvb-github/avsegformer
  • paper_authors: Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
  • for: The audio-visual segmentation (AVS) task: locating and segmenting the sounding objects in a given video, which demands audio-driven pixel-level scene understanding.
  • methods: AVSegFormer, a transformer-based framework that introduces audio queries and learnable queries into the transformer decoder so the network can selectively attend to visual features of interest, an audio-visual mixer that dynamically amplifies relevant and suppresses irrelevant spatial channels, and an intermediate mask loss that strengthens decoder supervision (an audio-visual mixer sketch follows the abstract below).
  • results: Extensive experiments show that AVSegFormer achieves state-of-the-art results on the AVS benchmark. Code: https://github.com/vvvb-github/AVSegFormer.
    Abstract The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to interested visual features. Besides, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. Additionally, we devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
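A loose sketch of an audio-conditioned channel mixer is given below: an audio embedding produces per-channel gates that amplify or suppress channels of the visual feature map. The dimensions and sigmoid gating are assumptions rather than AVSegFormer's exact module.

```python
# Loose sketch of an audio-conditioned channel mixer; not AVSegFormer's actual module.
import torch
import torch.nn as nn

class AudioVisualMixer(nn.Module):
    def __init__(self, audio_dim=128, visual_channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, visual, audio):
        # visual: (B, C, H, W), audio: (B, audio_dim)
        g = self.gate(audio).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return visual * g

if __name__ == "__main__":
    mixer = AudioVisualMixer()
    vis = torch.randn(2, 256, 56, 56)
    aud = torch.randn(2, 128)
    print(mixer(vis, aud).shape)                        # torch.Size([2, 256, 56, 56])
```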