cs.CV - 2023-10-25

Exploring Question Decomposition for Zero-Shot VQA

  • paper_url: http://arxiv.org/abs/2310.17050
  • repo_url: None
  • paper_authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu
  • for: Improving performance on Visual Question Answering (VQA) tasks by better mimicking human question-answering strategies.
  • methods: Probing large vision-language models' ability to use human-written question decompositions and to produce their own, learning both tasks from demonstrations alone; a model-driven selective decomposition approach applies model-written decompositions only when needed (see the sketch after this entry).
  • results: Selectively applying model-generated decompositions improves accuracy on eight VQA tasks, including gains of >20% on medical VQA datasets, and lifts BLIP-2's zero-shot performance above chance on a VQA reformulation of the Winoground task.
    Abstract Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/
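A minimal sketch of the selective-decomposition idea from the abstract above, assuming hypothetical `answer`, `confidence`, and `decompose` callables that wrap a vision-language model such as BLIP-2; the paper's actual prompting and selection criterion are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SelectiveDecomposer:
    # All three callables are placeholders for model wrappers (e.g. BLIP-2 prompts).
    answer: Callable[[str, str], str]             # (image, question) -> answer
    confidence: Callable[[str, str, str], float]  # (image, question, answer) -> score
    decompose: Callable[[str, str], List[str]]    # (image, question) -> sub-questions
    threshold: float = 0.5

    def __call__(self, image: str, question: str) -> str:
        direct = self.answer(image, question)
        if self.confidence(image, question, direct) >= self.threshold:
            return direct  # trust the direct prediction
        # Second-guess: decompose, answer the sub-questions, re-answer with that context.
        context = [f"{q} {self.answer(image, q)}" for q in self.decompose(image, question)]
        return self.answer(image, " ".join(context) + " " + question)


# Toy usage with stand-in callables.
demo = SelectiveDecomposer(
    answer=lambda img, q: "yes",
    confidence=lambda img, q, a: 0.3,
    decompose=lambda img, q: ["What objects are visible?", "What is the person holding?"],
)
print(demo("image.jpg", "Is the person cooking?"))
```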

Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis

  • paper_url: http://arxiv.org/abs/2310.17674
  • repo_url: None
  • paper_authors: Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, Michalis Raptis
  • for: Addressing the joint task of word-level text spotting and geometric layout analysis.
  • methods: Two novel components: a Unified-Detector-Polygon (UDP) that produces Bezier-curve polygons of text lines plus an affinity matrix for paragraph grouping between detected lines (sketched after this entry), and a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and merges them back into words.
  • results: State-of-the-art results on multiple word-level text-spotting benchmark datasets as well as on geometric layout analysis tasks.
    Abstract We propose Hierarchical Text Spotter (HTS), a novel method for the joint task of word-level text spotting and geometric layout analysis. HTS can recognize text in an image and identify its 4-level hierarchical structure: characters, words, lines, and paragraphs. The proposed HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.
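As a rough illustration of the paragraph-grouping step described above, the snippet below thresholds a line-to-line affinity matrix (of the kind UDP produces) and takes connected components via union-find; the threshold value and the detection/recognition stages themselves are assumptions, not the paper's implementation.

```python
import numpy as np

def group_lines_into_paragraphs(affinity: np.ndarray, thresh: float = 0.5):
    """Union-find over line pairs whose affinity exceeds a threshold."""
    n = affinity.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i, j] >= thresh:
                union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Lines 0 and 1 have high mutual affinity, so they form one paragraph; line 2 stands alone.
print(group_lines_into_paragraphs(np.array([[1.0, 0.9, 0.1],
                                             [0.9, 1.0, 0.2],
                                             [0.1, 0.2, 1.0]])))
```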

Trust, but Verify: Robust Image Segmentation using Deep Learning

  • paper_url: http://arxiv.org/abs/2310.16999
  • repo_url: None
  • paper_authors: Fahim Ahmed Zaman, Xiaodong Wu, Weiyu Xu, Milan Sonka, Raghuraman Mudumbai
  • for: Verifying the output of deep neural networks for medical image segmentation, with robustness against random and worst-case (adversarial) perturbations.
  • methods: Based on the authors' "Trust, but Verify" approach: an auxiliary verification network predicts masked features of the input image from the segmentation, and comparing those predictions with the original image exposes bad segmentations (a schematic check follows this entry).
  • results: Unlike previous evaluation methods based on deep regression networks, the new verification-network design avoids the false-negative vulnerability (bad segmentations labeled as good) and demonstrates robustness under different types of perturbations.
    Abstract We describe a method for verifying the output of a deep neural network for medical image segmentation that is robust to several classes of random as well as worst-case perturbations i.e. adversarial attacks. This method is based on a general approach recently developed by the authors called "Trust, but Verify" wherein an auxiliary verification network produces predictions about certain masked features in the input image using the segmentation as an input. A well-designed auxiliary network will produce high-quality predictions when the input segmentations are accurate, but will produce low-quality predictions when the segmentations are incorrect. Checking the predictions of such a network with the original image allows us to detect bad segmentations. However, to ensure the verification method is truly robust, we need a method for checking the quality of the predictions that does not itself rely on a black-box neural network. Indeed, we show that previous methods for segmentation evaluation that do use deep neural regression networks are vulnerable to false negatives i.e. can inaccurately label bad segmentations as good. We describe the design of a verification network that avoids such vulnerability and present results to demonstrate its robustness compared to previous methods.
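A schematic of the "Trust, but Verify" check under simplifying assumptions: the hypothetical `aux_predict` stands in for the auxiliary verification network, and a plain reconstruction error replaces the paper's carefully designed, non-black-box quality check.

```python
import numpy as np

def verify_segmentation(image, segmentation, aux_predict, err_thresh=0.1):
    """Mask the image where the segmentation claims the object is, ask the
    auxiliary network to predict the masked pixels from the segmentation,
    and compare its prediction against the original image."""
    mask = segmentation.astype(bool)
    masked_image = np.where(mask, 0.0, image)          # hide the claimed object region
    prediction = aux_predict(masked_image, segmentation)
    if not mask.any():
        return False                                   # empty segmentation: reject
    err = np.abs(prediction[mask] - image[mask]).mean()
    return err < err_thresh                            # True = segmentation accepted
```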

An Efficient Deep Learning-based approach for Recognizing Agricultural Pests in the Wild

  • paper_url: http://arxiv.org/abs/2310.16991
  • repo_url: https://github.com/mohtasimhadi/An-Efficient-Deep-Learning-Based-Approach-for-Recognizing-Agricultural-Pests-in-the-Wild
  • paper_authors: Mohtasim Hadi Rafi, Mohammad Ratul Mahjabin, Md Sabbir Rahman
  • for: Helping farmers control agricultural insect pests in a timely manner, improving yields and avoiding economic losses.
  • methods: Extensive experiments with transfer learning plus fine-tuning, attention mechanisms, and custom architectures for accurate and reliable pest recognition (a fine-tuning sketch follows this entry).
  • results: Experiments on the IP102 dataset show accurate recognition of diverse pest species, with robustness also illustrated on examples from the D0 dataset.
    Abstract One of the biggest challenges that the farmers go through is to fight insect pests during agricultural product yields. The problem can be solved easily and avoid economic losses by taking timely preventive measures. This requires identifying insect pests in an easy and effective manner. Most of the insect species have similarities between them. Without proper help from the agriculturist academician it is very challenging for the farmers to identify the crop pests accurately. To address this issue we have done extensive experiments considering different methods to find out the best method among all. This paper presents a detailed overview of the experiments done on mainly a robust dataset named IP102 including transfer learning with finetuning, attention mechanism and custom architecture. Some example from another dataset D0 is also shown to show robustness of our experimented techniques.
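A minimal transfer-learning-with-fine-tuning sketch in PyTorch of the kind referenced above; the backbone choice (ResNet-50), the frozen-backbone schedule, and `num_classes=102` (assuming the IP102 label set) are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 102  # assumed: the IP102 pest categories

model = models.resnet50(weights="DEFAULT")        # ImageNet-pretrained backbone
for p in model.parameters():
    p.requires_grad = False                       # freeze the backbone initially
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classification head

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

After the new head converges, the backbone layers are typically unfrozen with a lower learning rate for full fine-tuning.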

Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo Label Self-Refinement

  • paper_url: http://arxiv.org/abs/2310.16979
  • repo_url: None
  • paper_authors: Xingchen Zhao, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-Pang Chiu, Supun Samarasekera
  • for: Improving the performance of deep learning-based semantic segmentation models under domain shift, especially in real-world operating conditions.
  • methods: Teacher-student self-training with pseudo-labels, augmented by an auxiliary pseudo-label refinement network (PRN) that refines pseudo-labels online and localizes pixels whose predicted labels are likely noisy, so that only highly reliable labels guide the student (see the sketch after this entry).
  • results: On benchmark datasets with three different domain shifts, the method performs significantly better than previous state-of-the-art approaches, indicating improved robustness against pseudo-label noise propagation during different stages of adaptation.
    Abstract Deep learning-based solutions for semantic segmentation suffer from significant performance degradation when tested on data with different characteristics than what was used during the training. Adapting the models using annotated data from the new domain is not always practical. Unsupervised Domain Adaptation (UDA) approaches are crucial in deploying these models in the actual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ a teacher-student self-training approach, where a teacher model is used to generate pseudo-labels for the new data which in turn guide the training process of the student model. Though this approach has seen a lot of success, it suffers from the issue of noisy pseudo-labels being propagated in the training process. To address this issue, we propose an auxiliary pseudo-label refinement network (PRN) for online refining of the pseudo labels and also localizing the pixels whose predicted labels are likely to be noisy. Being able to improve the quality of pseudo labels and select highly reliable ones, PRN helps self-training of segmentation models to be robust against pseudo label noise propagation during different stages of adaptation. We evaluate our approach on benchmark datasets with three different domain shifts, and our approach consistently performs significantly better than the previous state-of-the-art methods.
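A schematic self-training step with pseudo-label refinement, where `teacher`, `student`, and `prn` are placeholders for the teacher model, the student segmenter, and the pseudo-label refinement network; the simple reliability thresholding shown here is an assumption standing in for the paper's refinement and selection mechanism.

```python
import torch
import torch.nn.functional as F

def adaptation_step(target_images, teacher, student, prn, optimizer, conf_thresh=0.9):
    with torch.no_grad():
        logits = teacher(target_images)            # (B, C, H, W) teacher predictions
        pseudo = logits.argmax(dim=1)              # raw pseudo-labels, (B, H, W)
        reliability = prn(target_images, logits)   # per-pixel reliability in [0, 1]
    keep = (reliability > conf_thresh).float()     # mask out likely-noisy pixels
    per_pixel = F.cross_entropy(student(target_images), pseudo, reduction="none")
    loss = (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```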

Improving Performance in Colorectal Cancer Histology Decomposition using Deep and Ensemble Machine Learning

  • paper_url: http://arxiv.org/abs/2310.16954
  • repo_url: None
  • paper_authors: Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala, Pekka Ruusuvuori, Teijo Kuopio
  • for: This paper aims to explore the potential of convolutional neural networks (CNNs) in facilitating the extraction of clinically relevant biomarkers from histologic samples for colorectal cancer management.
  • methods: The authors use a hybrid Deep and ensemble machine learning model to classify diverse tissue types from whole slide microscope images accurately, which is critical for amplifying the prognostic potential of imaging-based biomarkers.
  • results: The model achieved 96.74% accuracy on the external test set and 99.89% on the internal test set, demonstrating its high accuracy and potential for clinical application.
    Abstract In routine colorectal cancer management, histologic samples stained with hematoxylin and eosin are commonly used. Nonetheless, their potential for defining objective biomarkers for patient stratification and treatment selection is still being explored. The current gold standard relies on expensive and time-consuming genetic tests. However, recent research highlights the potential of convolutional neural networks (CNNs) in facilitating the extraction of clinically relevant biomarkers from these readily available images. These CNN-based biomarkers can predict patient outcomes comparably to golden standards, with the added advantages of speed, automation, and minimal cost. The predictive potential of CNN-based biomarkers fundamentally relies on the ability of convolutional neural networks (CNNs) to classify diverse tissue types from whole slide microscope images accurately. Consequently, enhancing the accuracy of tissue class decomposition is critical to amplifying the prognostic potential of imaging-based biomarkers. This study introduces a hybrid Deep and ensemble machine learning model that surpassed all preceding solutions for this classification task. Our model achieved 96.74% accuracy on the external test set and 99.89% on the internal test set. Recognizing the potential of these models in advancing the task, we have made them publicly available for further research and development.
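One plausible reading of a "deep and ensemble" pipeline is sketched below: CNN embeddings feeding a random forest. The backbone (ResNet-18) and the forest configuration are assumptions; the paper's exact hybrid model is not specified here.

```python
import numpy as np
import torch
from torchvision import models
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()        # use the CNN purely as a feature extractor
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224) tissue tiles -> (N, 512) embeddings."""
    return backbone(images).cpu().numpy()

# Hypothetical usage: X_train/X_test are image tensors, y_train the tissue-class labels.
# clf = RandomForestClassifier(n_estimators=500).fit(embed(X_train), y_train)
# tissue_pred = clf.predict(embed(X_test))
```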

Diagnosing Alzheimer’s Disease using Early-Late Multimodal Data Fusion with Jacobian Maps

  • paper_url: http://arxiv.org/abs/2310.16936
  • repo_url: None
  • paper_authors: Yasmine Mustafa, Tie Luo
  • for: Proposing an efficient early-late fusion (ELF) method for early identification of the four stages of Alzheimer's disease (AD).
  • methods: A convolutional neural network (CNN) for automated feature extraction combined with random forests for competitive performance on small datasets, plus a robust preprocessing pipeline that adapts to each subject and uses whole-brain images rather than slices or patches; images are transformed into the Jacobian domain to capture subtle changes in brain volume (a Jacobian-determinant sketch follows this entry).
  • results: Using MRI and CT images from the OASIS-3 dataset, the method classifies AD into four stages with an accuracy of 97.19%.
    Abstract Alzheimer's disease (AD) is a prevalent and debilitating neurodegenerative disorder impacting a large aging population. Detecting AD in all its presymptomatic and symptomatic stages is crucial for early intervention and treatment. An active research direction is to explore machine learning methods that harness multimodal data fusion to outperform human inspection of medical scans. However, existing multimodal fusion models have limitations, including redundant computation, complex architecture, and simplistic handling of missing data. Moreover, the preprocessing pipelines of medical scans remain inadequately detailed and are seldom optimized for individual subjects. In this paper, we propose an efficient early-late fusion (ELF) approach, which leverages a convolutional neural network for automated feature extraction and random forests for their competitive performance on small datasets. Additionally, we introduce a robust preprocessing pipeline that adapts to the unique characteristics of individual subjects and makes use of whole brain images rather than slices or patches. Moreover, to tackle the challenge of detecting subtle changes in brain volume, we transform images into the Jacobian domain (JD) to enhance both accuracy and robustness in our classification. Using MRI and CT images from the OASIS-3 dataset, our experiments demonstrate the effectiveness of the ELF approach in classifying AD into four stages with an accuracy of 97.19%.
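The Jacobian-domain transform mentioned above can be illustrated by computing the local Jacobian determinant of a displacement field; the registration step that produces the field is assumed and not shown, and the exact transform used in the paper may differ.

```python
import numpy as np

def jacobian_determinant(deformation: np.ndarray) -> np.ndarray:
    """deformation: (3, D, H, W) displacement field u(x); returns (D, H, W) det(I + du/dx).
    Values < 1 indicate local volume loss (e.g. atrophy), values > 1 expansion."""
    grads = [np.gradient(deformation[i]) for i in range(3)]          # du_i / d{z, y, x}
    J = np.stack([np.stack(g, axis=0) for g in grads], axis=0)       # (3, 3, D, H, W)
    J = J + np.eye(3)[:, :, None, None, None]                        # F = I + grad(u)
    J = np.moveaxis(J, (0, 1), (-2, -1))                             # (D, H, W, 3, 3)
    return np.linalg.det(J)

# Zero displacement gives |J| = 1 everywhere.
print(jacobian_determinant(np.zeros((3, 4, 4, 4))).mean())
```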

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

  • paper_url: http://arxiv.org/abs/2310.16898
  • repo_url: https://github.com/liangyn22/mcuformer
  • paper_authors: Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
  • for: Deploying deep learning models on Internet of Things (IoT) devices such as microcontrollers, reducing cost and energy consumption.
  • methods: A hardware-algorithm co-optimization method, MCUFormer, that deploys vision transformers on microcontrollers for image recognition by jointly designing the transformer architecture and the inference operator library under tight memory constraints.
  • results: With MCUFormer, an STM32F746 microcontroller reaches 73.62% top-1 accuracy on ImageNet image classification using only 320KB of memory.
    Abstract Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes significant contributions for ecological AI. Conventional methods successfully enable convolutional neural network inference of high resolution images on microcontrollers, while the framework for vision transformers that achieve the state-of-the-art performance in many vision applications still remains unexplored. In this paper, we propose a hardware-algorithm co-optimizations method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design transformer architecture and construct the inference operator library to fit the memory resource constraint. More specifically, we generalize the one-shot network architecture search (NAS) to discover the optimal architecture with highest task performance given the memory budget from the microcontrollers, where we enlarge the existing search space of vision transformers by considering the low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library of vision transformers, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the memory buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that our MCUFormer achieves 73.62\% top-1 accuracy on ImageNet for image classification with 320KB memory on STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.

SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

  • paper_url: http://arxiv.org/abs/2310.16838
  • repo_url: None
  • paper_authors: Qianxu Wang, Haotong Zhang, Congyue Deng, Yang You, Hao Dong, Yixin Zhu, Leonidas Guibas
  • for: Endowing robots with high-level semantic understanding of 3D scenes for dexterous object grasping and manipulation.
  • methods: Large 2D vision models distill semantic features from multiview images into a Distilled Feature Field (DFF) over the 3D scene; SparseDFF maps image features onto the 3D point cloud and refines them with a lightweight, contrastively trained network.
  • results: High-quality 3D DFFs are obtained from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that transfer across novel objects and scenes.
    Abstract Humans excel at transferring manipulation skills across diverse object shapes, poses, and appearances due to their understanding of semantic correspondences between different instances. To endow robots with a similar high-level understanding, we develop a Distilled Feature Field (DFF) for 3D scenes, leveraging large 2D vision models to distill semantic features from multiview images. While current research demonstrates advanced performance in reconstructing DFFs from dense views, the development of learning a DFF from sparse views is relatively nascent, despite its prevalence in numerous manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a novel method for acquiring view-consistent 3D DFFs from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that are transferable to novel scenes. Specifically, we map the image features to the 3D point cloud, allowing for propagation across the 3D space to establish a dense feature field. At the core of SparseDFF is a lightweight feature refinement network, optimized with a contrastive loss between pairwise views after back-projecting the image features onto the 3D point cloud. Additionally, we implement a point-pruning mechanism to augment feature continuity within each local neighborhood. By establishing coherent feature fields on both source and target scenes, we devise an energy function that facilitates the minimization of feature discrepancies w.r.t. the end-effector parameters between the demonstration and the target manipulation. We evaluate our approach using a dexterous hand, mastering real-world manipulations on both rigid and deformable objects, and showcase robust generalization in the face of object and scene-context variations.

LightSpeed: Light and Fast Neural Light Fields on Mobile Devices

  • paper_url: http://arxiv.org/abs/2310.16832
  • repo_url: https://github.com/lightspeed-r2l/lightspeed
  • paper_authors: Aarush Gupta, Junli Cao, Chaoyang Wang, Ju Hu, Sergey Tulyakov, Jian Ren, László A Jeni
  • for: Real-time novel-view synthesis on mobile devices, where computational power and storage are limited.
  • methods: Neural light field representations that map a ray representation directly to pixel color, enabling high-quality view synthesis on mobile devices.
  • results: The light-slab (two-plane) ray representation proves efficient and can be learned with feature grids that are fast to train and render (see the sketch after this entry); a divide-and-conquer strategy extends it to non-frontal scenes, and the method achieves better rendering quality and a better quality-speed trade-off than previous light field methods.
    Abstract Real-time novel-view image synthesis on mobile devices is prohibitive due to the limited computational power and storage. Using volumetric rendering methods, such as NeRF and its derivatives, on mobile devices is not suitable due to the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choice of ray representation is either stratified ray sampling or Plucker coordinates, overlooking the classic light slab (two-plane) representation, the preferred representation to interpolate between light field views. In this work, we find that using the light slab representation is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation enabling us to learn the 4D ray space using feature grids which are significantly faster to train and render. Although mostly designed for frontal views, we show that the light-slab representation can be further extended to non-frontal scenes using a divide-and-conquer strategy. Our method offers superior rendering quality compared to previous light field methods and achieves a significantly improved trade-off between rendering quality and speed.
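A small sketch of the classic light-slab (two-plane) ray parameterization discussed above: each ray is encoded by its intersections with two parallel planes, giving a 4D coordinate suitable for a feature-grid lookup. The plane placement (z = 0 and z = 1) is an arbitrary assumption, and rays parallel to the planes are not handled.

```python
import numpy as np

def light_slab_coords(origins: np.ndarray, directions: np.ndarray,
                      z_uv: float = 0.0, z_st: float = 1.0) -> np.ndarray:
    """origins, directions: (N, 3) rays. Returns (N, 4) = (u, v, s, t),
    the intersections with the planes z = z_uv and z = z_st."""
    t_uv = (z_uv - origins[:, 2]) / directions[:, 2]
    t_st = (z_st - origins[:, 2]) / directions[:, 2]
    uv = origins[:, :2] + t_uv[:, None] * directions[:, :2]
    st = origins[:, :2] + t_st[:, None] * directions[:, :2]
    return np.concatenate([uv, st], axis=1)

rays_o = np.array([[0.0, 0.0, -1.0]])
rays_d = np.array([[0.1, 0.2, 1.0]])
print(light_slab_coords(rays_o, rays_d))   # the 4D coordinate fed to the feature grid
```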

PERF: Panoramic Neural Radiance Field from a Single Panorama

  • paper_url: http://arxiv.org/abs/2310.16831
  • repo_url: https://github.com/perf-project/PeRF
  • paper_authors: Guangcong Wang, Peng Wang, Zhaoxi Chen, Wenping Wang, Chen Change Loy, Ziwei Liu
  • for: 360-degree novel view synthesis from a single panorama, enabling 3D roaming in complex scenes without expensive and tedious image collection.
  • methods: A collaborative RGBD inpainting method and a progressive inpainting-and-erasing strategy lift the 2D panorama to a 3D scene: a panoramic depth map is predicted from the single panorama and visible 3D regions are reconstructed with volume rendering; RGB images and depth maps from random views are completed using an RGB Stable Diffusion model and a monocular depth estimator; and the inpainting-and-erasing strategy removes geometry that is inconsistent between newly sampled and reference views.
  • results: The method outperforms state-of-the-art approaches on Replica and on a new dataset, PERF-in-the-wild, and can be widely used for real-world applications such as panorama-to-3D, text-to-3D, and 3D scene stylization.
    Abstract Neural Radiance Field (NeRF) has achieved substantial progress in novel view synthesis given multi-view images. Recently, some works have attempted to train a NeRF from a single image with 3D priors. They mainly focus on a limited field of view with a few occlusions, which greatly limits their scalability to real-world 360-degree panoramic scenarios with large-size occlusions. In this paper, we present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama. Notably, PERF allows 3D roaming in a complex scene without expensive and tedious image collection. To achieve this goal, we propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first predict a panoramic depth map as initialization given a single panorama and reconstruct visible 3D regions with volume rendering. Then we introduce a collaborative RGBD inpainting approach into a NeRF for completing RGB images and depth maps from random views, which is derived from an RGB Stable Diffusion model and a monocular depth estimator. Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent geometry between a newly-sampled view and reference views. The two components are integrated into the learning of NeRFs in a unified optimization framework and achieve promising results. Extensive experiments on Replica and a new dataset PERF-in-the-wild demonstrate the superiority of our PERF over state-of-the-art methods. Our PERF can be widely used for real-world applications, such as panorama-to-3D, text-to-3D, and 3D scene stylization applications. Project page and code are available at https://perf-project.github.io/ and https://github.com/perf-project/PeRF.

CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images

  • paper_url: http://arxiv.org/abs/2310.16825
  • repo_url: https://github.com/mosaicml/diffusion
  • paper_authors: Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov
  • for: Training open text-to-image diffusion models on Creative-Commons-licensed images.
  • methods: An intuitive transfer-learning technique produces high-quality synthetic captions paired with curated CC images, and a data- and compute-efficient training recipe (with roughly 3X training speed-ups) trains the models.
  • results: With synthetic captions and the efficient recipe, the CommonCanvas models reach quality comparable to SD2 trained on LAION-2B while needing as little as 3% of that data.
    Abstract We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce. In turn, to address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality. These results indicate that we have a sufficient number of CC images (~70 million) for training high-quality models. Our training recipe also implements a variety of optimizations that achieve ~3X training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality text-to-image models, which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on a human evaluation, despite being trained on our CC dataset that is significantly smaller than LAION and using synthetic captions for training. We release our models, data, and code at https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior

  • paper_url: http://arxiv.org/abs/2310.16818
  • repo_url: https://github.com/deepseek-ai/dreamcraft3d
  • paper_authors: Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, Yebin Liu
  • for: Hierarchical 3D content generation guided by a reference image, producing high-fidelity and coherent 3D objects.
  • methods: Two stages, geometry sculpting and texture boosting: view-dependent score distillation sampling enforces geometry consistency, while Bootstrapped Score Distillation, with a personalized diffusion model (DreamBooth) trained on augmented renderings of the scene, boosts texture fidelity.
  • results: Alternating optimization of the diffusion prior and the 3D scene representation yields mutually reinforcing improvements, substantial texture boosting, and coherent 3D objects with photorealistic renderings.
    Abstract We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation. Code available at https://github.com/deepseek-ai/DreamCraft3D.

Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation

  • paper_url: http://arxiv.org/abs/2310.16809
  • repo_url: https://github.com/scut-dlvclab/gpt-4v_ocr
  • paper_authors: Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin
  • for: Evaluating the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM), across a range of OCR tasks.
  • methods: A comprehensive evaluation pipeline that tests the model on scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents.
  • results: GPT-4V performs well on recognizing and understanding Latin content, but struggles with multilingual scenarios and with complex tasks such as handwritten mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document images.
    Abstract This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich document. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Specifically, it showed limitations when dealing with non-Latin languages and complex tasks such as handwriting mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document image. Based on these observations, we affirm the necessity and continued research value of specialized OCR models. In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models. How to fully utilize pre-trained general-purpose LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.

Fingervein Verification using Convolutional Multi-Head Attention Network

  • paper_url: http://arxiv.org/abs/2310.16808
  • repo_url: None
  • paper_authors: Raghavendra Ramachandra, Sushma Venkatesh
  • for: Proposing a new fingervein verification method based on a convolutional multi-head attention network, improving verification accuracy and reliability.
  • methods: The VeinAtnNet network extracts discriminant information from both normal and enhanced fingervein images while keeping the number of learnable parameters small, lightening the computational load (a generic conv + multi-head attention block is sketched after this entry).
  • results: On the newly collected FV-300 dataset and the publicly available FV-USM and FV-PolyU fingervein datasets, the proposed method is compared with five state-of-the-art fingervein verification systems and demonstrates superior effectiveness.
    Abstract Biometric verification systems are deployed in various security-based access-control applications that require user-friendly and reliable person verification. Among the different biometric characteristics, fingervein biometrics have been extensively studied owing to their reliable verification performance. Furthermore, fingervein patterns reside inside the skin and are not visible outside; therefore, they possess inherent resistance to presentation attacks and degradation due to external factors. In this paper, we introduce a novel fingervein verification technique using a convolutional multihead attention network called VeinAtnNet. The proposed VeinAtnNet is designed to achieve light weight with a smaller number of learnable parameters while extracting discriminant information from both normal and enhanced fingervein images. The proposed VeinAtnNet was trained on the newly constructed fingervein dataset with 300 unique fingervein patterns that were captured in multiple sessions to obtain 92 samples per unique fingervein. Extensive experiments were performed on the newly collected dataset FV-300 and the publicly available FV-USM and FV-PolyU fingervein dataset. The performance of the proposed method was compared with five state-of-the-art fingervein verification systems, indicating the efficacy of the proposed VeinAtnNet.
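A generic convolutional multi-head attention block in PyTorch, illustrating the pattern named in the title; the actual VeinAtnNet depths, widths, head counts, and training losses are not specified here, so everything below is an assumption.

```python
import torch
import torch.nn as nn

class ConvMultiHeadAttention(nn.Module):
    """Convolutional stem followed by multi-head self-attention over spatial tokens."""
    def __init__(self, in_ch: int = 1, dim: int = 64, heads: int = 4, emb_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, emb_dim)   # embedding compared during verification

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 1, H, W) fingervein image
        f = self.conv(x)                                   # (B, dim, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)              # (B, HW/16, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.head(attended.mean(dim=1))             # (B, emb_dim)

emb = ConvMultiHeadAttention()(torch.randn(2, 1, 64, 64))
print(emb.shape)   # torch.Size([2, 128])
```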

The GOOSE Dataset for Perception in Unstructured Environments

  • paper_url: http://arxiv.org/abs/2310.16788
  • repo_url: None
  • paper_authors: Peter Mortimer, Raphael Hagmanns, Miguel Granero, Thorsten Luettel, Janko Petereit, Hans-Joachim Wuensche
  • for: Improving the perception and interpretation capabilities of autonomous systems in unstructured outdoor environments.
  • methods: Deep learning on 10,000 labeled pairs of images and point clouds, used to train a range of state-of-the-art segmentation models on both modalities.
  • results: The large GOOSE dataset is released open source together with an ontology for unstructured terrain and dataset standards and guidelines, establishing a common framework for integrating existing datasets and quickly enhancing the perception of robots operating in unstructured environments.
    Abstract The potential for deploying autonomous systems can be significantly increased by improving the perception and interpretation of the environment. However, the development of deep learning-based techniques for autonomous systems in unstructured outdoor environments poses challenges due to limited data availability for training and testing. To address this gap, we present the German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset specifically designed for unstructured outdoor environments. The GOOSE dataset incorporates 10 000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models on both image and point cloud data. We open source the dataset, along with an ontology for unstructured terrain, as well as dataset standards and guidelines. This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments. The dataset, pre-trained models for offroad perception, and additional documentation can be found at https://goose-dataset.de/.

S$^3$-TTA: Scale-Style Selection for Test-Time Augmentation in Biomedical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.16783
  • repo_url: None
  • paper_authors: Kangxian Xie, Siyu Huang, Sebastian Cajas Ordone, Hanspeter Pfister, Donglai Wei
  • for: Improving the generalization of biomedical image segmentation at test time.
  • methods: The S$^3$-TTA framework selects a suitable image scale and style for each test image based on a transformation-consistency metric, and couples augmentation and segmentation in an end-to-end joint-training pipeline (a consistency-based selection sketch follows this entry).
  • results: On public benchmarks, S$^3$-TTA improves cell and lung segmentation by 3.4% and 1.3% respectively over the prior art, simply by augmenting the input data at test time.
    Abstract Deep-learning models have been successful in biomedical image segmentation. To generalize for real-world deployment, test-time augmentation (TTA) methods are often used to transform the test image into different versions that are hopefully closer to the training domain. Unfortunately, due to the vast diversity of instance scale and image styles, many augmented test images produce undesirable results, thus lowering the overall performance. This work proposes a new TTA framework, S$^3$-TTA, which selects the suitable image scale and style for each test image based on a transformation consistency metric. In addition, S$^3$-TTA constructs an end-to-end augmentation-segmentation joint-training pipeline to ensure a task-oriented augmentation. On public benchmarks for cell and lung segmentation, S$^3$-TTA demonstrates improvements over the prior art by 3.4% and 1.3%, respectively, by simply augmenting the input data in testing phase.
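A schematic of consistency-based selection of test-time augmentations: run the segmenter on several candidate variants and keep the prediction closest to the consensus. Only the scale dimension is shown; style transforms would enter the candidate set analogously, and the consistency metric here is a stand-in for the paper's.

```python
import torch
import torch.nn.functional as F

def s3_tta_predict(model, image: torch.Tensor, scales=(0.75, 1.0, 1.25)) -> torch.Tensor:
    """Run the segmenter on rescaled copies and keep the prediction closest to consensus."""
    h, w = image.shape[-2:]
    preds = []
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        p = torch.softmax(model(x), dim=1)                                 # (B, C, h', w')
        preds.append(F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False))
    stack = torch.stack(preds)                                             # (S, B, C, H, W)
    consensus = stack.mean(dim=0, keepdim=True)
    consistency = -(stack - consensus).pow(2).mean(dim=(1, 2, 3, 4))       # higher = closer
    return preds[int(consistency.argmax())]
```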

MixerFlow for Image Modelling

  • paper_url: http://arxiv.org/abs/2310.16777
  • repo_url: None
  • paper_authors: Eshant English, Matthias Kirchler, Christoph Lippert
  • for: Image modelling: density estimation and data generation from a single model.
  • methods: MixerFlow, a normalizing-flow architecture based on the MLP-Mixer that unifies generative and discriminative modelling designs and provides an effective weight-sharing mechanism for flow-based models.
  • results: Under a fixed computational budget, MixerFlow achieves better density estimation on image datasets than Glow-based architectures, scales well as image resolution increases, and yields more informative embeddings, making it a powerful yet simple alternative.
    Abstract Normalising flows are statistical models that transform a complex density into a simpler density through the use of bijective transformations enabling both density estimation and data generation from a single model. In the context of image modelling, the predominant choice has been the Glow-based architecture, whereas alternative architectures remain largely unexplored in the research community. In this work, we propose a novel architecture called MixerFlow, based on the MLP-Mixer architecture, further unifying the generative and discriminative modelling architectures. MixerFlow offers an effective mechanism for weight sharing for flow-based models. Our results demonstrate better density estimation on image datasets under a fixed computational budget and scales well as the image resolution increases, making MixeFlow a powerful yet simple alternative to the Glow-based architectures. We also show that MixerFlow provides more informative embeddings than Glow-based architectures.

ConvNets Match Vision Transformers at Scale

  • paper_url: http://arxiv.org/abs/2310.16764
  • repo_url: https://github.com/kyegomez/ConvNet
  • paper_authors: Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De
  • for: Testing whether ConvNets remain competitive with Vision Transformers when given access to web-scale datasets, across a range of compute budgets.
  • methods: Pre-training networks of increasing depth and width from the NFNet model family on JFT-4B, a large labelled image dataset, with compute budgets between 0.4k and 110k TPU-v4 core hours, followed by fine-tuning on ImageNet.
  • results: A log-log scaling law is observed between held-out loss and compute budget (a curve-fitting sketch follows this entry); after fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers at comparable compute, with the strongest fine-tuned model reaching 90.4% top-1 accuracy.
    Abstract Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
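The reported log-log scaling law between held-out loss and pretraining compute can be fit as a power law, loss ~ a * compute^b; the numbers below are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Illustrative placeholders: pretraining compute (TPU-v4 core-hours) vs. held-out loss.
compute = np.array([0.4e3, 2e3, 10e3, 50e3, 110e3])
loss = np.array([3.2, 2.8, 2.5, 2.2, 2.1])

# Fit loss ~ a * compute**b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"loss ~= {np.exp(log_a):.3f} * compute**{b:.3f}")
```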

SonoSAM – Segment Anything on Ultrasound Images

  • paper_url: http://arxiv.org/abs/2310.16872
  • repo_url: None
  • paper_authors: Hariharan Ravishankar, Rohan Patil, Vikram Melapudi, Parminder Bhatia, Kass-Hout Taha, Pavan Annangi
  • for: Proposing a reliable, generic foundational model for segmenting objects of interest on ultrasound images.
  • methods: The model is fine-tuned exclusively on a rich, diverse set of objects from roughly 200k ultrasound image-mask pairs, extended to 3-D (2-D + t) applications, and made more practical through fine-tuning followed by knowledge distillation into a smaller-footprint model without compromising performance.
  • results: SonoSAM achieves state-of-the-art performance on 8 unseen ultrasound datasets, outperforming competing methods by a significant margin on all metrics, with an average Dice similarity score above 90% within 2-6 clicks on average.
    Abstract In this paper, we present SonoSAM - a promptable foundational model for segmenting objects of interest on ultrasound images. Fine-tuned exclusively on a rich, diverse set of objects from roughly 200k ultrasound image-mask pairs, SonoSAM demonstrates state-of-the-art performance on 8 unseen ultrasound data-sets, outperforming competing methods by a significant margin on all metrics of interest. SonoSAM achieves average dice similarity score of more than 90% on almost all test datasets within 2-6 clicks on an average, making it a valuable tool for annotating ultrasound images. We also extend SonoSAM to 3-D (2-D +t) applications and demonstrate superior performance making it a valuable tool for generating dense annotations from ultrasound cine-loops. Further, to increase practical utility of SonoSAM, we propose a two-step process of fine-tuning followed by knowledge distillation to a smaller footprint model without comprising the performance. We present detailed qualitative and quantitative comparisons of SonoSAM with state-of-the art methods showcasing efficacy of SonoSAM as one of the first reliable, generic foundational model for ultrasound.

CAD – Contextual Multi-modal Alignment for Dynamic AVQA

  • paper_url: http://arxiv.org/abs/2310.16754
  • repo_url: None
  • paper_authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
  • for: Improving performance on the Audio Visual Question Answering (AVQA) task.
  • methods: A Contextual Multi-modal Alignment (CAD) network that addresses two shortcomings of existing AVQA methods: audio-visual (AV) information that is not aligned at the spatial and temporal levels, and semantic information that is not balanced between the audio and visual modalities within a context (a cross-attention sketch follows this entry).
  • results: On the MUSIC-AVQA dataset, the CAD network improves average performance over state-of-the-art methods by 9.4%, and the proposed components can be added to existing AVQA methods to improve their performance without additional complexity.
    Abstract In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
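A minimal bidirectional cross-attention block of the kind used to balance audio and visual streams at the semantic level; the CAD network's contextual block, temporal-alignment pre-training, and losses are not shown, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AVCrossAttention(nn.Module):
    """Each modality queries the other, then adds the result back as a residual."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, Ta, D) tokens, visual: (B, Tv, D) tokens
        visual_ctx, _ = self.a2v(visual, audio, audio)   # visual queries attend to audio
        audio_ctx, _ = self.v2a(audio, visual, visual)   # audio queries attend to visual
        return audio + audio_ctx, visual + visual_ctx

a, v = torch.randn(2, 10, 256), torch.randn(2, 20, 256)
a_out, v_out = AVCrossAttention()(a, v)
```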

Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots

  • paper_url: http://arxiv.org/abs/2310.16750
  • repo_url: https://github.com/ebnerluca/uw_depth
  • paper_authors: Luca Ebner, Gideon Billings, Stefan Williams
  • for: Real-time dense depth estimation from monocular underwater images for mobile underwater vehicles.
  • methods: The paper proposes a deep learning model that fuses sparse depth measurements from triangulated features to improve depth predictions and solve the scale ambiguity problem. The model uses an efficient encoder-decoder backbone and modern lightweight transformer optimization stage to encode global context.
  • results: The proposed method achieves significant improvement in depth prediction accuracy by fusing sparse feature priors, and achieves similar accuracy on a downward-looking dataset without any retraining. The method runs at 160 FPS on a laptop GPU and 7 FPS on a single CPU core, making it suitable for direct deployment on embedded systems.
    Abstract In this work, we address the problem of real-time dense depth estimation from monocular images for mobile underwater vehicles. We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions and solve the problem of scale ambiguity. To allow prior inputs of arbitrary sparsity, we apply a dense parameterization method. Our model extends recent state-of-the-art approaches to monocular image based depth estimation, using an efficient encoder-decoder backbone and modern lightweight transformer optimization stage to encode global context. The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea. Evaluation results on this dataset demonstrate significant improvement in depth prediction accuracy by the fusion of the sparse feature priors. In addition, without any retraining, our method achieves similar depth prediction accuracy on a downward looking dataset we collected with a diver operated camera rig, conducting a survey of a coral reef. The method achieves real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single CPU core and is suitable for direct deployment on embedded systems. The implementation of this work is made publicly available at https://github.com/ebnerluca/uw_depth.
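A simple way to turn sparse triangulated depth measurements into dense prior channels that can be concatenated with the RGB input is sketched below, using a nearest-valid-measurement fill plus a distance map; this is an assumption standing in for the paper's dense parameterization method.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def densify_sparse_depth(sparse_depth: np.ndarray):
    """sparse_depth: (H, W) metric depth at triangulated features, 0 elsewhere.
    Returns a dense prior (nearest measurement at each pixel) and a distance map
    telling the network how far each pixel is from a real measurement."""
    valid = sparse_depth > 0
    dist, indices = distance_transform_edt(~valid, return_indices=True)
    iy, ix = indices
    dense_prior = sparse_depth[iy, ix]
    return dense_prior, dist

sparse = np.zeros((4, 4))
sparse[1, 1], sparse[2, 3] = 2.5, 4.0     # two triangulated feature depths
prior, dist = densify_sparse_depth(sparse)
print(prior)
```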

Interferometric Neural Networks

  • paper_url: http://arxiv.org/abs/2310.16742
  • repo_url: https://github.com/arunsehrawat/interferometric-neural-networks
  • paper_authors: Arun Sehrawat
  • for: Combining artificial neural networks with interferometers and applying the result to machine learning and optimization.
  • methods: Neural networks composed of interferometers, and generative adversarial networks built from them; the networks contain no classical layers and can be realized on quantum computers or photonic chips.
  • results: The approach consistently converges to (or near) the global optimum on combinatorial optimization problems, reaches accuracies of 93% and 83% on multi-class image classification tasks, and generates images of the digits 0-9 as well as human faces.
    Abstract On the one hand, artificial neural networks have many successful applications in the field of machine learning and optimization. On the other hand, interferometers are integral parts of any field that deals with waves such as optics, astronomy, and quantum physics. Here, we introduce neural networks composed of interferometers and then build generative adversarial networks from them. Our networks do not have any classical layer and can be realized on quantum computers or photonic chips. We demonstrate their applicability for combinatorial optimization, image classification, and image generation. For combinatorial optimization, our network consistently converges to the global optimum or remains within a narrow range of it. In multi-class image classification tasks, our networks achieve accuracies of 93% and 83%. Lastly, we show their capability to generate images of digits from 0 to 9 as well as human faces.

A No-Reference Quality Assessment Method for Digital Human Head

  • paper_url: http://arxiv.org/abs/2310.16732
  • repo_url: None
  • paper_authors: Yingjie Zhou, Zicheng Zhang, Wei Sun, Xiongkuo Min, Xianghe Ma, Guangtao Zhai
  • for: Digital human quality assessment (DHQA); the goal is a Transformer-based no-reference assessment method that addresses the distortions and quality degradation digital humans may suffer during generation and transmission.
  • methods: Front 2D projections of the digital humans are rendered as inputs, a vision transformer (ViT) extracts features, and a multi-task module jointly classifies distortion types and predicts perceived quality levels.
  • results: Experiments show the proposed method correlates well with subjective ratings and outperforms state-of-the-art quality assessment methods.
    Abstract In recent years, digital humans have been widely applied in augmented/virtual reality (A/VR), where viewers are allowed to freely observe and interact with the volumetric content. However, the digital humans may be degraded with various distortions during the procedure of generation and transmission. Moreover, little effort has been put into the perceptual quality assessment of digital humans. Therefore, it is urgent to carry out objective quality assessment methods to tackle the challenge of digital human quality assessment (DHQA). In this paper, we develop a novel no-reference (NR) method based on Transformer to deal with DHQA in a multi-task manner. Specifically, the front 2D projections of the digital humans are rendered as inputs and the vision transformer (ViT) is employed for the feature extraction. Then we design a multi-task module to jointly classify the distortion types and predict the perceptual quality levels of digital humans. The experimental results show that the proposed method well correlates with the subjective ratings and outperforms the state-of-the-art quality assessment methods.
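    Code sketch: a minimal multi-task head of the kind described above, with one branch classifying distortion types and one regressing perceived quality on top of shared ViT features. The backbone is omitted, and all dimensions and the joint loss are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiTaskQualityHead(nn.Module):
    """Shared backbone feature -> (distortion-type logits, quality score)."""
    def __init__(self, feat_dim=768, num_distortions=6):
        super().__init__()
        self.distortion_head = nn.Linear(feat_dim, num_distortions)
        self.quality_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.GELU(), nn.Linear(128, 1))

    def forward(self, feats):
        return self.distortion_head(feats), self.quality_head(feats).squeeze(-1)

# Hypothetical joint loss: cross-entropy for distortion type + L1 for quality.
feats = torch.randn(8, 768)                 # stand-in for ViT [CLS] features
labels = torch.randint(0, 6, (8,))          # assumed distortion-type labels
mos = torch.rand(8)                         # mean opinion scores in [0, 1]
head = MultiTaskQualityHead()
logits, pred_q = head(feats)
loss = nn.functional.cross_entropy(logits, labels) + nn.functional.l1_loss(pred_q, mos)
loss.backward()
```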

Rebuild City Buildings from Off-Nadir Aerial Images with Offset-Building Model (OBM)

  • paper_url: http://arxiv.org/abs/2310.16717
  • repo_url: None
  • paper_authors: Kai Li, Yupeng Deng, Yunlong Kong, Diyou Liu, Jingbo Chen, Yu Meng, Junxian Ma
  • for: To propose an interactive Transformer model for precisely measuring building offsets in very-high-resolution remote sensing imagery.
  • methods: An interactive Transformer model combined with a prompt encoder extracts building segmentation and roof-to-footprint offsets; a ROAM module addresses common problems in predicting building offsets.
  • results: On the publicly available BONAI dataset, the model reduces prompt-instance-level offset errors by 14.6% to 16.3%; a Distance-NMS algorithm tailored to large-scale building offsets further improves the accuracy of predicted offset angles and lengths.
    Abstract Accurate measurement of the offset from roof-to-footprint in very-high-resolution remote sensing imagery is crucial for urban information extraction tasks. With the help of deep learning, existing methods typically rely on two-stage CNN models to extract regions of interest on building feature maps. At the first stage, a Region Proposal Network (RPN) is applied to extract thousands of ROIs (Region of Interests) which will post-imported into a Region-based Convolutional Neural Networks (RCNN) to extract wanted information. However, because of inflexible RPN, these methods often lack effective user interaction, encounter difficulties in instance correspondence, and struggle to keep up with the advancements in general artificial intelligence. This paper introduces an interactive Transformer model combined with a prompt encoder to precisely extract building segmentation as well as the offset vectors from roofs to footprints. In our model, a powerful module, namely ROAM, was tailored for common problems in predicting roof-to-footprint offsets. We tested our model's feasibility on the publicly available BONAI dataset, achieving a significant reduction in Prompt-Instance-Level offset errors ranging from 14.6% to 16.3%. Additionally, we developed a Distance-NMS algorithm tailored for large-scale building offsets, significantly enhancing the accuracy of predicted building offset angles and lengths in a straightforward and efficient manner. To further validate the model's robustness, we created a new test set using 0.5m remote sensing imagery from Huizhou, China, for inference testing. Our code, training methods, and the updated dataset will be accessable at https://github.com/likaiucas.
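    Code sketch: the abstract mentions a Distance-NMS algorithm for large-scale building offsets but does not specify it; the snippet below is an assumed, generic distance-based suppression over predicted roof centers and offset vectors, offered only to illustrate the idea.

```python
import numpy as np

def distance_nms(centers, offsets, scores, radius=20.0):
    """Greedy distance-based suppression: among predictions whose centers lie
    within `radius` pixels of a higher-scoring prediction, keep only the best.
    A generic variant for illustration; the paper's Distance-NMS may differ."""
    order = np.argsort(-scores)
    keep, suppressed = [], np.zeros(len(scores), dtype=bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        d = np.linalg.norm(centers - centers[i], axis=1)
        suppressed |= (d < radius)              # also marks i itself
    return centers[keep], offsets[keep], scores[keep]

# Hypothetical usage with random roof centers and roof-to-footprint offsets.
rng = np.random.default_rng(1)
centers = rng.uniform(0, 512, size=(100, 2))
offsets = rng.normal(0, 15, size=(100, 2))      # (dx, dy) roof-to-footprint vectors
scores = rng.random(100)
c, o, s = distance_nms(centers, offsets, scores)
print(len(c), "buildings kept after suppression")
```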

Nighttime Driver Behavior Prediction Using Taillight Signal Recognition via CNN-SVM Classifier

  • paper_url: http://arxiv.org/abs/2310.16706
  • repo_url: https://github.com/deepcar/taillight_recognition
  • paper_authors: Amir Hossein Barshooi, Elmira Bagheri
  • for: To improve nighttime driving-behavior prediction by recognizing the taillights of both human-driven and autonomous vehicles.
  • methods: A customized detector with a learnable pre-processing block extracts deep features from input images and computes data rarity for each feature; a soft-attention-inspired weighted binary mask guides the model toward predetermined regions; convolutional neural networks (CNNs) then extract features whose dimensionality is reduced with principal component analysis (PCA) before an SVM predicts vehicle behavior.
  • results: The method classifies nighttime driving behavior with 92.14% accuracy, 97.38% specificity, 92.09% sensitivity, a 92.10% F1-measure, and a Cohen's Kappa of 0.895.
    Abstract This paper aims to enhance the ability to predict nighttime driving behavior by identifying taillights of both human-driven and autonomous vehicles. The proposed model incorporates a customized detector designed to accurately detect front-vehicle taillights on the road. At the beginning of the detector, a learnable pre-processing block is implemented, which extracts deep features from input images and calculates the data rarity for each feature. In the next step, drawing inspiration from soft attention, a weighted binary mask is designed that guides the model to focus more on predetermined regions. This research utilizes Convolutional Neural Networks (CNNs) to extract distinguishing characteristics from these areas, then reduces dimensions using Principal Component Analysis (PCA). Finally, the Support Vector Machine (SVM) is used to predict the behavior of the vehicles. To train and evaluate the model, a large-scale dataset is collected from two types of dash-cams and Insta360 cameras from the rear view of Ford Motor Company vehicles. This dataset includes over 12k frames captured during both daytime and nighttime hours. To address the limited nighttime data, a unique pixel-wise image processing technique is implemented to convert daytime images into realistic night images. The findings from the experiments demonstrate that the proposed methodology can accurately categorize vehicle behavior with 92.14% accuracy, 97.38% specificity, 92.09% sensitivity, 92.10% F1-measure, and 0.895 Cohen's Kappa Statistic. Further details are available at https://github.com/DeepCar/Taillight_Recognition.
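    Code sketch: the classification tail described above (deep features reduced with PCA, then classified by an SVM) can be reproduced with standard scikit-learn components. The features, labels, and class set below are synthetic stand-ins, not the paper's data or trained CNN.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Stand-in for deep CNN features of attention-masked taillight crops.
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 512))            # 600 samples, 512-D features
labels = rng.integers(0, 3, size=600)             # assumed classes, e.g. brake / turn / none

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)

# PCA for dimensionality reduction followed by an SVM classifier,
# mirroring the CNN -> PCA -> SVM tail described in the abstract.
clf = make_pipeline(PCA(n_components=64), SVC(kernel="rbf", C=1.0))
clf.fit(x_tr, y_tr)
print("held-out accuracy on synthetic features:", clf.score(x_te, y_te))
```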

From Pointwise to Powerhouse: Initialising Neural Networks with Generative Models

  • paper_url: http://arxiv.org/abs/2310.16695
  • repo_url: None
  • paper_authors: Christian Harder, Moritz Fuchs, Yuri Tolkach, Anirban Mukhopadhyay
  • for: To propose new initialisation methods that address vanishing or exploding gradients in deep neural networks.
  • methods: Generative models are used to initialise neural networks, including variational autoencoders (VAEs) and graph hypernetworks (GHNs).
  • results: Global initialisation improves accuracy and initial convergence speed, but the GHN-based implementation reduces ensemble performance on out-of-distribution data; a noise graph hypernetwork is proposed to encourage diversity, and the new initialisation methods may also transfer learned knowledge to different image distributions.
    Abstract Traditional initialisation methods, e.g. He and Xavier, have been effective in avoiding the problem of vanishing or exploding gradients in neural networks. However, they only use simple pointwise distributions, which model one-dimensional variables. Moreover, they ignore most information about the architecture and disregard past training experiences. These limitations can be overcome by employing generative models for initialisation. In this paper, we introduce two groups of new initialisation methods. First, we locally initialise weight groups by employing variational autoencoders. Secondly, we globally initialise full weight sets by employing graph hypernetworks. We thoroughly evaluate the impact of the employed generative models on state-of-the-art neural networks in terms of accuracy, convergence speed and ensembling. Our results show that global initialisations result in higher accuracy and faster initial convergence speed. However, the implementation through graph hypernetworks leads to diminished ensemble performance on out of distribution data. To counteract, we propose a modification called noise graph hypernetwork, which encourages diversity in the produced ensemble members. Furthermore, our approach might be able to transfer learned knowledge to different image distributions. Our work provides insights into the potential, the trade-offs and possible modifications of these new initialisation methods.

DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

  • paper_url: http://arxiv.org/abs/2310.16694
  • repo_url: None
  • paper_authors: Yuejun Jiao, Song Qiu, Mingsong Chen, Dingding Han, Qingli Li, Yue Lu
  • for: To improve the accuracy of vehicle re-identification (Re-ID) in intelligent transportation systems, with applications in assisted driving, traffic flow management, and vehicle tracking.
  • methods: A graph network based on dynamic similarity adjacency matrices (DSAM-GN) with a novel adjacency-matrix construction that captures spatial relationships of local features and reduces background noise.
  • results: Experiments show the proposed method is more effective than existing approaches at improving vehicle Re-ID accuracy.
    Abstract In recent years, vehicle re-identification (Re-ID) has gained increasing importance in various applications such as assisted driving systems, traffic flow management, and vehicle tracking, due to the growth of intelligent transportation systems. However, the presence of extraneous background information and occlusions can interfere with the learning of discriminative features, leading to significant variations in the same vehicle image across different scenarios. This paper proposes a method, named graph network based on dynamic similarity adjacency matrices (DSAM-GN), which incorporates a novel approach for constructing adjacency matrices to capture spatial relationships of local features and reduce background noise. Specifically, the proposed method divides the extracted vehicle features into different patches as nodes within the graph network. A spatial attention-based similarity adjacency matrix generation (SASAMG) module is employed to compute similarity matrices of nodes, and a dynamic erasure operation is applied to disconnect nodes with low similarity, resulting in similarity adjacency matrices. Finally, the nodes and similarity adjacency matrices are fed into graph networks to extract more discriminative features for vehicle Re-ID. Experimental results on public datasets VeRi-776 and VehicleID demonstrate the effectiveness of the proposed method compared with recent works.
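    Code sketch: a minimal version of the dynamic similarity adjacency idea, building a cosine-similarity adjacency over patch nodes, erasing low-similarity edges with a per-node threshold, and running one step of message passing. The spatial-attention component of SASAMG is omitted; thresholds and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_adjacency(node_feats, keep_ratio=0.5):
    """Build a similarity adjacency matrix between patch nodes and erase
    low-similarity edges via a per-node dynamic threshold."""
    sim = F.cosine_similarity(node_feats.unsqueeze(1), node_feats.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * sim.size(1)))
    thresh = torch.topk(sim, k, dim=1).values[:, -1:]      # k-th largest similarity per node
    adj = torch.where(sim >= thresh, sim, torch.zeros_like(sim))
    return adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)

nodes = torch.randn(16, 256)           # 16 feature patches of one vehicle image
adj = similarity_adjacency(nodes)
refined = adj @ nodes                  # one step of graph message passing
print(adj.shape, refined.shape)        # torch.Size([16, 16]) torch.Size([16, 256])
```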

Local Statistics for Generative Image Detection

  • paper_url: http://arxiv.org/abs/2310.16684
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Yung Jer Wong, Teck Khim Ng
  • for: To distinguish digital camera images from images generated by diffusion models (DMs), addressing the potential misuse of synthesized images as DMs become capable of photorealistic generation and super-resolution.
  • methods: Local statistics, rather than global statistics, are computed to identify DM-generated images; local statistics are argued to address the spatial non-stationarity of images.
  • results: The approach produces promising results and is robust to perturbations such as image resizing and JPEG compression.
    Abstract Diffusion models (DMs) are generative models that learn to synthesize images from Gaussian noise. DMs can be trained to do a variety of tasks such as image generation and image super-resolution. Researchers have made significant improvement in the capability of synthesizing photorealistic images in the past few years. These successes also hasten the need to address the potential misuse of synthesized images. In this paper, we highlight the effectiveness of computing local statistics, as opposed to global statistics, in distinguishing digital camera images from DM-generated images. We hypothesized that local statistics should be used to address the spatial non-stationarity problem in images. We show that our approach produced promising results and it is also robust to various perturbations such as image resizing and JPEG compression.
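    Code sketch: a simple, assumed notion of "local statistics", per-patch moments computed over non-overlapping windows, which could feed a downstream detector of DM-generated images. The paper's exact statistics and patch configuration may differ.

```python
import numpy as np

def local_statistics(image, patch=32):
    """Split a grayscale image into non-overlapping patches and return simple
    per-patch statistics (mean, variance, unnormalised third moment)."""
    h, w = image.shape
    feats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = image[y:y + patch, x:x + patch].astype(np.float64)
            mu, var = p.mean(), p.var()
            third = np.mean((p - mu) ** 3)
            feats.append((mu, var, third))
    return np.asarray(feats)                          # (num_patches, 3)

img = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
print(local_statistics(img).shape)                    # (64, 3)
```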

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2310.16667
  • repo_url: https://github.com/cvmi-lab/codet
  • paper_authors: Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
  • for: Learning object-level vision-language representations for open-vocabulary object detection, where deriving reliable region-word alignment from image-text pairs is critical.
  • methods: CoDet reformulates region-word alignment as a co-occurring object discovery problem, using visual similarities across images that mention a shared concept in their captions to discover and align the co-occurring objects.
  • results: CoDet shows superior performance and compelling scalability in open-vocabulary detection; with a scaled-up visual backbone it reaches 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$.
    Abstract Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Robust Source-Free Domain Adaptation for Fundus Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.16665
  • repo_url: https://github.com/LinGrayy/PLPB
  • paper_authors: Lingrui Li, Yanfeng Zhou, Ge Yang
  • for: This paper focuses on improving the robustness of unsupervised domain adaptation (UDA) techniques for medical image segmentation, specifically fundus image segmentation.
  • methods: The proposed method consists of two stages: (1) source training with adversarial sample augmentation to enhance the robustness and generalization capability of the source model, and (2) target training with a novel robust pseudo-label and pseudo-boundary (PLPB) method that utilizes unlabeled target data to generate pseudo labels and pseudo boundaries for self-adaptation without source data.
  • results: Extensive experimental results on cross-domain fundus image segmentation confirm the effectiveness and versatility of the proposed method, demonstrating improved accuracy and robustness compared to existing UDA techniques.
    Abstract Unsupervised Domain Adaptation (UDA) is a learning technique that transfers knowledge learned in the source domain from labelled training data to the target domain with only unlabelled data. It is of significant importance to medical image segmentation because of the usual lack of labelled training data. Although extensive efforts have been made to optimize UDA techniques to improve the accuracy of segmentation models in the target domain, few studies have addressed the robustness of these models under UDA. In this study, we propose a two-stage training strategy for robust domain adaptation. In the source training stage, we utilize adversarial sample augmentation to enhance the robustness and generalization capability of the source model. And in the target training stage, we propose a novel robust pseudo-label and pseudo-boundary (PLPB) method, which effectively utilizes unlabeled target data to generate pseudo labels and pseudo boundaries that enable model self-adaptation without requiring source data. Extensive experimental results on cross-domain fundus image segmentation confirm the effectiveness and versatility of our method. Source code of this study is openly accessible at https://github.com/LinGrayy/PLPB.
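    Code sketch: a generic way to produce pseudo labels and pseudo boundaries from target-domain predictions by confidence thresholding, in the spirit of the PLPB stage; the thresholds and boundary rule here are assumptions, not the authors' exact procedure (see their repository).

```python
import numpy as np
from scipy.ndimage import binary_erosion

def pseudo_label_and_boundary(prob_map, fg_thresh=0.75, bg_thresh=0.25):
    """Keep only confident pixels of a target-domain probability map as a
    pseudo label and derive a pseudo boundary as the label's inner contour."""
    pseudo = np.full(prob_map.shape, 255, dtype=np.uint8)       # 255 = ignore
    pseudo[prob_map >= fg_thresh] = 1                           # confident foreground
    pseudo[prob_map <= bg_thresh] = 0                           # confident background
    fg = pseudo == 1
    boundary = fg & ~binary_erosion(fg, iterations=1)
    return pseudo, boundary.astype(np.uint8)

probs = np.random.rand(128, 128)              # stand-in for model output on target data
label, edge = pseudo_label_and_boundary(probs)
print((label == 255).mean(), edge.sum())      # fraction ignored, boundary pixel count
```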

MACP: Efficient Model Adaptation for Cooperative Perception

  • paper_url: http://arxiv.org/abs/2310.16870
  • repo_url: https://github.com/purduedigitaltwin/macp
  • paper_authors: Yunsheng Ma, Juanwu Lu, Can Cui, Sicheng ZHao, Xu Cao, Wenqian Ye, Ziran Wang
  • for: To enhance the perception of connected and automated vehicles (CAVs) so they can "see through occlusions" and improve performance.
  • methods: A single-agent pre-trained model is equipped with cooperation capabilities.
  • results: In both simulated and real-world cooperative perception benchmarks, the method effectively exploits cooperative observations and outperforms other state-of-the-art approaches while requiring far fewer tunable parameters and lower communication costs.
    Abstract Vehicle-to-vehicle (V2V) communications have greatly enhanced the perception capabilities of connected and automated vehicles (CAVs) by enabling information sharing to "see through the occlusions", resulting in significant performance improvements. However, developing and training complex multi-agent perception models from scratch can be expensive and unnecessary when existing single-agent models show remarkable generalization capabilities. In this paper, we propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities. We approach this objective by identifying the key challenges of shifting from single-agent to cooperative settings, adapting the model by freezing most of its parameters and adding a few lightweight modules. We demonstrate in our experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches in both simulated and real-world cooperative perception benchmarks while requiring substantially fewer tunable parameters with reduced communication costs. Our source code is available at https://github.com/PurdueDigitalTwin/MACP.
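    Code sketch: the adaptation recipe described above, freezing most parameters of a pre-trained single-agent model and training a few lightweight modules, in a minimal form. The bottleneck adapter and all sizes are hypothetical placeholders, not MACP's actual modules.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a frozen block; only these few
    parameters are trained."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))    # residual update

backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False                              # freeze the single-agent model

adapter = Adapter(256)
trainable = [p for p in adapter.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

x = torch.randn(4, 256)
out = adapter(backbone(x))                               # frozen features, adapted output
```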

Deep Learning Techniques for Cervical Cancer Diagnosis based on Pathology and Colposcopy Images

  • paper_url: http://arxiv.org/abs/2310.16662
  • repo_url: None
  • paper_authors: Hana Ahmadzadeh Sarhangi, Dorsa Beigifard, Elahe Farmani, Hamidreza Bolhasani
  • for: To explore the potential applications of deep learning for cervical cancer screening and diagnosis, improving diagnostic accuracy and efficiency.
  • methods: Deep learning techniques are reviewed for classification, segmentation, and detection tasks aimed at improving the precision and efficiency of cervical cancer screening and diagnosis.
  • results: Deep learning can improve the precision and efficiency of cervical cancer screening and diagnosis and reduce the impact of human error.
    Abstract Cervical cancer is a prevalent disease affecting millions of women worldwide every year. It requires significant attention, as early detection during the precancerous stage provides an opportunity for a cure. The screening and diagnosis of cervical cancer rely on cytology and colposcopy methods. Deep learning, a promising technology in computer vision, has emerged as a potential solution to improve the accuracy and efficiency of cervical cancer screening compared to traditional clinical inspection methods that are prone to human error. This review article discusses cervical cancer and its screening processes, followed by the Deep Learning training process and the classification, segmentation, and detection tasks for cervical cancer diagnosis. Additionally, we explored the most common public datasets used in both cytology and colposcopy and highlighted the popular and most utilized architectures that researchers have applied to both cytology and colposcopy. We reviewed 24 selected practical papers in this study and summarized them. This article highlights the remarkable efficiency in enhancing the precision and speed of cervical cancer analysis by Deep Learning, bringing us closer to early diagnosis and saving lives.

EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.16640
  • repo_url: https://github.com/nickyfot/emoclip
  • paper_authors: Niki Maria Foteinopoulou, Ioannis Patras
  • for: To improve recognition accuracy in dynamic facial expression recognition (FER) and extend it to a broader emotional spectrum.
  • methods: A novel vision-language model that uses sample-level text descriptions (captions of the context, expressions, or emotional cues) as natural language supervision to help the network learn richer latent representations.
  • results: Supervision with sample-level descriptions yields significant improvements in zero-shot classification, e.g., more than 10% higher Weighted Average Recall and 5% higher Unweighted Average Recall than CLIP on several datasets; on the downstream task of mental health symptom estimation the learned representations perform comparably to or better than existing methods.
    Abstract Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10\% in terms of Weighted Average Recall and 5\% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts' agreement. The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
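    Code sketch: the zero-shot classification step implied above, scoring a video embedding against one text embedding per emotion class with temperature-scaled cosine similarity. The encoders are omitted and the embeddings are random stand-ins; see the authors' repository for the real model.

```python
import torch
import torch.nn.functional as F

def zero_shot_logits(video_emb, class_text_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: cosine similarity between a video
    embedding and one text embedding per class, scaled by a temperature."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return v @ t.T / temperature

video = torch.randn(2, 512)            # stand-in for video encoder outputs
texts = torch.randn(7, 512)            # one description embedding per emotion class
probs = zero_shot_logits(video, texts).softmax(dim=-1)
print(probs.shape, probs.sum(dim=-1))  # torch.Size([2, 7]), rows sum to 1
```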

Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving

  • paper_url: http://arxiv.org/abs/2310.16639
  • repo_url: https://github.com/jessicamecht/concept_gridlock
  • paper_authors: Jessica Echterhoff, An Yan, Kyungtae Han, Amr Abdelraouf, Rohit Gupta, Julian McAuley
  • for: To propose an explainable machine learning approach based on concept bottleneck models for explaining the decisions and behavior of autonomous vehicles.
  • methods: Concept bottleneck models encode human-defined concepts within the model to achieve explainable machine learning.
  • results: A new approach uses concept bottlenecks as visual features to predict and explain user and vehicle behavior, achieving performance competitive with latent visual features.
    Abstract Concept bottleneck models have been successfully used for explainable machine learning by encoding information within the model with a set of human-defined concepts. In the context of human-assisted or autonomous driving, explainability models can help user acceptance and understanding of decisions made by the autonomous vehicle, which can be used to rationalize and explain driver or vehicle behavior. We propose a new approach using concept bottlenecks as visual features for control command predictions and explanations of user and vehicle behavior. We learn a human-understandable concept layer that we use to explain sequential driving scenes while learning vehicle control commands. This approach can then be used to determine whether a change in a preferred gap or steering commands from a human (or autonomous vehicle) is led by an external stimulus or change in preferences. We achieve competitive performance to latent visual features while gaining interpretability within our model setup.
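    Code sketch: a minimal concept bottleneck, where visual features are mapped to human-defined concept predictions and the control output is computed only from those concepts. Concept names, dimensions, and the two-dimensional control output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Visual features -> human-defined concept predictions -> control output."""
    def __init__(self, feat_dim=512, num_concepts=12):
        super().__init__()
        self.concepts = nn.Linear(feat_dim, num_concepts)   # e.g. "lead vehicle close", "red light"
        self.control = nn.Linear(num_concepts, 2)           # assumed: gap preference, steering

    def forward(self, feats):
        c = torch.sigmoid(self.concepts(feats))             # interpretable bottleneck
        return c, self.control(c)

model = ConceptBottleneck()
concept_probs, controls = model(torch.randn(4, 512))
print(concept_probs.shape, controls.shape)    # torch.Size([4, 12]) torch.Size([4, 2])
```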

EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless LiDAR-Camera Calibration

  • paper_url: http://arxiv.org/abs/2310.16629
  • repo_url: None
  • paper_authors: Xingchen Li, Yifan Duan, Beibei Wang, Haojie Ren, Guoliang You, Yu Sheng, Jianmin Ji, Yanyong Zhang
  • for: To automate extrinsic calibration between LiDAR and camera, enabling high-accuracy multimodal perception systems in real-world scenarios.
  • methods: An edge-feature-based automatic pipeline: stable and reliable edge features are extracted from both images and point clouds, a multi-frame weighting strategy filters the edge features, and the extrinsic parameters are optimized under edge-correspondence constraints.
  • results: Evaluated on the KITTI dataset and the authors' own dataset, the method achieves a rotation accuracy of 0.086° and a translation accuracy of 0.977 cm, surpassing existing edge-based calibration methods.
    Abstract In multimodal perception systems, achieving precise extrinsic calibration between LiDAR and camera is of critical importance. Previous calibration methods often required specific targets or manual adjustments, making them both labor-intensive and costly. Online calibration methods based on features have been proposed, but these methods encounter challenges such as imprecise feature extraction, unreliable cross-modality associations, and high scene-specific requirements. To address this, we introduce an edge-based approach for automatic online calibration of LiDAR and cameras in real-world scenarios. The edge features, which are prevalent in various environments, are aligned in both images and point clouds to determine the extrinsic parameters. Specifically, stable and robust image edge features are extracted using a SAM-based method and the edge features extracted from the point cloud are weighted through a multi-frame weighting strategy for feature filtering. Finally, accurate extrinsic parameters are optimized based on edge correspondence constraints. We conducted evaluations on both the KITTI dataset and our dataset. The results show a state-of-the-art rotation accuracy of 0.086° and a translation accuracy of 0.977 cm, outperforming existing edge-based calibration methods in both precision and robustness.

Real-time 6-DoF Pose Estimation by an Event-based Camera using Active LED Markers

  • paper_url: http://arxiv.org/abs/2310.16618
  • repo_url: None
  • paper_authors: Gerald Ebmer, Adam Loch, Minh Nhat Vu, Germain Haessig, Roberto Mecca, Markus Vincze, Christian Hartl-Nesic, Andreas Kugi
  • for: To propose a simple yet effective event-based pose estimation system for fast and accurate pose estimation.
  • methods: Active LED markers (ALM) are combined with an event-based pose estimation algorithm that runs in real time and maintains accuracy under challenging visual conditions.
  • results: Experiments in static and dynamic scenarios show that the method runs in real time with high computational speed and accuracy.
    Abstract Real-time applications for autonomous operations depend largely on fast and robust vision-based localization systems. Since image processing tasks require processing large amounts of data, the computational resources often limit the performance of other processes. To overcome this limitation, traditional marker-based localization systems are widely used since they are easy to integrate and achieve reliable accuracy. However, classical marker-based localization systems significantly depend on standard cameras with low frame rates, which often lack accuracy due to motion blur. In contrast, event-based cameras provide high temporal resolution and a high dynamic range, which can be utilized for fast localization tasks, even under challenging visual conditions. This paper proposes a simple but effective event-based pose estimation system using active LED markers (ALM) for fast and accurate pose estimation. The proposed algorithm is able to operate in real time with a latency below 0.5 ms while maintaining output rates of 3 kHz. Experimental results in static and dynamic scenarios are presented to demonstrate the performance of the proposed approach in terms of computational speed and absolute accuracy, using the OptiTrack system as the basis for measurement.

$\mathbb{VD}$-$\mathbb{GR}$: Boosting $\mathbb{V}$isual $\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal $\mathbb{GR}$aphs

  • paper_url: http://arxiv.org/abs/2310.16590
  • repo_url: None
  • paper_authors: Adnen Abdessaied, Lei Shi, Andreas Bulling
  • for: A new visual dialog model ($\mathbb{VD}$-$\mathbb{GR}$) that combines pre-trained language models (LMs) with graph neural networks (GNNs) to exploit the strengths of both.
  • methods: Multi-modal GNNs process the features of each modality and exploit local structure before BERT global attention; hub nodes connected to every node within a modality graph let the model propagate information across modalities.
  • results: Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that $\mathbb{VD}$-$\mathbb{GR}$ achieves new state-of-the-art results on all four datasets.
    Abstract We propose $\mathbb{VD}$-$\mathbb{GR}$ - a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class of models at the expense of the other, thus missing out on the opportunity of combining their respective benefits. At the core of $\mathbb{VD}$-$\mathbb{GR}$ is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers, and that covers three distinct contributions: First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the other in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next $\mathbb{VD}$-$\mathbb{GR}$ layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that $\mathbb{VD}$-$\mathbb{GR}$ achieves new state-of-the-art results across all four datasets.

Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask Detection

  • paper_url: http://arxiv.org/abs/2310.16569
  • repo_url: https://github.com/josephcao0327/fasten
  • paper_authors: Yuxin Cao, Yian Li, Yumeng Zhu, Derui Wang, Minhui Xue
  • for: The paper aims to address the challenges of anti-spoofing detection in face recognition systems, specifically the generalizability insufficiency of deep-learning-based methods in 3D masks.
  • methods: The proposed method, FASTEN, is a novel 3D mask detection framework that leverages remote photoplethysmography (rPPG) technology and tailors a network for focusing on fine-grained details in large movements. The network consists of three key modules: a facial optical flow network, flow attention, and spatio-temporal aggregation.
  • results: FASTEN outperforms eight competitors in terms of multiple detection metrics, requiring only five frames of input. The proposed method has been deployed in real-world mobile devices for practical 3D mask detection.
    Abstract Anti-spoofing detection has become a necessity for face recognition systems due to the security threat posed by spoofing attacks. Despite great success in traditional attacks, most deep-learning-based methods perform poorly in 3D masks, which can highly simulate real faces in appearance and structure, suffering generalizability insufficiency while focusing only on the spatial domain with single frame input. This has been mitigated by the recent introduction of a biomedical technology called rPPG (remote photoplethysmography). However, rPPG-based methods are sensitive to noisy interference and require at least one second (> 25 frames) of observation time, which induces high computational overhead. To address these challenges, we propose a novel 3D mask detection framework, called FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the network for focusing more on fine-grained details in large movements, which can eliminate redundant spatio-temporal feature interference and quickly capture splicing traces of 3D masks in fewer frames. Our proposed network contains three key modules: 1) a facial optical flow network to obtain non-RGB inter-frame flow information; 2) flow attention to assign different significance to each frame; 3) spatio-temporal aggregation to aggregate high-level spatial features and temporal transition features. Through extensive experiments, FASTEN only requires five frames of input and outperforms eight competitors for both intra-dataset and cross-dataset evaluations in terms of multiple detection metrics. Moreover, FASTEN has been deployed in real-world mobile devices for practical 3D mask detection.

ParisLuco3D: A high-quality target dataset for domain generalization of LiDAR perception

  • paper_url: http://arxiv.org/abs/2310.16542
  • repo_url: None
  • paper_authors: Jules Sanchez, Louis Soum-Fontez, Jean-Emmanuel Deschaud, Francois Goulette
  • for: To provide a cross-domain evaluation dataset so that the performance of models trained on different source datasets can be assessed.
  • methods: A new evaluation approach enables fair comparison across different domains, together with a flexible online benchmark.
  • results: The new evaluation dataset helps researchers better assess the cross-domain performance of different source datasets.
    Abstract LiDAR is a sensor system that supports autonomous driving by gathering precise geometric information about the scene. Exploiting this information for perception is interesting as the amount of available data increases. As the quantitative performance of various perception tasks has improved, the focus has shifted from source-to-source perception to domain adaptation and domain generalization for perception. These new goals require access to a large variety of domains for evaluation. Unfortunately, the various annotation strategies of data providers complicate the computation of cross-domain performance based on the available data. This paper provides a novel dataset, specifically designed for cross-domain evaluation, to make it easier to evaluate the performance of various source datasets. Alongside the dataset, a flexible online benchmark is provided to ensure a fair comparison across methods.

Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking against Face Swapping

  • paper_url: http://arxiv.org/abs/2310.16540
  • repo_url: None
  • paper_authors: Yunming Zhang, Dengpan Ye, Caiyun Xie, Long Tang, Chuanxi Chen, Ziyi Liu, Jiacheng Deng
  • for: To defend against malicious applications of deep forgery such as face swapping, preventing misinformation dissemination and identity fraud.
  • methods: A new comprehensive active defense mechanism, Dual Defense, which combines traceability and adversariality to respond quickly to malicious use.
  • results: Experiments show that Dual Defense achieves optimal overall defense success rates and exhibits excellent generality and adversariality across different face datasets.
    Abstract The malicious applications of deep forgery, represented by face swapping, have introduced security threats such as misinformation dissemination and identity fraud. While some research has proposed the use of robust watermarking methods to trace the copyright of facial images for post-event traceability, these methods cannot effectively prevent the generation of forgeries at the source and curb their dissemination. To address this problem, we propose a novel comprehensive active defense mechanism that combines traceability and adversariality, called Dual Defense. Dual Defense invisibly embeds a single robust watermark within the target face to actively respond to sudden cases of malicious face swapping. It disrupts the output of the face swapping model while maintaining the integrity of watermark information throughout the entire dissemination process. This allows for watermark extraction at any stage of image tracking for traceability. Specifically, we introduce a watermark embedding network based on original-domain feature impersonation attack. This network learns robust adversarial features of target facial images and embeds watermarks, seeking a well-balanced trade-off between watermark invisibility, adversariality, and traceability through perceptual adversarial encoding strategies. Extensive experiments demonstrate that Dual Defense achieves optimal overall defense success rates and exhibits promising universality in anti-face swapping tasks and dataset generalization ability. It maintains impressive adversariality and traceability in both original and robust settings, surpassing current forgery defense methods that possess only one of these capabilities, including CMUA-Watermark, Anti-Forgery, FakeTagger, or PGD methods.

Learning Robust Deep Visual Representations from EEG Brain Recordings

  • paper_url: http://arxiv.org/abs/2310.16532
  • repo_url: https://github.com/prajwalsingh/eegstylegan-ada
  • paper_authors: Prajwal Singh, Dwip Dalal, Gautam Vashishtha, Krishna Miyapuram, Shanmuganathan Raman
  • for: This paper is written for researchers and scientists interested in brain-computer interfacing and reconstruction of visual images from brain Electroencephalography (EEG) signals.
  • methods: The paper proposes a two-stage method for image generation and classification using EEG signals: first, EEG-derived features are obtained for robust learning of deep representations; second, the learned representation is used for image generation and classification, with deep-learning architectures trained via supervised and contrastive learning.
  • results: The feature extraction pipeline generalizes across three different datasets, and the zero-shot EEG classification task further supports the generalizability claim. A subject-invariant, linearly separable visual representation learned from EEG data alone in an unimodal setting gives better k-means accuracy than joint representation learning between EEG and images. Finally, a novel framework transforms unseen images into the EEG space and reconstructs them with approximation; the proposed image synthesis method shows 62.9% and 36.13% inception score improvement on the EEGCVPR40 and Thoughtviz datasets, better than state-of-the-art GAN performance.
    Abstract Decoding the human brain has been a hallmark of neuroscientists and Artificial Intelligence researchers alike. Reconstruction of visual images from brain Electroencephalography (EEG) signals has garnered a lot of interest due to its applications in brain-computer interfacing. This study proposes a two-stage method where the first step is to obtain EEG-derived features for robust learning of deep representations and subsequently utilize the learned representation for image generation and classification. We demonstrate the generalizability of our feature extraction pipeline across three different datasets using deep-learning architectures with supervised and contrastive learning methods. We have performed the zero-shot EEG classification task to support the generalizability claim further. We observed that a subject invariant linearly separable visual representation was learned using EEG data alone in an unimodal setting that gives better k-means accuracy as compared to a joint representation learning between EEG and images. Finally, we propose a novel framework to transform unseen images into the EEG space and reconstruct them with approximation, showcasing the potential for image reconstruction from EEG signals. Our proposed image synthesis method from EEG shows 62.9% and 36.13% inception score improvement on the EEGCVPR40 and the Thoughtviz datasets, which is better than state-of-the-art performance in GAN.

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

  • paper_url: http://arxiv.org/abs/2310.16527
  • repo_url: None
  • paper_authors: Tofik Ali, Partha Pratim Roy
  • for: To develop a deep learning model tailored for document information analysis, covering document classification, entity relation extraction, and document visual question answering.
  • methods: A transformer-based model encodes all information in a document image, including textual, visual, and layout information; it is first pre-trained, with several additional tasks mixed into the pre-training phase, and then fine-tuned on multiple datasets for the different document image analysis tasks.
  • results: The model performs strongly across tasks: 95.87% accuracy for document classification on RVL-CDIP; F1 scores of 0.9306, 0.9804, 0.9794, and 0.8742 for entity relation extraction on FUNSD, CORD, SROIE, and Kleister-NDA; and an ANLS of 0.8468 for document visual question answering on DocVQA. The results show the model understands and interprets complex document layouts and content quickly and accurately, making it a promising tool for document analysis.
    Abstract This paper introduces a deep learning model tailored for document information analysis, emphasizing document classification, entity relation extraction, and document visual question answering. The proposed model leverages transformer-based models to encode all the information present in a document image, including textual, visual, and layout information. The model is pre-trained and subsequently fine-tuned for various document image analysis tasks. The proposed model incorporates three additional tasks during the pre-training phase, including reading order identification of different layout segments in a document image, layout segments categorization as per PubLayNet, and generation of the text sequence within a given layout segment (text block). The model also incorporates a collective pre-training scheme where losses of all the tasks under consideration, including pre-training and fine-tuning tasks with all datasets, are considered. Additional encoder and decoder blocks are added to the RoBERTa network to generate results for all tasks. The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification, F1 scores of 0.9306, 0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets respectively for entity relation extraction, and an ANLS score of 0.8468 on the DocVQA dataset for visual question answering. The results highlight the effectiveness of the proposed model in understanding and interpreting complex document layouts and content, making it a promising tool for document analysis tasks.

Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction

  • paper_url: http://arxiv.org/abs/2310.16494
  • repo_url: None
  • paper_authors: Sebastian Koch, Pedro Hermosilla, Narunas Vaskevicius, Mirco Colosi, Timo Ropinski
  • for: To improve learning of 3D scene graph models, since learning 3D scene graphs requires not only object labels but also relationship annotations, which are very scarce in datasets.
  • methods: A language-based pre-training approach distills knowledge from the language encoder of the vision-language model CLIP and contrastively aligns text embeddings of subject-predicate-object triplets with predicted 3D graph features.
  • results: The method achieves state-of-the-art results on the main semantic 3D scene graph benchmark, clearly outperforming pre-training baselines and all existing fully supervised scene graph prediction methods; because the scene graph features are language-aligned, they can also be queried in the language space zero-shot, which the paper demonstrates by predicting the room type of a scene.
    Abstract 3D scene graphs are an emerging 3D scene representation that models both the objects present in the scene as well as their relationships. However, learning 3D scene graphs is a challenging task because it requires not only object labels but also relationship annotations, which are very scarce in datasets. While it is widely accepted that pre-training is an effective approach to improve model performance in low data regimes, in this paper, we find that existing pre-training methods are ill-suited for 3D scene graphs. To solve this issue, we present the first language-based pre-training approach for 3D scene graphs, whereby we exploit the strong relationship between scene graphs and language. To this end, we leverage the language encoder of CLIP, a popular vision-language model, to distill its knowledge into our graph-based network. We formulate a contrastive pre-training, which aligns text embeddings of relationships (subject-predicate-object triplets) and predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark by showing improved effectiveness over pre-training baselines and outperforming all the existing fully supervised scene graph prediction methods by a significant margin. Furthermore, since our scene graph features are language-aligned, it allows us to query the language space of the features in a zero-shot manner. In this paper, we show an example of utilizing this property of the features to predict the room type of a scene without further training.
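    Code sketch: a generic symmetric InfoNCE objective aligning predicted 3D graph features with text embeddings of subject-predicate-object triplets, the kind of contrastive pre-training described above. Batch construction and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between predicted 3D graph edge features and text
    embeddings of their matching subject-predicate-object triplets."""
    g = F.normalize(graph_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = g @ t.T / temperature
    targets = torch.arange(g.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Hypothetical batch: each graph edge feature is paired with the text
# embedding of its "subject predicate object" sentence.
edges = torch.randn(32, 512, requires_grad=True)
texts = torch.randn(32, 512)
loss = contrastive_alignment_loss(edges, texts)
loss.backward()
print(float(loss))
```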

Gramian Attention Heads are Strong yet Efficient Vision Learners

  • paper_url: http://arxiv.org/abs/2310.16483
  • repo_url: https://github.com/lab-lvm/imagenet-models
  • paper_authors: Jongbin Ryu, Dongyoon Han, Jongwoo Lim
  • for: To propose a novel architecture design that enhances expressiveness.
  • methods: Multiple classification heads are used instead of channel expansion or additional building blocks; the heads use attention-based aggregation that exploits pairwise feature similarity to strengthen each lightweight head with minimal resource overhead.
  • results: The method surpasses existing CNN and ViT models on ImageNet-1K and performs strongly on multiple downstream tasks such as COCO object instance segmentation, ADE20k semantic segmentation, and fine-grained visual classification. Code publicly available at: https://github.com/Lab-LVM/imagenet-models.
    Abstract We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (\ie, classification heads) instead of relying on channel expansion or additional building blocks. Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead. We compute the Gramian matrices to reinforce class tokens in an attention layer for each head. This enables the heads to learn more discriminative representations, enhancing their aggregation capabilities. Furthermore, we propose a learning algorithm that encourages heads to complement each other by reducing correlation for aggregation. Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and deliver remarkable performance across various downstream tasks, such as COCO object instance segmentation, ADE20k semantic segmentation, and fine-grained visual classification datasets. The effectiveness of our framework is substantiated by practical experimental results and further underpinned by generalization error bound. We release the code publicly at: https://github.com/Lab-LVM/imagenet-models.
    摘要 我们提出了一种新的架构设计,通过多个分类头(即 classification heads)来提高表达能力,而不是依赖通道扩展或额外组件。我们的方法采用基于注意力的聚合,利用成对特征相似度以极小的资源开销增强多个轻量级分类头。我们为每个头在注意力层中计算 Gramian 矩阵以强化类别标记,使各头学习更具判别性的表示,从而提升其聚合能力。此外,我们提出了一种学习算法,通过降低各头之间的相关性来促使它们相互补充。我们的模型最终在 ImageNet-1K 上的精度-吞吐量权衡方面超越了最先进的 CNN 和 ViT,并在多个下游任务上表现出色,如 COCO 对象实例分割、ADE20k 语义分割和细粒度视觉分类数据集。我们的框架的有效性得到了实验结果的证明,并进一步得到泛化误差界的支持。我们在 GitHub 上公开了代码:https://github.com/Lab-LVM/imagenet-models。
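One plausible reading of the head design above, sketched in PyTorch: each lightweight head computes a Gram matrix of pairwise token similarities to reinforce its pooled class representation, and a penalty discourages the heads' outputs from being correlated so that they complement each other. Layer shapes, the pooling scheme, and the penalty form are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GramianHead(nn.Module):
    """Lightweight classification head: tokens are re-weighted by their
    pairwise similarity (Gram matrix) before attention pooling."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.pool_query = nn.Parameter(torch.zeros(1, 1, dim))  # class token
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:    # (B, N, D)
        d = tokens.size(-1)
        gram = torch.softmax(tokens @ tokens.transpose(1, 2) / d ** 0.5, dim=-1)
        tokens = gram @ self.proj(tokens)                        # similarity-reinforced tokens
        q = self.pool_query.expand(tokens.size(0), -1, -1)
        attn = torch.softmax(q @ tokens.transpose(1, 2) / d ** 0.5, dim=-1)
        pooled = (attn @ tokens).squeeze(1)                      # (B, D)
        return self.fc(pooled)

def decorrelation_penalty(head_logits):
    """Encourage heads to complement each other by penalizing pairwise
    correlation of their (batch-centered) outputs."""
    penalty, count = 0.0, 0
    for i in range(len(head_logits)):
        for j in range(i + 1, len(head_logits)):
            a = head_logits[i] - head_logits[i].mean(0, keepdim=True)
            b = head_logits[j] - head_logits[j].mean(0, keepdim=True)
            penalty = penalty + F.cosine_similarity(a.flatten(), b.flatten(), dim=0).abs()
            count += 1
    return penalty / max(count, 1)

if __name__ == "__main__":
    feats = torch.randn(4, 196, 256)                  # backbone tokens (B, N, D)
    heads = nn.ModuleList([GramianHead(256, 1000) for _ in range(3)])
    logits = [h(feats) for h in heads]
    prediction = torch.stack(logits).mean(0)          # simple aggregation of the heads
    print(prediction.shape, float(decorrelation_penalty(logits)))
```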

Show from Tell: Audio-Visual Modelling in Clinical Settings

  • paper_url: http://arxiv.org/abs/2310.16477
  • repo_url: None
  • paper_authors: Jianbo Jiao, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, Andrew Zisserman, J. Alison Noble
  • for: This work proposes audio-visual modelling in a clinical setting, aiming to learn medical representations that benefit clinical tasks without expert annotation.
  • methods: A simple yet effective multi-modal self-supervised learning framework is proposed; using only speech audio as a reference, it can localise anatomical regions of interest in ultrasound images.
  • results: Experiments on a large-scale clinical multi-modal ultrasound video dataset show that the proposed self-supervised method learns good transferable anatomical representations that boost automated downstream clinical tasks, even outperforming fully-supervised solutions trained with all available labels.
    Abstract Auditory and visual signals usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, the audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals -- usually speech. In this paper, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference. Experimental evaluations on a large-scale clinical multi-modal ultrasound video dataset show that the proposed self-supervised method learns good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions.
    摘要 听觉和视觉信号通常同时出现并相互关联,这不仅存在于自然环境中,也存在于临床环境中。然而,临床场景下的视听建模往往更加困难,主要因为听觉信号(通常为语音)中存在信号层面和语义层面的噪声,以及视听信号来源各不相同。在这篇论文中,我们研究了临床环境下的视听建模,提出一种无需人工专家标注即可学习有益于各类临床任务的医学表示的方案。我们提出了一种简单而有效的多模态自监督学习框架,仅以语音音频作为参照,即可在超声成像过程中定位感兴趣的解剖区域。我们在大规模临床多模态超声视频数据集上进行了实验评估,结果表明,该自监督方法能够学习到良好的可迁移解剖表示,提升下游自动化临床任务的性能,甚至超过完全监督的方案。
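A minimal sketch of how speech-referenced localisation can work under a self-supervised objective: score every spatial position of the ultrasound feature map against the speech embedding to obtain a heatmap, and train only with a clip-level contrastive loss over matching versus mismatched audio/frame pairs. The encoders, dimensions, and pooling choice are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn.functional as F

def localization_heatmap(audio_emb, visual_map):
    """audio_emb: (B, D); visual_map: (B, D, H, W) -> similarity heatmap (B, H, W)."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_map, dim=1)
    return torch.einsum("bd,bdhw->bhw", a, v)

def clip_level_contrastive_loss(audio_emb, visual_map, temperature=0.07):
    """Self-supervised objective: the max-pooled heatmap of a matching
    audio/frame pair should exceed that of mismatched pairs in the batch."""
    B = audio_emb.size(0)
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_map, dim=1)
    # similarity of every audio clip to every frame, max-pooled over space
    sim = torch.einsum("bd,cdhw->bchw", a, v).flatten(2).max(dim=2).values / temperature
    targets = torch.arange(B, device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

if __name__ == "__main__":
    audio_emb = torch.randn(4, 128)           # e.g. pooled output of a speech encoder
    visual_map = torch.randn(4, 128, 14, 14)  # e.g. CNN feature map of an ultrasound frame
    print(localization_heatmap(audio_emb, visual_map).shape)
    print(float(clip_level_contrastive_loss(audio_emb, visual_map)))
```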

DualMatch: Robust Semi-Supervised Learning with Dual-Level Interaction

  • paper_url: http://arxiv.org/abs/2310.16459
  • repo_url: https://github.com/cwangai/dualmatch
  • paper_authors: Cong Wang, Xiaofeng Cao, Lanzhe Guo, Zenglin Shi
  • for: 这 paper 的目的是提出一种新的 semi-supervised learning 方法,以便在标签不够的情况下利用无标签数据。
  • methods: 这 paper 使用了一种新的 dual-level 交互方法,即在jointly invoking feature embedding和class prediction的方式下进行学习。此外,它还需要一种consistent regularization,确保不同的数据扩展视图和不同的数据之间的feature embedding具有相似性。
  • results: 经验表明,这 paper 的提议可以在标准的 semi-supervised learning 设置下实现9%的错误减少,而在更复杂的类别不均衡设置下,仍可以实现6%的错误减少。
    Abstract Semi-supervised learning provides an expressive framework for exploiting unlabeled data when labels are insufficient. Previous semi-supervised learning methods typically match model predictions of different data-augmented views in a single-level interaction manner, which highly relies on the quality of pseudo-labels and results in semi-supervised learning not robust. In this paper, we propose a novel SSL method called DualMatch, in which the class prediction jointly invokes feature embedding in a dual-level interaction manner. DualMatch requires consistent regularizations for data augmentation, specifically, 1) ensuring that different augmented views are regulated with consistent class predictions, and 2) ensuring that different data of one class are regulated with similar feature embeddings. Extensive experiments demonstrate the effectiveness of DualMatch. In the standard SSL setting, the proposal achieves 9% error reduction compared with SOTA methods, even in a more challenging class-imbalanced setting, the proposal can still achieve 6% error reduction. Code is available at https://github.com/CWangAI/DualMatch
    摘要 semi-supervised learning 提供了一个表达性强的框架,可以将无标签数据作用到当 labels 不足时。先前的 semi-supervised learning 方法通常是将不同扩展的观点汇总在单一水平上进行汇总,这高度依赖 pseudo-label 的质量,从而导致 semi-supervised learning 不稳定。在这篇论文中,我们提出了一个新的 SSL 方法,叫做 DualMatch,这个方法在两个水平上进行汇总。DualMatch 需要一些一致的调整,特别是:1)确保不同扩展的观点汇总的类别预测是一致的,2)确保不同一个类别的数据是一致的。实验结果显示 DualMatch 的效果。在标准 SSL 设定下,我们的提案可以与 SOTA 方法相比,获得 9% 的错误减少,甚至在更加具体的类别偏见设定下,我们的提案仍然可以获得 6% 的错误减少。代码可以在 https://github.com/CWangAI/DualMatch 上取得。
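A compact sketch of the two dual-level regularisations described above: (1) class predictions of differently augmented views of an unlabeled image must agree (confident pseudo-labels from the weak view supervise the strong view), and (2) feature embeddings of samples sharing a pseudo-class are pulled toward their class centroid. The threshold, the centroid formulation, and all names are illustrative assumptions rather than DualMatch's exact losses:

```python
import torch
import torch.nn.functional as F

def prediction_consistency_loss(logits_weak, logits_strong, threshold=0.95):
    """Level 1: consistent class predictions across augmented views."""
    probs = torch.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = (conf >= threshold).float()
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

def feature_alignment_loss(embeddings, logits_weak, threshold=0.95):
    """Level 2: embeddings of one (pseudo-)class should be similar.
    Here: pull each confident sample toward its pseudo-class centroid."""
    probs = torch.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf >= threshold
    if keep.sum() < 2:
        return embeddings.new_zeros(())
    z = F.normalize(embeddings[keep], dim=1)
    labels = pseudo[keep]
    loss, n = z.new_zeros(()), 0
    for c in labels.unique():
        members = z[labels == c]
        if members.size(0) < 2:
            continue
        centroid = F.normalize(members.mean(0, keepdim=True), dim=1)
        loss = loss + (1.0 - (members * centroid).sum(dim=1)).mean()
        n += 1
    return loss / max(n, 1)

if __name__ == "__main__":
    logits_w = torch.randn(16, 10) * 5
    logits_s = torch.randn(16, 10, requires_grad=True)
    emb = torch.randn(16, 64, requires_grad=True)
    total = prediction_consistency_loss(logits_w, logits_s) + feature_alignment_loss(emb, logits_w)
    total.backward()
    print(float(total))
```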

ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors

  • paper_url: http://arxiv.org/abs/2310.16447
  • repo_url: https://github.com/shirleymaxx/chimpact
  • paper_authors: Xiaoxuan Ma, Stephan P. Kaufhold, Jiajun Su, Wentao Zhu, Jack Terwilliger, Andres Meza, Yixin Zhu, Federico Rossano, Yizhou Wang
  • for: 这个研究的目的是提高动物福祉,模拟社会行为,以及了解人类和其他动物之间的共同行为。
  • methods: 这个研究使用了视频数据,并对其进行了详细的标注和分类。
  • results: 这个研究提供了一个包含20多只黑猩猩的视频数据集,并对这些数据进行了详细的分析和研究,以深入了解黑猩猩的社会行为和通信方式。
    Abstract Understanding the behavior of non-human primates is crucial for improving animal welfare, modeling social behavior, and gaining insights into distinctively human and phylogenetically shared behaviors. However, the lack of datasets on non-human primate behavior hinders in-depth exploration of primate social interactions, posing challenges to research on our closest living relatives. To address these limitations, we present ChimpACT, a comprehensive dataset for quantifying the longitudinal behavior and social relations of chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, Germany, with a particular focus on documenting the developmental trajectory of one young male, Azibo. ChimpACT is both comprehensive and challenging, consisting of 163 videos with a cumulative 160,500 frames, each richly annotated with detection, identification, pose estimation, and fine-grained spatiotemporal behavior labels. We benchmark representative methods of three tracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and (iii) spatiotemporal action detection of the chimpanzees. Our experiments reveal that ChimpACT offers ample opportunities for both devising new methods and adapting existing ones to solve fundamental computer vision tasks applied to chimpanzee groups, such as detection, pose estimation, and behavior analysis, ultimately deepening our comprehension of communication and sociality in non-human primates.
    摘要 理解非人类灵长类动物的行为至关重要,有助于改善动物福祉、建模社会行为,并为人类特有以及种系共享的行为提供新的洞察。然而,由于非人类灵长类动物行为数据集的缺乏,对其社会互动的深入探索受到限制。为解决这些限制,我们介绍了 ChimpACT,一个用于量化黑猩猩群体内纵向行为和社会关系的综合数据集。该数据集涵盖2015至2018年间居住在德国莱比锡动物园(Leipzig Zoo)的一群20余只黑猩猩的视频,并特别记录了一只名为 Azibo 的年轻雄性黑猩猩的发展轨迹。ChimpACT 既全面又具有挑战性,包括163个视频、共计160,500帧,每帧都带有丰富的注释,包括检测、识别、姿态估计和精细的时空行为标签。我们在 ChimpACT 上对三类代表性方法进行了基准测试:(i)跟踪与识别,(ii)姿态估计,(iii)时空动作检测。实验表明,ChimpACT 为设计新方法和改造现有方法提供了充足机会,以解决应用于黑猩猩群体的检测、姿态估计和行为分析等基础计算机视觉任务,最终深化我们对非人类灵长类动物交流与社会性的理解。

On Pixel-level Performance Assessment in Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.16435
  • repo_url: None
  • paper_authors: Mehdi Rafiei, Toby P. Breckon, Alexandros Iosifidis
  • for: 本研究旨在探讨 anomaly detection 方法在不同应用中的表现,特别是在像素级别时的评估带来的复杂挑战。
  • methods: 本研究使用了 eleven 种现代 anomaly detection 方法,应用于 twenty-one 个 anomaly detection 问题。
  • results: 经过广泛的实验评估,研究人员发现,使用 Precision-Recall 基于的 metric 可以更好地捕捉方法的表现,这些 metric 更适合用于这种任务。
    Abstract Anomaly detection methods have demonstrated remarkable success across various applications. However, assessing their performance, particularly at the pixel-level, presents a complex challenge due to the severe imbalance that is most commonly present between normal and abnormal samples. Commonly adopted evaluation metrics designed for pixel-level detection may not effectively capture the nuanced performance variations arising from this class imbalance. In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis, leading to delve into the need for evaluation metrics that account for the imbalance. We offer insights into more accurate metrics, using eleven leading contemporary anomaly detection methods on twenty-one anomaly detection problems. Overall, from this extensive experimental evaluation, we can conclude that Precision-Recall-based metrics can better capture relative method performance, making them more suitable for the task.
    摘要 异常检测方法在各种应用领域中都表现出了显著的成功。然而,评估这些方法的性能,特别是在像素级别,因正常样本与异常样本之间通常存在的严重类别不平衡而变得复杂。通常采用的像素级评估指标可能无法准确捕捉由这种类别不平衡引起的细微性能差异。本文通过视觉证据和统计分析剖析了这一挑战的复杂性,并论证了需要考虑类别不平衡的评估指标。我们在21个异常检测问题上使用11种当代领先的异常检测方法进行了广泛的实验评估。总体而言,从这些实验结果可以得出结论:基于精确率-召回率(Precision-Recall)的指标能更好地反映方法的相对性能,因而更适合该任务。
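The effect of class imbalance on pixel-level evaluation can be illustrated with a tiny synthetic example: with very few anomalous pixels, AUROC can look strong while average precision (the area under the precision-recall curve) reveals that most flagged pixels would be false positives. This is a generic illustration with random scores, not any of the evaluated detectors:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# 100k normal pixels vs. 100 anomalous pixels (0.1% positive rate)
normal_scores = rng.normal(0.0, 1.0, size=100_000)
anomaly_scores = rng.normal(2.0, 1.0, size=100)

y_true = np.concatenate([np.zeros_like(normal_scores), np.ones_like(anomaly_scores)])
y_score = np.concatenate([normal_scores, anomaly_scores])

print(f"AUROC:             {roc_auc_score(y_true, y_score):.3f}")            # ~0.92, looks strong
print(f"Average precision: {average_precision_score(y_true, y_score):.3f}")  # far lower
# The ROC curve is dominated by the huge number of easy true negatives, while
# the PR curve directly exposes the precision paid for each recalled anomaly.
```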

Winning Prize Comes from Losing Tickets: Improve Invariant Learning by Exploring Variant Parameters for Out-of-Distribution Generalization

  • paper_url: http://arxiv.org/abs/2310.16391
  • repo_url: None
  • paper_authors: Zhuo Huang, Muyang Li, Li Shen, Jun Yu, Chen Gong, Bo Han, Tongliang Liu
  • for: EVIL aims to improve OOD generalization by identifying a robust subnetwork that is resistant to distribution shift.
  • methods: EVIL leverages distribution knowledge to find both invariant and variant parameters, and uses them to improve OOD generalization.
  • results: EVIL can effectively and efficiently enhance many popular methods, such as ERM, IRM, SAM, etc., on an integrated testbed called DomainBed.
    Abstract Out-of-Distribution (OOD) Generalization aims to learn robust models that generalize well to various environments without fitting to distribution-specific features. Recent studies based on Lottery Ticket Hypothesis (LTH) address this problem by minimizing the learning target to find some of the parameters that are critical to the task. However, in OOD problems, such solutions are suboptimal as the learning task contains severe distribution noises, which can mislead the optimization process. Therefore, apart from finding the task-related parameters (i.e., invariant parameters), we propose Exploring Variant parameters for Invariant Learning (EVIL) which also leverages the distribution knowledge to find the parameters that are sensitive to distribution shift (i.e., variant parameters). Once the variant parameters are left out of invariant learning, a robust subnetwork that is resistant to distribution shift can be found. Additionally, the parameters that are relatively stable across distributions can be considered invariant ones to improve invariant learning. By fully exploring both variant and invariant parameters, our EVIL can effectively identify a robust subnetwork to improve OOD generalization. In extensive experiments on integrated testbed: DomainBed, EVIL can effectively and efficiently enhance many popular methods, such as ERM, IRM, SAM, etc.
    摘要 外部分布(OOD)泛化目标是学习具有良好泛化能力的模型,以适应不同环境中的数据分布。近年来,基于抽奖假设(LTH)的研究提出了以最小化学习目标来找到任务相关的参数的方法,但在OOD问题中,这些解决方案是不优化的,因为学习任务中存在严重的分布噪声,这可能会导致优化过程受到束缚。因此,我们提出了尝试探索变体参数来实现泛化学习(EVIL),该方法还利用了分布知识来找到分布转移中不稳定的参数。一旦变体参数被去除,我们可以找到一个鲁棒的子网络,该子网络对分布转移具有抗性。此外,可以考虑相对稳定的参数作为惰性参数,以改进泛化学习。通过全面探索变体和惰性参数,我们的EVIL可以有效地找到一个鲁棒的子网络,以提高OOD泛化能力。在各种流行的方法基础上,如ERM、IRM、SAM等,我们在DomainBed集成测试床上进行了广泛的实验,EVIL可以高效地和高质量地增强这些方法。
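One toy way to operationalise the split into "variant" and "invariant" parameters is to score every parameter by how much its gradient varies across training environments and mark the most variant fraction for exclusion from invariant learning. This is only an illustrative approximation of the idea, not EVIL's actual criterion or training procedure:

```python
import torch
import torch.nn as nn

def variant_parameter_masks(model, env_batches, loss_fn, variant_ratio=0.2):
    """Return {name: bool mask} marking parameters whose gradients vary most
    across environments (candidate 'variant' parameters to exclude)."""
    per_env_grads = []
    for x, y in env_batches:                      # one (x, y) batch per environment
        model.zero_grad()
        loss_fn(model(x), y).backward()
        per_env_grads.append({n: p.grad.detach().clone()
                              for n, p in model.named_parameters()})
    masks = {}
    for name, _ in model.named_parameters():
        grads = torch.stack([g[name] for g in per_env_grads])   # (E, *shape)
        score = grads.var(dim=0, unbiased=False)                # cross-environment variance
        k = max(1, int(variant_ratio * score.numel()))
        threshold = score.flatten().topk(k).values.min()
        masks[name] = score >= threshold                        # True = variant
    return masks

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()
    # two synthetic environments with different input statistics
    envs = [(torch.randn(64, 10), torch.randint(0, 2, (64,))),
            (torch.randn(64, 10) * 3 + 1, torch.randint(0, 2, (64,)))]
    masks = variant_parameter_masks(model, envs, loss_fn)
    # e.g. freeze (or zero-out updates of) the variant entries during invariant learning
    print({n: int(m.sum()) for n, m in masks.items()})
```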

MVFAN: Multi-View Feature Assisted Network for 4D Radar Object Detection

  • paper_url: http://arxiv.org/abs/2310.16389
  • repo_url: None
  • paper_authors: Qiao Yan, Yihan Wang
  • for: 这篇论文的目的是提出一种基于4D雷达的3D对象检测方法,以提高自动驾驶系统的能力和可靠性。
  • methods: 该方法基于一个新的Position Map Generation模块,用于增强特征学习,并且使用了一种新的Radar Feature Assisted backbone来全面利用4D雷达传感器提供的Doppler速度和反射率数据。
  • results: 对Astyx和VoD数据集进行了广泛的实验和ablation研究,证明了该方法的有效性,特别是对小移动目标物如行人和自行车的检测性能有了明显的改善。
    Abstract 4D radar is recognized for its resilience and cost-effectiveness under adverse weather conditions, thus playing a pivotal role in autonomous driving. While cameras and LiDAR are typically the primary sensors used in perception modules for autonomous vehicles, radar serves as a valuable supplementary sensor. Unlike LiDAR and cameras, radar remains unimpaired by harsh weather conditions, thereby offering a dependable alternative in challenging environments. Developing radar-based 3D object detection not only augments the competency of autonomous vehicles but also provides economic benefits. In response, we propose the Multi-View Feature Assisted Network (\textit{MVFAN}), an end-to-end, anchor-free, and single-stage framework for 4D-radar-based 3D object detection for autonomous vehicles. We tackle the issue of insufficient feature utilization by introducing a novel Position Map Generation module to enhance feature learning by reweighing foreground and background points, and their features, considering the irregular distribution of radar point clouds. Additionally, we propose a pioneering backbone, the Radar Feature Assisted backbone, explicitly crafted to fully exploit the valuable Doppler velocity and reflectivity data provided by the 4D radar sensor. Comprehensive experiments and ablation studies carried out on Astyx and VoD datasets attest to the efficacy of our framework. The incorporation of Doppler velocity and RCS reflectivity dramatically improves the detection performance for small moving objects such as pedestrians and cyclists. Consequently, our approach culminates in a highly optimized 4D-radar-based 3D object detection capability for autonomous driving systems, setting a new standard in the field.
    摘要 四维度雷达被广泛应用于自动驾驶领域,因其鲜为人知的优点,包括可靠性和成本效益。雷达不同于激光雷达和摄像头,在恶劣天气条件下仍然能够保持高度可靠,因此在自动驾驶系统中扮演着重要的辅助角色。为了提高自动驾驶系统的可靠性和安全性,我们提出了基于四维度雷达的三维物体检测方法,即多视图特征帮助网络(MVFAN)。我们解决了尚未充分利用特征的问题,通过引入新的位置图生成模块,以提高特征学习的灵活性和可靠性。此外,我们还提出了一种创新的干扰抑制器,以避免由雷达点云的不规则分布所引起的干扰。此外,我们还特制了一种雷达特征帮助核心,以全面利用雷达传感器提供的Doppler速度和反射率数据。我们在Astyx和VoD数据集上进行了广泛的实验和缺省研究,结果表明,我们的框架在检测小运动目标(如行人和自行车)的表现出色。因此,我们的方法在自动驾驶系统中提供了一种高度优化的四维度雷达基于三维物体检测能力,为自动驾驶领域的发展提供了新的标准。

General Point Model with Autoencoding and Autoregressive

  • paper_url: http://arxiv.org/abs/2310.16861
  • repo_url: None
  • paper_authors: Zhe Li, Zhangyang Gao, Cheng Tan, Stan Z. Li, Laurence T. Yang
  • for: 这篇论文旨在探讨大语言模型的预训练架构,以及如何使用这些架构来提高点云表示能力。
  • methods: 该论文提出了一种通用的点云模型(General Point Model,GPM),该模型结合了自编码和自回归任务,并可以进行精细调整以适应不同的下游任务。
  • results: 对比 Point-BERT、MaskPoint 和 PointMAE 等模型,GPM在点云理解任务中表现出色,并且在条件生成任务中也可以达到比较高的水平。此外,GPM的核心思想是将自编码和自回归任务融合到同一个 transformer 中,这使得模型在不同的下游任务上具有更大的灵活性。
    Abstract The pre-training architectures of large language models encompass various types, including autoencoding models, autoregressive models, and encoder-decoder models. We posit that any modality can potentially benefit from a large language model, as long as it undergoes vector quantization to become discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which seamlessly integrates autoencoding and autoregressive tasks in point cloud transformer. This model is versatile, allowing fine-tuning for downstream point cloud representation tasks, as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks, leading to improved performance in point cloud understanding. Additionally, GPM demonstrates highly competitive results in unconditional point cloud generation tasks, even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves superior performance in point cloud understanding tasks. Furthermore, the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.
    摘要 大型语言模型的预训练架构包括自编码模型、自回归模型以及编码器-解码器模型。我们认为任何modal都可能受益于大语言模型,只要它经过vector量化成为简单的token。受GLM的 inspirations所影响,我们提议一种通用点模型(GPM),该模型可以在点云trasnformer中协调自编码和自回归任务。这个模型非常灵活,可以进行下游点云表示任务的细化调整,以及无条件和条件生成任务。GPM通过不同的mask padding任务来提高autoencoding中的假值预测,从而提高点云理解性能。此外,GPM在无条件点云生成任务中表现出非常竞争力,甚至可以通过修改输入的条件信息来实现条件生成任务。相比之下,与Point-BERT、MaskPoint和PointMAE等模型相比,我们的GPM在点云理解任务中表现出较高的性能。此外,将自回归和自编码任务嵌入同一个transformer中,强调了这个模型的多功能性在不同的下游任务中。
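A minimal sketch of how masked autoencoding and autoregressive prediction can share a single transformer over discrete (vector-quantized) point tokens: the same backbone is run once with randomly masked inputs to predict the masked tokens, and once under a causal attention mask to predict the next token. The vocabulary size, masking ratio, and the tiny model are placeholders; the real GPM is considerably more involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPointTokenModel(nn.Module):
    def __init__(self, vocab=1024, dim=256, depth=4, heads=8, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, dim)          # last index = [MASK]
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab)
        self.mask_id = vocab

    def forward(self, tokens, attn_mask=None):
        h = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        return self.head(self.encoder(h, mask=attn_mask))

def joint_loss(model, tokens, mask_ratio=0.4):
    B, L = tokens.shape
    # --- autoencoding branch: predict randomly masked tokens ----------------
    mask = torch.rand(B, L, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, model.mask_id)
    ae_logits = model(corrupted)
    ae_loss = F.cross_entropy(ae_logits[mask], tokens[mask])
    # --- autoregressive branch: predict the next token under a causal mask --
    causal = torch.full((L, L), float("-inf"), device=tokens.device).triu(1)
    ar_logits = model(tokens, attn_mask=causal)
    ar_loss = F.cross_entropy(ar_logits[:, :-1].reshape(-1, ar_logits.size(-1)),
                              tokens[:, 1:].reshape(-1))
    return ae_loss + ar_loss

if __name__ == "__main__":
    model = TinyPointTokenModel()
    tokens = torch.randint(0, 1024, (2, 128))   # discrete point-cloud tokens after VQ
    loss = joint_loss(model, tokens)
    loss.backward()
    print(float(loss))
```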

Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles

  • paper_url: http://arxiv.org/abs/2310.16388
  • repo_url: None
  • paper_authors: Aagam Bakliwal, Amit D. Joshi
  • for: 这个研究目的是为了实现深伪检测中的动态实体验证。
  • methods: 这个方法结合了进步的2D和3D卷积神经网,其中3D模型通过滑动范围实现了空间和时间维度上的特征捕捉。
  • results: 实验显示,这种组合实现了优异的验证效果,表明它具有对深伪生成的欺骗行为的应急应对能力。
    Abstract In the dynamic realm of deepfake detection, this work presents an innovative approach to validate video content. The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is uniquely tailored to capture spatiotemporal features via sliding filters, extending through both spatial and temporal dimensions. This configuration enables nuanced pattern recognition in pixel arrangement and temporal evolution across frames. Simultaneously, the 2D model leverages EfficientNet architecture, harnessing auto-scaling in Convolutional Neural Networks. Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted Ensembling. Strategic prioritization of the 3-dimensional model's output capitalizes on its exceptional spatio-temporal feature extraction. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation's deceptive practices.
    摘要 在深度伪造检测领域中,这项工作提出了一种创新的方法来验证视频内容。该方法结合了先进的2维和3维卷积神经网络。3D模型专门设计了在空间和时间维度上滑动的滤波器,以捕捉时空特征,从而实现对像素排列及其跨帧时间演化的细致模式识别。同时,2D模型采用了EfficientNet架构,发挥了卷积神经网络自动缩放的优势。此外,该集成还包括投票集成和自适应加权集成,并在策略上优先采用3维模型的输出,以充分利用其出色的时空特征提取能力。实验验证了这种策略的有效性,表明其在应对深度伪造生成的欺骗行为方面具有潜力。
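The ensembling step can be sketched as follows: per-video fake-probabilities from the 2D (frame-level) and 3D (clip-level) models are fused with adaptive weights, with the 3D stream given the larger weight in line with the prioritisation described above. How the weights are derived and how frame scores are pooled per video are assumptions for illustration:

```python
import numpy as np

def weights_from_validation_accuracy(acc_2d: float, acc_3d: float):
    """One simple way to derive 'adaptive' ensemble weights."""
    total = acc_2d + acc_3d
    return acc_2d / total, acc_3d / total

def adaptive_weighted_ensemble(prob_2d, prob_3d, w_2d, w_3d):
    """Weighted soft vote of the two streams' fake-probabilities; the 3D
    stream typically receives the larger weight for its spatio-temporal cues."""
    return w_2d * np.asarray(prob_2d) + w_3d * np.asarray(prob_3d)

if __name__ == "__main__":
    prob_2d = np.array([0.30, 0.80, 0.55])   # per-video scores from the 2D (frame) model
    prob_3d = np.array([0.20, 0.90, 0.70])   # per-video scores from the 3D (clip) model
    w2, w3 = weights_from_validation_accuracy(acc_2d=0.91, acc_3d=0.95)  # assumed accuracies
    fused = adaptive_weighted_ensemble(prob_2d, prob_3d, w2, w3)
    print(fused, (fused >= 0.5).astype(int))  # final real/fake decisions
```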

Frequency-Aware Transformer for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2310.16387
  • repo_url: None
  • paper_authors: Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
  • for: 提高了图像压缩和传输的效率,解决了现有LIC方法中的约束和方向细节损失问题。
  • methods: 我们提出了一种新的频率意识变换块(FAT),通过多级方向分析来捕捉自然图像的频率组成。此外,我们还引入了频率调制Feedforward网络(FMFFN)来适应不同频率组成,提高了比特率-误差性能。最后,我们提出了一种基于变换器的通道自动循环(T-CA)模型,有效地利用通道相互关系。
  • results: 我们的方法在BD-率上比现有LIC方法更高,并且胜过最新的标准化编码器VTM-12.1 by 14.5%, 15.1%, 13.0% on the Kodak, Tecnick, and CLIC datasets。
    Abstract Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in latent representation due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that for the first time achieves multiscale directional ananlysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods, and evidently outperforms latest standardized codec VTM-12.1 by 14.5%, 15.1%, 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets.
    摘要 现代学习图像压缩(LIC)技术在最近几年来得到了广泛应用和推广,但现有的LIC方法具有重复的 latent 表示,导致不能够准确捕捉自然图像的多方位频率成分和方向细节。为解决这些挑战,我们提出了一种新的频率意识转换块(FAT),该块包括频率分解窗口注意力(FDWA)模块,以捕捉自然图像的多方位频率成分。此外,我们还引入了频率调制Feedforward网络(FMFFN),以适应不同频率成分的改变,从而提高rate-distortion性能。此外,我们还提出了基于 transformer 的渠道 wise 自动逆生成(T-CA)模型,该模型能够有效利用渠道之间的依赖关系。实验表明,我们的方法在 rate-distortion 性能方面与现有的 LIC 方法相比,有州最佳的表现,并且明显超过最新的标准化编码器 VTM-12.1 的14.5%、15.1% 和 13.0% 的BD-rate 在 Kodak、Tecnick 和 CLIC 数据集上。

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

  • paper_url: http://arxiv.org/abs/2310.16383
  • repo_url: None
  • paper_authors: Hao Zhang, Fang Li, Narendra Ahuja
  • for: 解决Neural Radiance Fields(NeRF)中的对象分解问题,以便在3D重建和视觉合成中进行物体操作。
  • methods: 利用大规模的、可用的、Segment Anything Model(SAM)和嵌入式并蒸馈法来实现开放词汇查询的灵活性和3D分 segmentation的准确性。
  • results: 比静脉补充(LERF)和FFD(FFD)在开放词汇场景下表现出色,并且在干扰和杂乱特征的情况下保持一致的物体认知和细化。
    Abstract In this paper, we address the challenge of decomposing Neural Radiance Fields (NeRF) into objects from an open vocabulary, a critical task for object manipulation in 3D reconstruction and view synthesis. Current techniques for NeRF decomposition involve a trade-off between the flexibility of processing open-vocabulary queries and the accuracy of 3D segmentation. We present, Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), that leverage large-scale, off-the-shelf, segmentation models like the Segment Anything Model (SAM) and introduce an integrate-and-distill paradigm with hierarchical embeddings to achieve both the flexibility of open-vocabulary querying and 3D segmentation accuracy. Open-NeRF first utilizes large-scale foundation models to generate hierarchical 2D mask proposals from varying viewpoints. These proposals are then aligned via tracking approaches and integrated within the 3D space and subsequently distilled into the 3D field. This process ensures consistent recognition and granularity of objects from different viewpoints, even in challenging scenarios involving occlusion and indistinct features. Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF \cite{lerf} and FFD \cite{ffd} in open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF decomposition, guided by open-vocabulary queries, enabling novel applications in robotics and vision-language interaction in open-world 3D scenes.
    摘要 在这篇论文中,我们解决了将神经辐射场(NeRF)分解为开放词汇对象的问题,这是3D重建和视图合成中进行物体操作的关键任务。现有的NeRF分解技术在开放词汇查询的灵活性与3D分割精度之间存在权衡。我们提出了开放词汇嵌入神经辐射场(Open-NeRF),它利用大规模的现成分割模型(如Segment Anything Model,SAM),并引入带有层次嵌入的整合与蒸馏范式,从而兼顾开放词汇查询的灵活性和3D分割精度。Open-NeRF首先利用大规模基础模型从不同视点生成层次化的2D掩码提案,然后通过跟踪方法对这些提案进行对齐,在3D空间中加以整合,并进一步蒸馏到3D场,从而即使在存在遮挡和特征模糊的挑战性场景中,也能保证不同视点下物体识别和粒度的一致性。实验结果表明,我们提出的Open-NeRF在开放词汇场景下优于现有的LERF \cite{lerf}和FFD \cite{ffd}。Open-NeRF为由开放词汇查询引导的NeRF分解提供了一种有前景的解决方案,可在开放世界3D场景中支持机器人和视觉语言交互等新应用。

Towards Large-scale Masked Face Recognition

  • paper_url: http://arxiv.org/abs/2310.16364
  • repo_url: None
  • paper_authors: Manyuan Zhang, Bingqi Ma, Guanglu Song, Yunxiao Wang, Hongsheng Li, Yu Liu
  • for: 本研究的目的是提出一种在COVID-19 coronavirus 疫情期间大规模戴口罩的人脸识别算法冠军解决方案。
  • methods: 本研究使用的方法包括大规模训练、数据噪声处理、戴口罩和不戴口罩人脸识别精度平衡等四个挑战。
  • results: 本研究在ICCV MFR WebFace260M 和 InsightFace 无结构授益识别 tracks 上实现了冠军成绩,并提出了一种适用于大规模戴口罩人脸识别的推理友好模型体系。
    Abstract During the COVID-19 coronavirus epidemic, almost everyone is wearing masks, which poses a huge challenge for deep learning-based face recognition algorithms. In this paper, we will present our \textbf{championship} solutions in ICCV MFR WebFace260M and InsightFace unconstrained tracks. We will focus on four challenges in large-scale masked face recognition, i.e., super-large scale training, data noise handling, masked and non-masked face recognition accuracy balancing, and how to design inference-friendly model architecture. We hope that the discussion on these four aspects can guide future research towards more robust masked face recognition systems.
    摘要 durante la epidemia de COVID-19 del coronavirus, prácticamente todos están usando mascarillas, lo que plantea un gran desafío para los algoritmos de reconocimiento de rostros basados en aprendizaje profundo. En este artículo, presentaremos nuestras soluciones campeonas en las pistas MFR WebFace260M e InsightFace de ICCV. Centraremonos en cuatro desafíos en el reconocimiento de rostros mascados a gran escala, es decir, la capacitación en escalas supergrandes, el manejo de ruido de datos, el equilibrio entre la precisión de reconocimiento de rostros mascados y no mascados, y cómo diseñar arquitecturas de modelos amigables con la inferencia. Esperamos que el debate sobre estos cuatro aspectos pueda guiar la investigación futura hacia sistemas de reconocimiento de rostros más robustos con mascarillas.

DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.16349
  • repo_url: None
  • paper_authors: Se-Ho Kim, Inyong Koo, Inyoung Lee, Byeongjun Park, Changick Kim
  • for: 提高3D物体检测器的性能
  • methods: 使用diffusion过程进行提案精炼(proposal refinement)
  • results: 在KITTI数据集上实现了高性能的3D物体检测
    Abstract Denoising diffusion models show remarkable performances in generative tasks, and their potential applications in perception tasks are gaining interest. In this paper, we introduce a novel framework named DiffRef3D which adopts the diffusion process on 3D object detection with point clouds for the first time. Specifically, we formulate the proposal refinement stage of two-stage 3D object detectors as a conditional diffusion process. During training, DiffRef3D gradually adds noise to the residuals between proposals and target objects, then applies the noisy residuals to proposals to generate hypotheses. The refinement module utilizes these hypotheses to denoise the noisy residuals and generate accurate box predictions. In the inference phase, DiffRef3D generates initial hypotheses by sampling noise from a Gaussian distribution as residuals and refines the hypotheses through iterative steps. DiffRef3D is a versatile proposal refinement framework that consistently improves the performance of existing 3D object detection models. We demonstrate the significance of DiffRef3D through extensive experiments on the KITTI benchmark. Code will be available.
    摘要 diffusion 模型在生成任务中表现出色,其在感知任务中的潜在应用也引起了关注。在这篇论文中,我们介绍了一个名为DiffRef3D的新框架,它在3D物体检测中使用点云的扩散过程来进行首次应用。具体来说,我们将两stage 3D物体检测器的提议改进阶段设计为一个条件的扩散过程。在训练过程中,DiffRef3D逐渐添加了随机噪声到提议和目标对象之间的差异,然后将这些噪声应用到提议中来生成假设。提升模块使用这些假设来减少噪声并生成准确的盒子预测。在推断阶段,DiffRef3D通过随机从 Gaussian 分布中采样噪声来生成初始假设,然后通过迭代步骤来修改这些假设,以生成高精度的盒子预测。DiffRef3D 是一种通用的提议改进框架,可以在现有的 3D 物体检测模型上逐次提高性能。我们通过对 KITTI benchmark 进行了广泛的实验,证明了DiffRef3D 的重要性。代码将可以获得。
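A toy version of the training step described above: the residual between each proposal box and its target is treated as the diffusion variable, noised with a standard DDPM forward process at a random timestep, and a small network conditioned on the proposal's RoI features is trained to recover the clean residual. The schedule, the choice of predicting the residual rather than the noise, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)              # \bar{alpha}_t

class ResidualRefiner(nn.Module):
    """Predicts the clean 7-DoF residual (x, y, z, l, w, h, yaw) from the
    noisy residual, the timestep, and per-proposal RoI features."""
    def __init__(self, feat_dim=128, box_dim=7, hidden=256):
        super().__init__()
        self.time_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Linear(box_dim + feat_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, box_dim))

    def forward(self, noisy_residual, t, roi_feat):
        h = torch.cat([noisy_residual, roi_feat, self.time_embed(t)], dim=-1)
        return self.net(h)

def training_step(model, proposals, gt_boxes, roi_feat):
    residual = gt_boxes - proposals                           # x_0 of the diffusion process
    t = torch.randint(0, T, (residual.size(0),))
    ab = alpha_bars[t].unsqueeze(-1)
    noise = torch.randn_like(residual)
    noisy = ab.sqrt() * residual + (1.0 - ab).sqrt() * noise  # q(x_t | x_0)
    pred_residual = model(noisy, t, roi_feat)
    return F.mse_loss(pred_residual, residual)                # refined box = proposal + prediction

if __name__ == "__main__":
    model = ResidualRefiner()
    proposals, gt = torch.randn(32, 7), torch.randn(32, 7)
    roi_feat = torch.randn(32, 128)
    loss = training_step(model, proposals, gt, roi_feat)
    loss.backward()
    print(float(loss))
```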

Dolfin: Diffusion Layout Transformers without Autoencoder

  • paper_url: http://arxiv.org/abs/2310.16305
  • repo_url: None
  • paper_authors: Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, Zhuowen Tu
  • for: 这篇论文旨在提出一种新的生成模型,即Diffusion Layout Transformers without Autoencoder (Dolfin),该模型可以有效地提高生成能力,同时减少计算复杂性。
  • methods: Dolfin使用Transformer-based噪声过程来实现布局生成,并提出了一种有效的bi-directional(非 causal joint)序列表示方法,以及一种autoregressive噪声模型(Dolfin-AR),能够更好地捕捉邻近对象之间的 semantics相关性,如对齐、大小和覆盖。
  • results: 对标准生成布局Benchmark进行评估,Dolfin显著提高了各种指标(fid, alignment, overlap, MaxIoU和DocSim scores),同时提高了透明度和可操作性。此外,Dolfin的应用不仅限于布局生成,还适用于模型几何结构,如直线段。实验结果表明Dolfin具有优势。
    Abstract In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations for the neighboring objects, such as alignment, size, and overlap. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics (fid, alignment, overlap, MaxIoU and DocSim scores), enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it suitable for modeling geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.
    摘要 在这篇论文中,我们引入了一种新的生成模型,即扩散布局变换器无自编码器(Dolfin),它能够显著提高模型化能力而减少复杂性,相比现有的方法。Dolfin使用Transformer基于的扩散过程来模型布局生成。除了高效的双向(非 causal 联合)序列表示之外,我们还提出了一种激进的扩散模型(Dolfin-AR),它尤其适合捕捉邻近对象的丰富semantic相关性,如对齐、大小和重叠。当评估了标准的生成布局benchmark时,Dolfin明显提高了多个指标(fid、对齐、重叠、MaxIoU和DocSim分数),从而提高了透明度和可操作性。此外,Dolfin的应用场景不仅包括布局生成,还适用于模型Geometric结构,如直线段。我们的实验包括qualitative和quantitative结果,以demonstrate Dolfin的优势。

4D-Editor: Interactive Object-level Editing in Dynamic Neural Radiance Fields via 4D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.16858
  • repo_url: None
  • paper_authors: Dadong Jiang, Zhihui Ke, Xiaobo Zhou, Xidong Shi
  • for: 这 paper 的目的是实现在动态场景中进行交互式对象水平编辑(例如,删除、重新颜色、变换、组合)。
  • methods: 这 paper 使用的方法包括 hybrid semantic feature fields 来保持空间时间一致性,以及 recursive selection refinement 来提高动态 NeRF 中的 segmentation 精度。
  • results: 大量实验和编辑示例表明,4D-Editor 可以实现高品质的动态 NeRF 编辑。
    Abstract This paper targets interactive object-level editing(e.g., deletion, recoloring, transformation, composition) in dynamic scenes. Recently, some methods aiming for flexible editing static scenes represented by neural radiance field (NeRF) have shown impressive synthesis quality, while similar capabilities in time-variant dynamic scenes remain limited. To solve this problem, we propose 4D-Editor, an interactive semantic-driven editing framework, allowing editing multiple objects in dynamic NeRF based on user strokes on a single frame. Our dynamic scene representation is built upon hybrid semantic feature fields so that the spatial-temporal consistency can be maintained after editing. In addition, we design recursive selection refinement that significantly boosts segmentation accuracy in a dynamic NeRF to aid the editing process. Moreover, we develop multi-view reprojection inpainting to fill holes caused by incomplete scene capture after editing. Extensive experiments and editing examples on real-world demonstrate that 4D-Editor achieves photo-realistic dynamic NeRF editing. Project page: https://patrickddj.github.io/4D-Editor
    摘要 这篇论文针对动态场景中的交互式对象级编辑(例如删除、重新着色、变换、组合)。近来,一些面向由神经辐射场(NeRF)表示的静态场景的灵活编辑方法已展现出令人印象深刻的合成质量,但在随时间变化的动态场景中,类似能力仍然有限。为解决这一问题,我们提出了4D-Editor,一个交互式语义驱动的编辑框架,只需用户在单帧上的笔画即可编辑动态NeRF中的多个对象。我们的动态场景表示建立在混合语义特征场之上,从而在编辑后保持时空一致性。此外,我们设计了递归选择细化(recursive selection refinement),显著提升动态NeRF中的分割精度,以辅助编辑过程。我们还开发了多视图重投影修补(inpainting),用于填补编辑后因场景捕捉不完整而产生的空洞。在真实场景上的大量实验和编辑示例表明,4D-Editor可以实现照片级真实感的动态NeRF编辑。项目页面:https://patrickddj.github.io/4D-Editor

MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network

  • paper_url: http://arxiv.org/abs/2310.16288
  • repo_url: https://github.com/taatiteam/motionagformer
  • paper_authors: Soroush Mehraban, Vida Adeli, Babak Taati
  • for: 本研究旨在提出一种新的注意力GCNFormer块(AGFormer),以提高3D人姿估计中的本地关系学习。
  • methods: 该模型使用两个并行的 transformer 流水线和 GCNFormer 流水线,并将其分解为多个 AGFormer 块。GCNFormer 模块利用邻近关节之间的本地关系,生成一个补充性的表示,并与 transformer 输出进行可靠的拟合。
  • results: 在 Human3.6M 和 MPI-INF-3DHP 两个标准测试集上,MotionAGFormer 模型 achieved state-of-the-art 结果,P1 误差分别为 38.4mm 和 16.2mm。同时,该模型使用的参数量只有一半,计算量三倍于之前的领先模型。代码和模型可以在 GitHub 上获取。
    Abstract Recent transformer-based approaches have demonstrated excellent performance in 3D human pose estimation. However, they have a holistic view and by encoding global relationships between all the joints, they do not capture the local dependencies precisely. In this paper, we present a novel Attention-GCNFormer (AGFormer) block that divides the number of channels by using two parallel transformer and GCNFormer streams. Our proposed GCNFormer module exploits the local relationship between adjacent joints, outputting a new representation that is complementary to the transformer output. By fusing these two representation in an adaptive way, AGFormer exhibits the ability to better learn the underlying 3D structure. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four different variants, which can be chosen based on the speed-accuracy trade-off. We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4mm and 16.2mm, respectively. Remarkably, it uses a quarter of the parameters and is three times more computationally efficient than the previous leading model on Human3.6M dataset. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
    摘要 近期基于transformer的方法在3D人体姿态估计中表现出色。然而,它们采用整体视角,通过编码所有关节之间的全局关系,无法精确捕捉局部依赖关系。在这篇论文中,我们提出了一种新的Attention-GCNFormer(AGFormer)块,通过两条并行的transformer流和GCNFormer流来划分通道数。我们提出的GCNFormer模块利用相邻关节之间的局部关系,输出与transformer输出互补的新表示。通过自适应方式融合这两种表示,AGFormer能够更好地学习潜在的3D结构。通过堆叠多个AGFormer块,我们提出了MotionAGFormer,共有四种不同的变体,可以根据速度-精度权衡进行选择。我们在Human3.6M和MPI-INF-3DHP两个流行的基准数据集上评估了我们的模型,MotionAGFormer-B取得了最先进的结果,P1误差分别为38.4mm和16.2mm。值得注意的是,它的参数量仅为此前在Human3.6M数据集上领先模型的四分之一,计算效率约为其三倍。代码和模型可以在https://github.com/TaatiTeam/MotionAGFormer上获取。
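A simplified sketch of the dual-stream idea: one branch applies global self-attention across the joints, the other applies a one-hop graph convolution over the skeleton adjacency to capture local relations between neighbouring joints, and the two outputs are fused with a learned per-channel gate. The adjacency normalisation and fusion form are simplified assumptions, not the MotionAGFormer code:

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim: int, num_joints: int, adjacency: torch.Tensor, heads: int = 4):
        super().__init__()
        # global branch: self-attention across all joints
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        # local branch: one-hop graph convolution over the skeleton
        A = adjacency + torch.eye(num_joints)
        d = A.sum(dim=1, keepdim=True)
        self.register_buffer("A_norm", A / d)                # row-normalized adjacency
        self.gcn_weight = nn.Linear(dim, dim)
        self.norm_gcn = nn.LayerNorm(dim)
        # adaptive per-channel fusion of the two complementary representations
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, J, C)
        a, _ = self.attn(x, x, x)
        a = self.norm_attn(x + a)
        g = self.norm_gcn(x + self.A_norm @ self.gcn_weight(x))
        w = torch.sigmoid(self.gate)                          # 0 -> GCN branch, 1 -> attention branch
        return w * a + (1.0 - w) * g

if __name__ == "__main__":
    J = 17                                                    # e.g. Human3.6M joints
    # toy chain skeleton adjacency; a real model would use the dataset's skeleton
    A = torch.zeros(J, J)
    for i in range(J - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    block = DualStreamBlock(dim=64, num_joints=J, adjacency=A)
    pose_tokens = torch.randn(2, J, 64)                       # per-frame pose embeddings
    print(block(pose_tokens).shape)                           # (2, 17, 64)
```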

TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer

  • paper_url: http://arxiv.org/abs/2310.16279
  • repo_url: None
  • paper_authors: Xiao Lin, Deming Wang, Guangliang Zhou, Chengju Liu, Qijun Chen
  • for: 提高RGB基于方法的6D对象pose估计精度,避免 occlusion 和照明变化的影响。
  • methods: 使用TransformerEncoder和geometry-aware模块,提取点云特征表示,并在全球信息交换下提高对 occlusion 的Robustness。
  • results: 在三个benchmark datasets上实现了竞争性的pose估计效果。
    Abstract Estimating the 6D object pose is an essential task in many applications. Due to the lack of depth information, existing RGB-based methods are sensitive to occlusion and illumination changes. How to extract and utilize the geometry features in depth information is crucial to achieve accurate predictions. To this end, we propose TransPose, a novel 6D pose framework that exploits Transformer Encoder with geometry-aware module to develop better learning of point cloud feature representations. Specifically, we first uniformly sample point cloud and extract local geometry features with the designed local feature extractor base on graph convolution network. To improve robustness to occlusion, we adopt Transformer to perform the exchange of global information, making each local feature contains global information. Finally, we introduce geometry-aware module in Transformer Encoder, which to form an effective constrain for point cloud feature learning and makes the global information exchange more tightly coupled with point cloud tasks. Extensive experiments indicate the effectiveness of TransPose, our pose estimation pipeline achieves competitive results on three benchmark datasets.
    摘要 估计6D对象姿态是许多应用中的关键任务。由于缺乏深度信息,现有的基于RGB的方法容易受到遮挡和光照变化的影响。如何提取并利用深度信息中的几何特征至关重要。为此,我们提出了TransPose,一种新的6D姿态估计框架,利用Transformer编码器和几何感知模块来更好地学习点云特征表示。具体来说,我们首先对点云进行均匀采样,然后使用基于图卷积网络设计的局部特征提取器提取局部几何特征。为了提高对遮挡的鲁棒性,我们采用Transformer进行全局信息交换,使每个局部特征都包含全局信息。最后,我们在Transformer编码器中引入几何感知模块,为点云特征学习构建有效约束,使全局信息交换与点云任务结合得更加紧密。大量实验表明了TransPose的有效性,我们的姿态估计管线在三个基准数据集上取得了具有竞争力的结果。
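A rough sketch of the pipeline described above: uniformly sample the point cloud, build local geometric features from each point's k nearest neighbours with a graph-convolution-style (EdgeConv-like) aggregation, then let a Transformer encoder exchange global information among the per-point features. The geometry-aware module and the pose regression head are omitted, and all sizes are placeholders:

```python
import torch
import torch.nn as nn

def uniform_sample(points: torch.Tensor, n: int) -> torch.Tensor:
    """points: (B, N, 3) -> (B, n, 3) by random uniform sampling."""
    idx = torch.randint(0, points.size(1), (points.size(0), n), device=points.device)
    return torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))

class LocalGeometryEncoder(nn.Module):
    """EdgeConv-style local feature: MLP over [x_i, x_j - x_i], max-pooled over kNN."""
    def __init__(self, out_dim=128, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(6, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, pts):                                    # (B, N, 3)
        dists = torch.cdist(pts, pts)                          # (B, N, N)
        knn = dists.topk(self.k + 1, largest=False).indices[..., 1:]   # drop self
        B, N, _ = pts.shape
        neighbors = torch.gather(
            pts.unsqueeze(1).expand(B, N, N, 3), 2,
            knn.unsqueeze(-1).expand(B, N, self.k, 3))         # (B, N, k, 3)
        center = pts.unsqueeze(2).expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)
        return self.mlp(edge).max(dim=2).values                # (B, N, out_dim)

if __name__ == "__main__":
    pts = uniform_sample(torch.randn(2, 4096, 3), n=1024)
    local_feat = LocalGeometryEncoder()(pts)                   # (2, 1024, 128)
    layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
    global_feat = nn.TransformerEncoder(layer, num_layers=2)(local_feat)
    print(global_feat.shape)                                   # ready for a pose head
```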

Deep Learning for Plant Identification and Disease Classification from Leaf Images: Multi-prediction Approaches

  • paper_url: http://arxiv.org/abs/2310.16273
  • repo_url: https://github.com/funzi-son/plant_pathology_dl
  • paper_authors: Jianping Yao, Son N. Tran, Saurabh Garg, Samantha Sawyer
  • for: 本研究主要针对现代农业中的深度学习应用,特别是使用叶子图像进行植物疾病诊断,深度学习在这一领域中扮演着重要的角色。
  • methods: 本研究使用的方法包括多种多样的深度学习模型,包括多模型、多标签、多输出和多任务模型,其中不同的底层CNN可以被使用。
  • results: 经过实验研究,我们发现使用InceptionV3作为底层CNN可以获得更好的性能,而使用单个模型也可以与使用两个模型相媲美。最终,我们提出的广义堆叠多输出CNN(GSMo-CNN)在三个标准测试集上达到了领先的性能。
    Abstract Deep learning plays an important role in modern agriculture, especially in plant pathology using leaf images where convolutional neural networks (CNN) are attracting a lot of attention. While numerous reviews have explored the applications of deep learning within this research domain, there remains a notable absence of an empirical study to offer insightful comparisons due to the employment of varied datasets in the evaluation. Furthermore, a majority of these approaches tend to address the problem as a singular prediction task, overlooking the multifaceted nature of predicting various aspects of plant species and disease types. Lastly, there is an evident need for a more profound consideration of the semantic relationships that underlie plant species and disease types. In this paper, we start our study by surveying current deep learning approaches for plant identification and disease classification. We categorise the approaches into multi-model, multi-label, multi-output, and multi-task, in which different backbone CNNs can be employed. Furthermore, based on the survey of existing approaches in plant pathology and the study of available approaches in machine learning, we propose a new model named Generalised Stacking Multi-output CNN (GSMo-CNN). To investigate the effectiveness of different backbone CNNs and learning approaches, we conduct an intensive experiment on three benchmark datasets Plant Village, Plant Leaves, and PlantDoc. The experimental results demonstrate that InceptionV3 can be a good choice for a backbone CNN as its performance is better than AlexNet, VGG16, ResNet101, EfficientNet, MobileNet, and a custom CNN developed by us. Interestingly, empirical results support the hypothesis that using a single model can be comparable or better than using two models. Finally, we show that the proposed GSMo-CNN achieves state-of-the-art performance on three benchmark datasets.
    摘要 现代农业中,深度学习扮演着重要的角色,特别是在植物病理学中使用叶片图像,其中 convolutional neural networks (CNN) 在这个领域吸引了很多关注。虽然有很多文章评论了深度学习在这个研究领域的应用,但是还没有一篇实证研究提供了有用的对比。此外,大多数方法都是单纯地视为预测问题,忽略了植物种和病种类型之间的多方面性。此外,还有一个明显的需求,即更深入地理解植物种和病种类型之间的含义关系。在本文中,我们开始我们的研究 by surveying current deep learning approaches for plant identification and disease classification.我们将这些方法分为多模型、多标签、多输出和多任务类型,其中可以使用不同的底层CNN。此外,根据现有的植物病理学方法和机器学习方法的调查,我们提出了一种新的模型 named Generalised Stacking Multi-output CNN (GSMo-CNN)。为了评估不同的底层CNN和学习方法的效果,我们在三个标准数据集(Plant Village、Plant Leaves、PlantDoc)上进行了广泛的实验。实验结果显示,InceptionV3可以作为底层CNN,其性能比AlexNet、VGG16、ResNet101、EfficientNet、MobileNet和我们自己开发的自定义CNN更好。有趣的是,实验结果支持我们的假设,即使用单个模型可以与使用两个模型相比或更好。最后,我们表明了我们提出的GSMo-CNN在三个标准数据集上达到了状态之前的最佳性能。
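A minimal multi-output baseline in the spirit of the survey's categorisation: one shared backbone with separate species and disease classification heads, trained with the sum of the two cross-entropy losses. This is a generic illustration of the multi-output setting (the paper's GSMo-CNN additionally stacks outputs across heads), and the class counts and backbone are placeholders:

```python
import torch
import torch.nn as nn

class MultiOutputPlantNet(nn.Module):
    """Shared convolutional backbone with separate species and disease heads."""
    def __init__(self, num_species=14, num_diseases=21):
        super().__init__()
        self.backbone = nn.Sequential(                        # stand-in for e.g. InceptionV3
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.species_head = nn.Linear(64, num_species)
        self.disease_head = nn.Linear(64, num_diseases)

    def forward(self, x):
        feat = self.backbone(x)
        return self.species_head(feat), self.disease_head(feat)

if __name__ == "__main__":
    model = MultiOutputPlantNet()
    images = torch.randn(8, 3, 224, 224)                      # leaf images
    species_target = torch.randint(0, 14, (8,))
    disease_target = torch.randint(0, 21, (8,))
    species_logits, disease_logits = model(images)
    loss = (nn.functional.cross_entropy(species_logits, species_target) +
            nn.functional.cross_entropy(disease_logits, disease_target))
    loss.backward()
    print(float(loss))
```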

SCB-ST-Dataset4: Extending the Spatio-Temporal Behavior Dataset in Student Classroom Scenarios Through Image Dataset Method

  • paper_url: http://arxiv.org/abs/2310.16267
  • repo_url: https://github.com/whiffe/scb-dataset
  • paper_authors: Fan Yang, Xiaofei Wang
  • for: This paper aims to provide a solution to the lack of publicly available spatio-temporal datasets on student behavior, which hinders research in the field of automatic student behavior detection using deep learning methods.
  • methods: The proposed method involves extending the existing SCB-ST-Dataset4 with an image dataset and using a Behavior Similarity Index (BSI) to explore the similarity of behaviors.
  • results: The proposed method was evaluated using four deep learning algorithms (YOLOv5, YOLOv7, YOLOv8, and SlowFast) and achieved a mean average precision (map) of up to 82.3%. The experiment demonstrated the effectiveness of the method and the dataset provides a robust foundation for future research in student behavior detection.
    Abstract Using deep learning methods to detect students' classroom behavior automatically is a promising approach for analyzing their class performance and improving teaching effectiveness. However, the lack of publicly available spatio-temporal datasets on student behavior, as well as the high cost of manually labeling such datasets, pose significant challenges for researchers in this field. To address this issue, we proposed a method for extending the spatio-temporal behavior dataset in Student Classroom Scenarios (SCB-ST-Dataset4) through image dataset. Our SCB-ST-Dataset4 comprises 754094 images with 25670 labels, focusing on 3 behaviors: hand-raising, reading, writing. Our proposed method can rapidly generate spatio-temporal behavioral datasets without requiring annotation. Furthermore, we proposed a Behavior Similarity Index (BSI) to explore the similarity of behaviors. We evaluated the dataset using the YOLOv5, YOLOv7, YOLOv8, and SlowFast algorithms, achieving a mean average precision (map) of up to 82.3%. The experiment further demonstrates the effectiveness of our method. This dataset provides a robust foundation for future research in student behavior detection, potentially contributing to advancements in this field. The SCB-ST-Dataset4 is available for download at: https://github.com/Whiffe/SCB-dataset.
    摘要 (使用深度学习方法检测学生学习环境中的行为自动化是一个有前途的方法,可以分析学生的课程表现和提高教学效果。然而,学生行为的公共可用空间时间数据集和手动标注这些数据集的高成本,对于这个领域的研究人员而言是一个大的挑战。为解决这个问题,我们提出了一种方法,通过图像集来扩展学生学习环境中的行为数据集。我们的SCB-ST-Dataset4包含754094张图像和25670个标签,关注3种行为:抬头、读书和写作。我们提出的方法可以快速生成空间时间行为数据集,不需要注解。此外,我们还提出了行为相似指数(BSI),以探索行为之间的相似性。我们使用YOLOv5、YOLOv7、YOLOv8和SlowFast算法进行评估,实现了最大平均准确率(map)达82.3%。实验证明了我们的方法的有效性。这个数据集为未来学生行为检测领域的研究提供了一个坚实的基础,有助于这一领域的进步。SCB-ST-Dataset4可以在以下链接下载:https://github.com/Whiffe/SCB-dataset。)

UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception

  • paper_url: http://arxiv.org/abs/2310.16255
  • repo_url: None
  • paper_authors: Christopher Maxey, Jaehoon Choi, Hyungtae Lee, Dinesh Manocha, Heesung Kwon
  • for: 用于提高UAV预测模型的训练数据 quantity和质量。
  • methods: 利用最新的神经渲染技术进行静态和动态新视图UAV预测图像生成,尤其是高空拍摄场景中的突出特征。
  • results: 使用混合实际和synthetic数据进行优化后,检测模型的性能得到了显著提升。
    Abstract Tremendous variations coupled with large degrees of freedom in UAV-based imaging conditions lead to a significant lack of data in adequately learning UAV-based perception models. Using various synthetic renderers in conjunction with perception models is prevalent to create synthetic data to augment the learning in the ground-based imaging domain. However, severe challenges in the austere UAV-based domain require distinctive solutions to image synthesis for data augmentation. In this work, we leverage recent advancements in neural rendering to improve static and dynamic novelview UAV-based image synthesis, especially from high altitudes, capturing salient scene attributes. Finally, we demonstrate a considerable performance boost is achieved when a state-ofthe-art detection model is optimized primarily on hybrid sets of real and synthetic data instead of the real or synthetic data separately.
    摘要 巨大的变化和大量的自由度在无人机图像环境中导致学习无人机图像模型的数据缺乏。使用各种合成渲染器和感知模型是常见的做法来创建合成数据以增强地面上的图像学习。然而,无人机图像领域的恶劣环境需要特有的解决方案来synthesize图像,尤其是从高空拍摄的场景。在这种情况下,我们利用最新的神经渲染技术来提高静止和动态新视图无人机图像synthesize,特别是高空拍摄的场景。最终,我们示出了将状态之最佳检测模型优化为主要使用混合的实际和合成数据集,而不是单独使用实际数据或合成数据,可以获得显著的性能提升。

GraFT: Gradual Fusion Transformer for Multimodal Re-Identification

  • paper_url: http://arxiv.org/abs/2310.16856
  • repo_url: None
  • paper_authors: Haoli Yin, Jiayao Li, Eva Schiller, Luke McDermott, Daniel Cummings
  • for: 本研究旨在提出一种能够有效地进行多Modal ReID的模型,以满足计算机视觉领域中增加模式的需求。
  • methods: 本研究提出了一种名为Gradual Fusion Transformer(GraFT)的新模型,它使用学习扩展的协同自注意力机制,以便同时捕捉多Modal特征和物体特征。此外,研究人员还提出了一种新的训练方法和一种改进的 triplet损失函数,以便优化ReID特征空间。
  • results: 对于多Modal ReID任务,GraFT consistently 超越了现有的多Modal ReID标准准确率。此外,研究人员还通过了大量的缺失学习研究,以证明GraFT的有效性。此外,为了实现模型的部署 versatility,研究人员还提出了一种基于神经网络裁剪的方法,以实现模型的大小和性能之间的平衡。
    Abstract Object Re-Identification (ReID) is pivotal in computer vision, witnessing an escalating demand for adept multimodal representation learning. Current models, although promising, reveal scalability limitations with increasing modalities as they rely heavily on late fusion, which postpones the integration of specific modality insights. Addressing this, we introduce the \textbf{Gradual Fusion Transformer (GraFT)} for multimodal ReID. At its core, GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. Further bolstering its efficacy, we introduce a novel training paradigm combined with an augmented triplet loss, optimizing the ReID feature embedding space. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks. Additionally, aiming for deployment versatility, we've integrated neural network pruning into GraFT, offering a balance between model size and performance.
    摘要