cs.CV - 2023-11-07

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

  • paper_url: http://arxiv.org/abs/2311.04391
  • repo_url: None
  • paper_authors: Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany
  • for: 3D object detection from a single image
  • methods: Extracts features from a 3D-aware diffusion model and fine-tunes them with two specialized strategies: geometric tuning (novel view synthesis with an epipolar warp operator) and semantic tuning (detection supervision on target data)
  • results: Achieves strong 3D detection, surpassing the previous Cube-RCNN baseline on the Omni3D-ARkitscene dataset by 9.43% in AP3D, and shows good data efficiency and cross-domain generalization
    Abstract We present 3DiffTection, a state-of-the-art method for 3D object detection from single images, leveraging features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, pretrained large image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these features are initially trained on paired text and image data, which are not optimized for 3D tasks, and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image, by introducing a novel epipolar warp operator. This task meets two essential criteria: the necessity for 3D awareness and reliance solely on posed image data, which are readily available (e.g., from videos) and does not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through our methodology, we obtain 3D-aware features that are tailored for 3D detection and excel in identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a precedent in single-view 3D detection by 9.43\% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and generalization to cross-domain data.
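    The geometric tuning stage relies on epipolar geometry between the conditioning view and a target view. The paper's epipolar warp operator is not detailed in the abstract, so the following Python sketch (hypothetical names, not the authors' code) only illustrates the underlying geometry it builds on: given relative pose and intrinsics, a pixel in the source view constrains its correspondences in the target view to an epipolar line along which features can be warped or sampled.

      import numpy as np

      def skew(t):
          # Skew-symmetric matrix [t]_x such that [t]_x @ v = t x v.
          return np.array([[0.0, -t[2], t[1]],
                           [t[2], 0.0, -t[0]],
                           [-t[1], t[0], 0.0]])

      def fundamental_matrix(K_src, K_tgt, R, t):
          # F satisfies x_tgt^T F x_src = 0 for corresponding homogeneous pixels,
          # where (R, t) maps source-camera coordinates to target-camera coordinates.
          return np.linalg.inv(K_tgt).T @ skew(t) @ R @ np.linalg.inv(K_src)

      def epipolar_line_samples(F, x_src, width, n=32):
          # Sample n candidate pixel locations along the epipolar line of x_src
          # in the target view (assumes the line is not vertical, i.e. b != 0).
          a, b, c = F @ np.array([x_src[0], x_src[1], 1.0])
          us = np.linspace(0.0, width - 1.0, n)
          vs = -(a * us + c) / b
          return np.stack([us, vs], axis=1)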

Basis restricted elastic shape analysis on the space of unregistered surfaces

  • paper_url: http://arxiv.org/abs/2311.04382
  • repo_url: None
  • paper_authors: Emmanuel Hartman, Emery Pierson, Martin Bauer, Mohamed Daoudi, Nicolas Charon
  • for: Proposes a new mathematical and numerical framework for surface analysis, derived from elastic Riemannian metrics on shape spaces
  • methods: Restricts the space of allowable transformations to predefined finite-dimensional bases of deformation fields, estimated in a data-driven way from a training set
  • results: Effectively performs a variety of tasks on surface meshes, including shape registration, interpolation, motion transfer, and random pose generation, and generally outperforms state-of-the-art methods on human body shape and pose data as well as human face scans
    Abstract This paper introduces a new mathematical and numerical framework for surface analysis derived from the general setting of elastic Riemannian metrics on shape spaces. Traditionally, those metrics are defined over the infinite dimensional manifold of immersed surfaces and satisfy specific invariance properties enabling the comparison of surfaces modulo shape preserving transformations such as reparametrizations. The specificity of the approach we develop is to restrict the space of allowable transformations to predefined finite dimensional bases of deformation fields. These are estimated in a data-driven way so as to emulate specific types of surface transformations observed in a training set. The use of such bases allows to simplify the representation of the corresponding shape space to a finite dimensional latent space. However, in sharp contrast with methods involving e.g. mesh autoencoders, the latent space is here equipped with a non-Euclidean Riemannian metric precisely inherited from the family of aforementioned elastic metrics. We demonstrate how this basis restricted model can be then effectively implemented to perform a variety of tasks on surface meshes which, importantly, does not assume these to be pre-registered (i.e. with given point correspondences) or to even have a consistent mesh structure. We specifically validate our approach on human body shape and pose data as well as human face scans, and show how it generally outperforms state-of-the-art methods on problems such as shape registration, interpolation, motion transfer or random pose generation.
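    The basis restriction described above can be written in a few lines of Python (illustrative only, hypothetical names): a template surface is deformed by a linear combination of precomputed deformation fields, so a shape is represented by a finite-dimensional coefficient vector. The paper's non-Euclidean Riemannian metric on these latent coefficients is not modeled in this sketch.

      import numpy as np

      def deform(V, basis, alpha):
          # V:     (n, 3) vertex coordinates of a template surface
          # basis: (k, n, 3) deformation fields (e.g., estimated from a training set)
          # alpha: (k,) latent coefficients
          # Returns the deformed surface V + sum_i alpha_i * basis_i.
          return V + np.tensordot(alpha, basis, axes=1)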

A Deep Learning Approach to Video Anomaly Detection using Convolutional Autoencoders

  • paper_url: http://arxiv.org/abs/2311.04351
  • repo_url: None
  • paper_authors: Gopikrishna Pavuluri, Gayathri Annem
  • for: Detecting anomalies in videos with a deep learning approach based on convolutional autoencoder and decoder networks, evaluated on the UCSD dataset
  • methods: A convolutional autoencoder learns the spatiotemporal patterns of normal videos; each frame of a test video is then compared against this learned representation
  • results: Achieves an overall accuracy of 99.35% on UCSD Ped1 and 99.77% on UCSD Ped2, outperforming other state-of-the-art methods and showing that the approach is suitable for real-world video anomaly detection
    Abstract In this research we propose a deep learning approach for detecting anomalies in videos using convolutional autoencoder and decoder neural networks on the UCSD dataset. Our method utilizes a convolutional autoencoder to learn the spatiotemporal patterns of normal videos and then compares each frame of a test video to this learned representation. We evaluated our approach on the UCSD dataset and achieved an overall accuracy of 99.35% on the Ped1 dataset and 99.77% on the Ped2 dataset, demonstrating the effectiveness of our method for detecting anomalies in surveillance videos. The results show that our method outperforms other state-of-the-art methods, and it can be used in real-world applications for video anomaly detection.
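    A minimal Python sketch of the frame-scoring step described above (hypothetical names, not the authors' code): an autoencoder trained only on normal frames is applied to test frames, and frames with high reconstruction error are flagged as anomalous. The threshold is an assumption; the abstract does not specify how it is chosen.

      import torch

      @torch.no_grad()
      def anomaly_scores(autoencoder, frames, threshold):
          # frames: (T, C, H, W) tensor of normalized video frames.
          # Frames the autoencoder cannot reconstruct well get a high score.
          recon = autoencoder(frames)
          err = ((recon - frames) ** 2).mean(dim=(1, 2, 3))  # per-frame MSE
          return err, err > threshold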

SaFL: Sybil-aware Federated Learning with Application to Face Recognition

  • paper_url: http://arxiv.org/abs/2311.04346
  • repo_url: None
  • paper_authors: Mahdi Ghafourian, Julian Fierrez, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales
  • for: Proposes a new defense method against poisoning attacks in Federated Learning (FL)
  • methods: Uses a time-variant aggregation scheme to reduce the impact of sybil-based poisoning
  • results: SaFL effectively reduces the impact of poisoning attacks and improves the security and privacy of FL
    Abstract Federated Learning (FL) is a machine learning paradigm to conduct collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method permits to exploit the potential of massive mobile users' data for the benefit of machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have just started to be studied. To address some of the key threats in FL, researchers have proposed to use secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.
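    The abstract does not detail the time-variant aggregation scheme, so the Python sketch below (hypothetical names, not the authors' algorithm) only shows the generic shape such a defense can take: a weighted federated average in which per-client trust weights are recomputed every round, e.g. to down-weight suspected sybils.

      import numpy as np

      def weighted_aggregate(client_updates, trust):
          # client_updates: list of 1-D parameter vectors, one per client.
          # trust: per-client weights; in a time-variant scheme these would be
          #        updated at every round rather than kept fixed.
          w = np.asarray(trust, dtype=float)
          w = w / w.sum()
          return sum(wi * u for wi, u in zip(w, client_updates))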

Efficient Semantic Matching with Hypercolumn Correlation

  • paper_url: http://arxiv.org/abs/2311.04336
  • repo_url: None
  • paper_authors: Seungwook Kim, Juhong Min, Minsu Cho
  • for: Semantic matching, i.e., establishing semantically meaningful correspondences between images
  • methods: Proposes HCCNet, which exploits multi-scale correlation maps without relying on expensive match-wise relationship mining on the 4D correlation map; it slices the bottleneck features into a richer set of intermediate features and uses them to construct a hypercolumn correlation
  • results: Achieves state-of-the-art or competitive performance on standard semantic matching benchmarks with notably lower latency and computational overhead than existing SoTA methods
    Abstract Recent studies show that leveraging the match-wise relationships within the 4D correlation map yields significant improvements in establishing semantic correspondences - but at the cost of increased computation and latency. In this work, we focus on the aspect that the performance improvements of recent methods can also largely be attributed to the usage of multi-scale correlation maps, which hold various information ranging from low-level geometric cues to high-level semantic contexts. To this end, we propose HCCNet, an efficient yet effective semantic matching method which exploits the full potential of multi-scale correlation maps, while eschewing the reliance on expensive match-wise relationship mining on the 4D correlation map. Specifically, HCCNet performs feature slicing on the bottleneck features to yield a richer set of intermediate features, which are used to construct a hypercolumn correlation. HCCNet can consequently establish semantic correspondences in an effective manner by reducing the volume of conventional high-dimensional convolution or self-attention operations to efficient point-wise convolutions. HCCNet demonstrates state-of-the-art or competitive performances on the standard benchmarks of semantic matching, while incurring a notably lower latency and computation overhead compared to the existing SoTA methods.
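    As a rough PyTorch sketch of what a hypercolumn correlation can look like (hypothetical code, not HCCNet itself): cosine-similarity correlation maps are computed from several intermediate feature levels and stacked, leaving out HCCNet's feature slicing and its point-wise convolution head.

      import torch
      import torch.nn.functional as F

      def hypercolumn_correlation(feats_src, feats_tgt):
          # feats_src / feats_tgt: lists of feature maps, one per level,
          # each of shape (B, C_l, H_l, W_l).
          size = feats_src[-1].shape[-2:]  # resize every level to the coarsest resolution
          corrs = []
          for fs, ft in zip(feats_src, feats_tgt):
              fs = F.interpolate(fs, size=size, mode='bilinear', align_corners=False)
              ft = F.interpolate(ft, size=size, mode='bilinear', align_corners=False)
              fs = F.normalize(fs.flatten(2), dim=1)  # (B, C_l, HW)
              ft = F.normalize(ft.flatten(2), dim=1)
              corrs.append(torch.einsum('bcm,bcn->bmn', fs, ft))  # cosine similarities
          return torch.stack(corrs, dim=1)  # (B, L, HW, HW)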

A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization

  • paper_url: http://arxiv.org/abs/2311.04315
  • repo_url: None
  • paper_authors: Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Helge Rhodin, Ratheesh Kalarot
  • for: Injecting new, personal visual concepts (e.g., your pet or an object in your house) into text-to-image models from only a few examples, while preserving the subject's identity
  • methods: A data-centric approach: a regularization dataset generation strategy at both the text and image level, without modifying the model architecture
  • results: Generated renditions preserve fine details such as text and logos while still producing diverse samples that follow the input text prompt, with high image quality and good text alignment
    Abstract Large text-to-image models have revolutionized the ability to generate imagery using natural language. However, particularly unique or personal visual concepts, such as your pet, an object in your house, etc., will not be captured by the original model. This has led to interest in how to inject new visual concepts, bound to a new text token, using as few as 4-6 examples. Despite significant progress, this task remains a formidable challenge, particularly in preserving the subject's identity. While most researchers attempt to address this issue by modifying model architectures, our approach takes a data-centric perspective, advocating the modification of data rather than the model itself. We introduce a novel regularization dataset generation strategy on both the text and image level; demonstrating the importance of a rich and structured regularization dataset (automatically generated) to prevent losing text coherence and better identity preservation. The better quality is enabled by allowing up to 5x more fine-tuning iterations without overfitting and degeneration. The generated renditions of the desired subject preserve even fine details such as text and logos; all while maintaining the ability to generate diverse samples that follow the input text prompt. Since our method focuses on data augmentation, rather than adjusting the model architecture, it is complementary and can be combined with prior work. We show on established benchmarks that our data-centric approach forms the new state of the art in terms of image quality, with the best trade-off between identity preservation, diversity, and text alignment.

Holistic Evaluation of Text-To-Image Models

  • paper_url: http://arxiv.org/abs/2311.04287
  • repo_url: https://github.com/stanford-crfm/helm
  • paper_authors: Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang
  • for: Comprehensive evaluation of text-to-image models
  • methods: Introduces the HEIM benchmark, covering 12 aspects (text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency) across 62 curated scenarios
  • results: Evaluates 26 state-of-the-art text-to-image models and finds that no single model excels in all aspects; different models demonstrate different strengths
    Abstract The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.

Video Instance Matting

  • paper_url: http://arxiv.org/abs/2311.04212
  • repo_url: https://github.com/shi-labs/vim
  • paper_authors: Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi
  • for: Video instance matting: estimating an alpha matte for each instance at each frame of a video sequence
  • methods: Proposes MSG-VIM, a Mask Sequence Guided Video Instance Matting network that uses a mixture of mask augmentations to be robust to inaccurate and inconsistent mask guidance, plus temporal mask and temporal feature guidance for temporal consistency
  • results: On the newly built VIM50 benchmark, evaluated with the proposed VIMQ metric, the model sets a strong baseline and outperforms existing methods by a large margin
    Abstract Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting~(VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performances on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality~(VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.

Deep Hashing via Householder Quantization

  • paper_url: http://arxiv.org/abs/2311.04207
  • repo_url: https://github.com/twistedcubic/learn-to-hash
  • paper_authors: Lucas R. Schwengber, Lucas Resende, Paulo Orenstein, Roberto I. Oliveira
  • for: Improving the efficiency and performance of large-scale image similarity search using deep hashing techniques
  • methods: An alternative quantization strategy that decomposes the learning problem into two stages: first, perform similarity learning over the embedding space with no quantization; second, find an optimal orthogonal transformation of the embeddings using Householder matrices and then quantize the transformed embeddings through the sign function
  • results: The proposed algorithm leads to state-of-the-art performance on widely used image datasets and brings consistent improvements to existing deep hashing algorithms, without any hyperparameter tuning and at no cost in terms of performance
    Abstract Hashing is at the heart of large-scale image similarity search, and recent methods have been substantially improved through deep learning techniques. Such algorithms typically learn continuous embeddings of the data. To avoid a subsequent costly binarization step, a common solution is to employ loss functions that combine a similarity learning term (to ensure similar images are grouped to nearby embeddings) and a quantization penalty term (to ensure that the embedding entries are close to binarized entries, e.g., -1 or 1). Still, the interaction between these two terms can make learning harder and the embeddings worse. We propose an alternative quantization strategy that decomposes the learning problem in two stages: first, perform similarity learning over the embedding space with no quantization; second, find an optimal orthogonal transformation of the embeddings so each coordinate of the embedding is close to its sign, and then quantize the transformed embedding through the sign function. In the second step, we parametrize orthogonal transformations using Householder matrices to efficiently leverage stochastic gradient descent. Since similarity measures are usually invariant under orthogonal transformations, this quantization strategy comes at no cost in terms of performance. The resulting algorithm is unsupervised, fast, hyperparameter-free and can be run on top of any existing deep hashing or metric learning algorithm. We provide extensive experimental results showing that this approach leads to state-of-the-art performance on widely used image datasets, and, unlike other quantization strategies, brings consistent improvements in performance to existing deep hashing algorithms.
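    The two-stage recipe above lends itself to a compact sketch. Assuming the embeddings Z have already been learned in the first stage, the Python snippet below (hypothetical names, not the authors' released code) parameterizes an orthogonal transform as a product of Householder reflections and penalizes the distance between the rotated embeddings and their signs, which is the second-stage objective the abstract describes; the paper's exact loss and optimization details may differ.

      import torch

      def householder_product(vs):
          # vs: (k, d) reflection vectors. Returns the orthogonal d x d matrix
          # H_1 H_2 ... H_k with H_i = I - 2 v_i v_i^T / ||v_i||^2.
          d = vs.shape[1]
          Q = torch.eye(d, device=vs.device, dtype=vs.dtype)
          for v in vs:
              v = v / v.norm()
              Q = Q - 2.0 * torch.outer(Q @ v, v)  # Q @ (I - 2 v v^T)
          return Q

      def quantization_loss(Z, vs):
          # Z: (n, d) fixed embeddings from stage one; vs: learnable reflection vectors.
          Q = householder_product(vs)
          Y = Z @ Q.T                                # rotated embeddings
          return ((Y - torch.sign(Y)) ** 2).mean()   # push entries towards -1 / +1

      # Usage sketch: optimize vs (e.g. with Adam) to minimize quantization_loss;
      # the binary codes are then torch.sign(Z @ householder_product(vs).T).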

High-fidelity 3D Reconstruction of Plants using Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2311.04154
  • repo_url: None
  • paper_authors: Kewei Hu, Ying Wei, Yaoqiang Pan, Hanwen Kang, Chao Chen
  • for: Exploring the use of Neural Radiance Fields (NeRF) for plant phenotyping in agricultural contexts
  • methods: Utilizes two state-of-the-art NeRF methods, Instant-NGP and Instant-NSR, to synthesize 2D novel-view images and reconstruct 3D crop and plant models
  • results: NeRF achieves commendable performance in synthesizing novel-view images and is competitive with Reality Capture, a leading commercial software for 3D Multi-View Stereo (MVS)-based reconstruction; however, it also shows relatively slow training speeds, performance limitations in cases of insufficient sampling, and challenges in obtaining geometry quality in complex setups
    Abstract Accurate reconstruction of plant phenotypes plays a key role in optimising sustainable farming practices in the field of Precision Agriculture (PA). Currently, optical sensor-based approaches dominate the field, but the need for high-fidelity 3D reconstruction of crops and plants in unstructured agricultural environments remains challenging. Recently, a promising development has emerged in the form of Neural Radiance Field (NeRF), a novel method that utilises neural density fields. This technique has shown impressive performance in various novel vision synthesis tasks, but has remained relatively unexplored in the agricultural context. In our study, we focus on two fundamental tasks within plant phenotyping: (1) the synthesis of 2D novel-view images and (2) the 3D reconstruction of crop and plant models. We explore the world of neural radiance fields, in particular two SOTA methods: Instant-NGP, which excels in generating high-quality images with impressive training and inference speed, and Instant-NSR, which improves the reconstructed geometry by incorporating the Signed Distance Function (SDF) during training. In particular, we present a novel plant phenotype dataset comprising real plant images from production environments. This dataset is a first-of-its-kind initiative aimed at comprehensively exploring the advantages and limitations of NeRF in agricultural contexts. Our experimental results show that NeRF demonstrates commendable performance in the synthesis of novel-view images and is able to achieve reconstruction results that are competitive with Reality Capture, a leading commercial software for 3D Multi-View Stereo (MVS)-based reconstruction. However, our study also highlights certain drawbacks of NeRF, including relatively slow training speeds, performance limitations in cases of insufficient sampling, and challenges in obtaining geometry quality in complex setups.
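    For readers new to NeRF, both Instant-NGP and Instant-NSR build on the standard volume rendering quadrature: a network predicts a density sigma_i and a color c_i at points sampled along each camera ray, and the pixel color is composited as (generic NeRF formulation, not something specific to this paper)

      \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
      \qquad
      T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),

    where delta_i is the spacing between adjacent samples; as noted above, Instant-NSR additionally incorporates a Signed Distance Function (SDF) during training to improve the reconstructed geometry.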

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.04145
  • repo_url: https://github.com/damo-vilab/i2vgen-xl
  • paper_authors: Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
  • for: Improving the semantic accuracy, spatio-temporal continuity, and visual clarity of generated videos
  • methods: Proposes a cascaded I2VGen-XL approach that decouples these factors into two stages and uses static images as a form of crucial guidance to align the input data
  • results: Extensive experiments and comparisons show that I2VGen-XL simultaneously improves the semantic accuracy, continuity of details, and clarity of generated videos
    Abstract Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.

Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN

  • paper_url: http://arxiv.org/abs/2311.04263
  • repo_url: https://github.com/lorenzoagnolucci/keyframes-gan
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: Improving perceptual visual quality in videoconferencing
  • methods: A GAN-based approach that maintains and updates a set of reference keyframes from the higher-quality I-frames, extracts multi-scale features, and combines them progressively according to facial landmarks
  • results: Improves visual quality and generates photo-realistic results even at high compression rates
    Abstract In the latest years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates. Code and pre-trained networks are publicly available at https://github.com/LorenzoAgnolucci/Keyframes-GAN.

Interactive Semantic Map Representation for Skill-based Visual Object Navigation

  • paper_url: http://arxiv.org/abs/2311.04107
  • repo_url: None
  • paper_authors: Tatiana Zemskova, Aleksei Staroverov, Kirill Muravyev, Dmitry Yudin, Aleksandr Panov
  • for: A learning-based visual object navigation method for mobile robots, aimed at improving navigation quality in indoor environments
  • methods: A neural network approach that adjusts the weights of the segmentation model by backpropagating predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence, forming a scene semantic map during embodied interaction
  • results: Intensive experiments in the Habitat environment show a significant improvement in navigation quality metrics over state-of-the-art approaches; the code and custom datasets are publicly available at github.com/AIRI-Institute/skill-fusion
    Abstract Visual object navigation using learning methods is one of the key tasks in mobile robotics. This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment. It is based on a neural network method that adjusts the weights of the segmentation model with backpropagation of the predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence. We have implemented this representation into a full-fledged navigation approach called SkillTron, which can select robot skills from end-to-end policies based on reinforcement learning and classic map-based planning methods. The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation. We conducted intensive experiments with the proposed approach in the Habitat environment, which showed a significant superiority in navigation quality metrics compared to state-of-the-art approaches. The developed code and used custom datasets are publicly available at github.com/AIRI-Institute/skill-fusion.

DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

  • paper_url: http://arxiv.org/abs/2311.04098
  • repo_url: https://github.com/gofigure-lanl/figure-segmentation
  • paper_authors: Kehinde Ajayi, Xin Wei, Martin Gryder, Winston Shields, Jian Wu, Shawn M. Jones, Michal Kucer, Diane Oyen
  • for: Advancing computer vision and natural language processing by exploiting large-scale data from practical applications
  • methods: Builds a large-scale dataset from 14 years of US design patent documents, providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints
  • results: Demonstrates the usefulness of DeepPatent2 with conceptual captioning and shows its potential to facilitate other research areas such as 3D image reconstruction and image retrieval
    Abstract Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks, such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions on sketched images often included in scientific and technical documents. The advancement of other tasks such as 3D reconstruction from 2D images requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset, providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further provide the potential usefulness of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.

Image-Pointcloud Fusion based Anomaly Detection using PD-REAL Dataset

  • paper_url: http://arxiv.org/abs/2311.04095
  • repo_url: None
  • paper_authors: Jianjian Qin, Chunzhi Gu, Jun Yu, Chao Zhang
  • for: Researchers and practitioners in unsupervised anomaly detection (AD) in the 3D domain
  • methods: Introduces PD-REAL, a dataset of Play-Doh models covering 15 object categories and six types of anomalies, captured under different lighting conditions with a RealSense camera; state-of-the-art AD algorithms are evaluated on it to study the benefits and challenges of using 3D information
  • results: The dataset provides a controlled environment for analyzing the potential benefits of 3D information, which can improve anomaly detection compared to 2D-only representations; challenges remain, such as the need for more sophisticated algorithms to handle varying lighting conditions and object orientations
    Abstract We present PD-REAL, a novel large-scale dataset for unsupervised anomaly detection (AD) in the 3D domain. It is motivated by the fact that 2D-only representations in the AD task may fail to capture the geometric structures of anomalies due to uncertainty in lighting conditions or shooting angles. PD-REAL consists entirely of Play-Doh models for 15 object categories and focuses on the analysis of potential benefits from 3D information in a controlled environment. Specifically, objects are first created with six types of anomalies, such as dent, crack, or perforation, and then photographed under different lighting conditions to mimic real-world inspection scenarios. To demonstrate the usefulness of 3D information, we use a commercially available RealSense camera to capture RGB and depth images. Compared to the existing 3D dataset for AD tasks, the data acquisition of PD-REAL is significantly cheaper, easily scalable and easier to control variables. Extensive evaluations with state-of-the-art AD algorithms on our dataset demonstrate the benefits as well as challenges of using 3D information. Our dataset can be downloaded from https://github.com/Andy-cs008/PD-REAL

Proceedings of the 5th International Workshop on Reading Music Systems

  • paper_url: http://arxiv.org/abs/2311.04091
  • repo_url: https://github.com/suziai/gui-tools
  • paper_authors: Jorge Calvo-Zaragoza, Alexander Pacha, Elona Shatri
  • for: Connecting researchers who develop systems for reading music, such as in optical music recognition, with other researchers and practitioners who could benefit from such systems, like librarians or musicologists
  • methods: Covers topics including music reading systems, optical music recognition, image processing on music scores, writer identification, and multimodal systems
  • results: These proceedings collect the papers of the 5th International Workshop on Reading Music Systems, spanning music reading systems, datasets and performance evaluation, image processing, and related topics
    Abstract The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 5th International Workshop on Reading Music Systems, held in Milan, Italy on Nov. 4th 2023.

Restoration of Analog Videos Using Swin-UNet

  • paper_url: http://arxiv.org/abs/2311.04261
  • repo_url: https://github.com/miccunifi/analog-video-restoration
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: Restoring analog videos from historical archives
  • methods: A multi-frame approach that deals with severe tape mistracking
  • results: Effectively restores the original content, tested on real-world videos from a major historical video archive
    Abstract In this paper, we present a system to restore analog videos of historical archives. These videos often contain severe visual degradation due to the deterioration of their tape supports that require costly and slow manual interventions to recover the original content. The proposed system uses a multi-frame approach and is able to deal with severe tape mistracking, which results in completely scrambled frames. Tests on real-world videos from a major historical video archive show the effectiveness of our demo system. The code and the pre-trained model are publicly available at https://github.com/miccunifi/analog-video-restoration.

Learning Super-Resolution Ultrasound Localization Microscopy from Radio-Frequency Data

  • paper_url: http://arxiv.org/abs/2311.04081
  • repo_url: None
  • paper_authors: Christopher Hahne, Georges Chabouh, Olivier Couture, Raphael Sznitman
  • for: Improving the resolution performance of Ultrasound Localization Microscopy (ULM) through precise and efficient target localization
  • methods: Feeds unprocessed radio-frequency (RF) data into a super-resolution network, bypassing delay-and-sum (DAS) beamforming; this requires label projection and inverse point transformation between B-mode and RF coordinate space
  • results: Compared against state-of-the-art techniques on a public dataset, the RF-trained network suggests that excluding DAS beamforming offers great potential to optimize ULM resolution performance
    Abstract Ultrasound Localization Microscopy (ULM) enables imaging of vascular structures in the micrometer range by accumulating contrast agent particle locations over time. Precise and efficient target localization accuracy remains an active research topic in the ULM field to further push the boundaries of this promising medical imaging technology. Existing work incorporates Delay-And-Sum (DAS) beamforming into particle localization pipelines, which ultimately determines the ULM image resolution capability. In this paper we propose to feed unprocessed Radio-Frequency (RF) data into a super-resolution network while bypassing DAS beamforming and its limitations. To facilitate this, we demonstrate label projection and inverse point transformation between B-mode and RF coordinate space as required by our approach. We assess our method against state-of-the-art techniques based on a public dataset featuring in silico and in vivo data. Results from our RF-trained network suggest that excluding DAS beamforming offers a great potential to optimize on the ULM resolution performance.

Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps

  • paper_url: http://arxiv.org/abs/2311.04079
  • repo_url: None
  • paper_authors: Katie Z Luo, Xinshuo Weng, Yan Wang, Shuang Wu, Jie Li, Kilian Q Weinberger, Yue Wang, Marco Pavone
  • for: Real-time lane-topology understanding for autonomous driving
  • methods: Integrates Standard Definition (SD) maps into online map prediction with a Transformer-based encoder, SD Map Encoder Representations from transFormers, that leverages priors in SD maps for lane-topology prediction
  • results: Consistently and significantly boosts lane detection and topology prediction (by up to 60%) on current state-of-the-art online map prediction methods, and can be immediately incorporated into any Transformer-based lane-topology method
    Abstract Autonomous driving has traditionally relied heavily on costly and labor-intensive High Definition (HD) maps, hindering scalability. In contrast, Standard Definition (SD) maps are more affordable and have worldwide coverage, offering a scalable alternative. In this work, we systematically explore the effect of SD maps for real-time lane-topology understanding. We propose a novel framework to integrate SD maps into online map prediction and propose a Transformer-based encoder, SD Map Encoder Representations from transFormers, to leverage priors in SD maps for the lane-topology prediction task. This enhancement consistently and significantly boosts (by up to 60%) lane detection and topology prediction on current state-of-the-art online map prediction methods without bells and whistles and can be immediately incorporated into any Transformer-based lane-topology method. Code is available at https://github.com/NVlabs/SMERF.

Energy-based Calibrated VAE with Test Time Free Lunch

  • paper_url: http://arxiv.org/abs/2311.04071
  • repo_url: None
  • paper_authors: Yihong Luo, Siya Qiu, Xingjian Tao, Yujun Cai, Jing Tang
  • for: Improving the sampling efficiency and generation quality of Variational Autoencoders (VAEs)
  • methods: Uses a Conditional EBM to calibrate the generative direction during training, with no MCMC sampling required at test time
  • results: Proposes an Energy-Calibrated Generative Model that avoids MCMC sampling at both training and test time, extends to calibrating normalizing flows and variational posteriors, and achieves state-of-the-art performance across multiple applications, including zero-shot image restoration
    Abstract In this paper, we propose a novel Energy-Calibrated Generative Model that utilizes a Conditional EBM for enhancing Variational Autoencoders (VAEs). VAEs are sampling efficient but often suffer from blurry generation results due to the lack of training in the generative direction. On the other hand, Energy-Based Models (EBMs) can generate high-quality samples but require expensive Markov Chain Monte Carlo (MCMC) sampling. To address these issues, we introduce a Conditional EBM for calibrating the generative direction during training, without requiring it for test time sampling. Our approach enables the generative model to be trained upon data and calibrated samples with adaptive weight, thereby enhancing efficiency and effectiveness without necessitating MCMC sampling in the inference phase. We also show that the proposed approach can be extended to calibrate normalizing flows and variational posterior. Moreover, we propose to apply the proposed method to zero-shot image restoration via neural transport prior and range-null theory. We demonstrate the effectiveness of the proposed method through extensive experiments in various applications, including image generation and zero-shot image restoration. Our method shows state-of-the-art performance over single-step non-adversarial generation.

LISBET: a self-supervised Transformer model for the automatic segmentation of social behavior motifs

  • paper_url: http://arxiv.org/abs/2311.04069
  • repo_url: None
  • paper_authors: Giuseppe Chindemi, Benoit Girard, Camilla Bellone
  • for: Understanding the core principles of social behavior and identifying potential therapeutic targets for addressing social deficits
  • methods: Introduces LISBET, a model that uses self-supervised learning to detect and quantify social behaviors from dynamic body parts tracking data
  • results: LISBET can be used in a hypothesis-driven mode to automate behavior classification and in a discovery-driven mode to segment social behavior motifs; the recognized motifs closely match human annotations and correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA)
    Abstract Social behavior, defined as the process by which individuals act and react in response to others, is crucial for the function of societies and holds profound implications for mental health. To fully grasp the intricacies of social behavior and identify potential therapeutic targets for addressing social deficits, it is essential to understand its core principles. Although machine learning algorithms have made it easier to study specific aspects of complex behavior, current methodologies tend to focus primarily on single-animal behavior. In this study, we introduce LISBET (seLf-supervIsed Social BEhavioral Transformer), a model designed to detect and segment social interactions. Our model eliminates the need for feature selection and extensive human annotation by using self-supervised learning to detect and quantify social behaviors from dynamic body parts tracking data. LISBET can be used in hypothesis-driven mode to automate behavior classification using supervised finetuning, and in discovery-driven mode to segment social behavior motifs using unsupervised learning. We found that motifs recognized using the discovery-driven approach not only closely match the human annotations but also correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA). We hope LISBET will help the community improve our understanding of social behaviors and their neural underpinnings.

mmFUSION: Multimodal Fusion for 3D Objects Detection

  • paper_url: http://arxiv.org/abs/2311.04058
  • repo_url: None
  • paper_authors: Javed Ahmad, Alessio Del Bue
  • for: Proposes a new intermediate-level multimodal fusion (mmFUSION) approach to address the challenges of multi-sensor fusion in self-driving systems
  • methods: Computes features separately for each sensor with its own encoder, then fuses them through the proposed cross-modality and multi-modality attention mechanisms
  • results: On the KITTI and NuScenes datasets, mmFUSION outperforms early, intermediate, late, and even two-stage fusion schemes, preserving multimodal information and learning to compensate for each modality's deficiencies
    Abstract Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems. Camera and LiDAR are the most commonly used sensors, and usually, their fusion happens at the early or late stages of 3D detectors with the help of regions of interest (RoIs). On the other hand, fusion at the intermediate level is more adaptive because it does not need RoIs from modalities but is complex as the features of both modalities are presented from different points of view. In this paper, we propose a new intermediate-level multi-modal fusion (mmFUSION) approach to overcome these challenges. First, the mmFUSION uses separate encoders for each modality to compute features at a desired lower space volume. Second, these features are fused through cross-modality and multi-modality attention mechanisms proposed in mmFUSION. The mmFUSION framework preserves multi-modal information and learns to complement modalities' deficiencies through attention weights. The strong multi-modal features from the mmFUSION framework are fed to a simple 3D detection head for 3D predictions. We evaluate mmFUSION on the KITTI and NuScenes dataset where it performs better than available early, intermediate, late, and even two-stage based fusion schemes. The code with the mmdetection3D project plugin will be publicly available soon.
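    As an illustrative PyTorch sketch of intermediate-level fusion with cross-modality attention (a generic block with hypothetical names, not the mmFUSION architecture): camera tokens attend to LiDAR tokens and vice versa, and the fused streams would then feed a 3D detection head.

      import torch
      import torch.nn as nn

      class CrossModalFusion(nn.Module):
          def __init__(self, dim, heads=4):
              super().__init__()
              self.cam_to_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.lidar_to_cam = nn.MultiheadAttention(dim, heads, batch_first=True)

          def forward(self, cam_tokens, lidar_tokens):
              # cam_tokens: (B, N_cam, dim); lidar_tokens: (B, N_lidar, dim)
              cam_fused, _ = self.cam_to_lidar(cam_tokens, lidar_tokens, lidar_tokens)
              lidar_fused, _ = self.lidar_to_cam(lidar_tokens, cam_tokens, cam_tokens)
              return torch.cat([cam_fused, lidar_fused], dim=1)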

Generative Structural Design Integrating BIM and Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.04052
  • repo_url: None
  • paper_authors: Zhili He, Yu-Hsing Wang, Jian Zhang
  • for: A comprehensive solution for intelligent structural design using AI, with a focus on improving the perceptual quality and details of generations
  • methods: Introduces building information modeling (BIM) into intelligent structural design and establishes a structural design pipeline integrating BIM and generative AI; proposes a novel 2-stage generation framework, adopts diffusion models (DMs) in place of generative adversarial network (GAN)-based models via a physics-based conditional diffusion model (PCDM), and designs an attention block (AB) consisting of a self-attention block (SAB) and a parallel cross-attention block (PCAB) to facilitate cross-domain data fusion
  • results: Quantitative and qualitative results demonstrate the powerful generation and representation capabilities of PCDM and suggest that DMs have the potential to replace GANs as the new benchmark for generative problems in civil engineering
    Abstract Intelligent structural design using AI can effectively reduce time overhead and increase efficiency. It has potential to become the new design paradigm in the future to assist and even replace engineers, and so it has become a research hotspot in the academic community. However, current methods have some limitations to be addressed, whether in terms of application scope, visual quality of generated results, or evaluation metrics of results. This study proposes a comprehensive solution. Firstly, we introduce building information modeling (BIM) into intelligent structural design and establishes a structural design pipeline integrating BIM and generative AI, which is a powerful supplement to the previous frameworks that only considered CAD drawings. In order to improve the perceptual quality and details of generations, this study makes 3 contributions. Firstly, in terms of generation framework, inspired by the process of human drawing, a novel 2-stage generation framework is proposed to replace the traditional end-to-end framework to reduce the generation difficulty for AI models. Secondly, in terms of generative AI tools adopted, diffusion models (DMs) are introduced to replace widely used generative adversarial network (GAN)-based models, and a novel physics-based conditional diffusion model (PCDM) is proposed to consider different design prerequisites. Thirdly, in terms of neural networks, an attention block (AB) consisting of a self-attention block (SAB) and a parallel cross-attention block (PCAB) is designed to facilitate cross-domain data fusion. The quantitative and qualitative results demonstrate the powerful generation and representation capabilities of PCDM. Necessary ablation studies are conducted to examine the validity of the methods. This study also shows that DMs have the potential to replace GANs and become the new benchmark for generative problems in civil engineering.

3D EAGAN: 3D edge-aware attention generative adversarial network for prostate segmentation in transrectal ultrasound images

  • paper_url: http://arxiv.org/abs/2311.04049
  • repo_url: None
  • paper_authors: Mengqing Liu, Xiao Shao, Liping Jiang, Kaizhi Wu
  • for: The goal of this work is to develop an effective prostate segmentation method for transrectal ultrasound (TRUS) images that improves on existing approaches, which struggle with the inhomogeneous intensity distribution and ambiguous boundaries of the prostate.
  • methods: The method uses a 3D edge-aware attention generative adversarial network (3D EAGAN), consisting of an edge-aware segmentation network (EASNet) and a discriminator that distinguishes predicted prostates from real ones. EASNet is built from an encoder-decoder U-Net backbone, a detail compensation module, four 3D spatial and channel attention modules, an edge enhance module, and a global feature extractor.
  • results: The method achieves accurate prostate segmentation and better exploits edge information and detailed features.
    Abstract Automatic prostate segmentation in TRUS images has always been a challenging problem, since prostates in TRUS images have ambiguous boundaries and inhomogeneous intensity distribution. Although many prostate segmentation methods have been proposed, they still need to be improved due to the lack of sensibility to edge information. Consequently, the objective of this study is to devise a highly effective prostate segmentation method that overcomes these limitations and achieves accurate segmentation of prostates in TRUS images. A 3D edge-aware attention generative adversarial network (3D EAGAN)-based prostate segmentation method is proposed in this paper, which consists of an edge-aware segmentation network (EASNet) that performs the prostate segmentation and a discriminator network that distinguishes predicted prostates from real prostates. The proposed EASNet is composed of an encoder-decoder-based U-Net backbone network, a detail compensation module, four 3D spatial and channel attention modules, an edge enhance module, and a global feature extractor. The detail compensation module is proposed to compensate for the loss of detailed information caused by the down-sampling process of the encoder. The features of the detail compensation module are selectively enhanced by the 3D spatial and channel attention module. Furthermore, an edge enhance module is proposed to guide shallow layers in the EASNet to focus on contour and edge information in prostates. Finally, features from shallow layers and hierarchical features from the decoder module are fused through the global feature extractor to predict the segmentation prostates.
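
The abstract's "3D spatial and channel attention modules" admit a compact CBAM-style sketch. The reduction ratio, kernel size, and module name below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention3D(nn.Module):
    """Assumed CBAM-like 3D attention: channel gating followed by spatial gating."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv3d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                          # channel attention
        avg = x.mean(dim=1, keepdim=True)                    # (B, 1, D, H, W)
        mx, _ = x.max(dim=1, keepdim=True)
        x = x * self.spatial_conv(torch.cat([avg, mx], 1))   # spatial attention
        return x

vol = torch.randn(1, 32, 16, 64, 64)                         # a toy TRUS feature volume
print(SpatialChannelAttention3D(32)(vol).shape)              # torch.Size([1, 32, 16, 64, 64])
```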

Analyzing Near-Infrared Hyperspectral Imaging for Protein Content Regression and Grain Variety Classification Using Bulk References and Varying Grain-to-Background Ratios

  • paper_url: http://arxiv.org/abs/2311.04042
  • repo_url: None
  • paper_authors: Ole-Christian Galbo Engstrøm, Erik Schou Dreier, Birthe Møller Jespersen, Kim Steenstrup Pedersen
  • for: This study uses near-infrared hyperspectral imaging (NIR-HSI) images to calibrate models for two tasks: protein content regression and grain variety classification.
  • methods: Limited protein reference data are expanded by subsampling and associating sub-images with the bulk sample. This introduces significant biases from skewed, leptokurtic prediction distributions that affect both PLS-R and deep CNN models; the authors propose adjustments to mitigate these biases.
  • results: Higher grain-to-background ratios yield more accurate predictions for both tasks, but including lower-ratio images in calibration improves model robustness for such scenarios.
    Abstract Based on previous work, we assess the use of NIR-HSI images for calibrating models on two datasets, focusing on protein content regression and grain variety classification. Limited reference data for protein content is expanded by subsampling and associating it with the bulk sample. However, this method introduces significant biases due to skewed leptokurtic prediction distributions, affecting both PLS-R and deep CNN models. We propose adjustments to mitigate these biases, improving mean protein reference predictions. Additionally, we investigate the impact of grain-to-background ratios on both tasks. Higher ratios yield more accurate predictions, but including lower-ratio images in calibration enhances model robustness for such scenarios.

Data exploitation: multi-task learning of object detection and semantic segmentation on partially annotated data

  • paper_url: http://arxiv.org/abs/2311.04040
  • repo_url: None
  • paper_authors: Hoàng-Ân Lê, Minh-Tan Pham
  • for: This paper studies joint learning from multi-task, partially annotated data for object detection and semantic segmentation, the two most popular vision tasks.
  • methods: The two tasks are combined through multi-task learning and knowledge distillation, and extensive experiments evaluate each task's performance and the tasks' complementarity.
  • results: Experiments show that joint multi-task learning with knowledge distillation outperforms single-task learning, and remains beneficial even when a multi-task network cannot optimize both tasks simultaneously.
    Abstract Multi-task partially annotated data where each data point is annotated for only a single task are potentially helpful for data scarcity if a network can leverage the inter-task relationship. In this paper, we study the joint learning of object detection and semantic segmentation, the two most popular vision problems, from multi-task data with partial annotations. Extensive experiments are performed to evaluate each task performance and explore their complementarity when a multi-task network cannot optimize both tasks simultaneously. We propose employing knowledge distillation to leverage joint-task optimization. The experimental results show favorable results for multi-task learning and knowledge distillation over single-task learning and even full supervision scenario. All code and data splits are available at https://github.com/lhoangan/multas
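
The knowledge-distillation idea referred to above can be sketched generically: a frozen single-task teacher supplies soft targets for samples that lack ground truth for that task. The temperature, weighting, and function name below are illustrative defaults, not the repository's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, T=4.0, alpha=0.5):
    """KL divergence to teacher soft targets; optionally mixed with a hard-label term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    if labels is None:                        # partially annotated: no ground truth here
        return soft
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 21)   # student class logits (e.g. flattened per-pixel predictions)
t = torch.randn(8, 21)   # frozen single-task teacher logits
print(distillation_loss(s, t).item())
```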

Exploring Dataset-Scale Indicators of Data Quality

  • paper_url: http://arxiv.org/abs/2311.04016
  • repo_url: None
  • paper_authors: Benjamin Feuer, Chinmay Hegde
  • for: This paper examines what constitutes data quality for training computer vision foundation models.
  • methods: Two important dataset-level constituents of data quality are ablated: label set design and class balance.
  • results: By monitoring these constituents with the proposed key indicators, researchers and practitioners can better anticipate model accuracy and robustness to distribution shifts.
    Abstract Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
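
As one concrete example in the spirit of the abstract's class-balance constituent, a dataset-level indicator can be as simple as the normalized entropy of the label histogram (1.0 means perfectly balanced). This particular formula is an assumption for illustration; the paper's own indicators may differ.

```python
import numpy as np

def class_balance(labels: np.ndarray, num_classes: int) -> float:
    """Normalized entropy of the class histogram; 1.0 = uniform, 0.0 = single class."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(num_classes))

labels = np.random.randint(0, 10, size=50_000)   # roughly balanced toy labels
print(class_balance(labels, 10))                 # close to 1.0
```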

AGNES: Abstraction-guided Framework for Deep Neural Networks Security

  • paper_url: http://arxiv.org/abs/2311.04009
  • repo_url: None
  • paper_authors: Akshay Dhonthi, Marcello Eiermann, Ernst Moritz Hahn, Vahid Hashemi
  • for: This paper aims to detect backdoors in Deep Neural Networks (DNNs) for image recognition, which is essential for safety-critical applications such as traffic sign classification.
  • methods: The paper introduces AGNES, a tool for detecting backdoors in DNNs, and discusses the principle approach on which it is based.
  • results: Across multiple relevant case studies, AGNES outperforms many state-of-the-art methods for backdoor detection.
    Abstract Deep Neural Networks (DNNs) are becoming widespread, particularly in safety-critical areas. One prominent application is image recognition in autonomous driving, where the correct classification of objects, such as traffic signs, is essential for safe driving. Unfortunately, DNNs are prone to backdoors, meaning that they concentrate on attributes of the image that should be irrelevant for their correct classification. Backdoors are integrated into a DNN during training, either with malicious intent (such as a manipulated training process, because of which a yellow sticker always leads to a traffic sign being recognised as a stop sign) or unintentional (such as a rural background leading to any traffic sign being recognised as animal crossing, because of biased training data). In this paper, we introduce AGNES, a tool to detect backdoors in DNNs for image recognition. We discuss the principle approach on which AGNES is based. Afterwards, we show that our tool performs better than many state-of-the-art methods for multiple relevant case studies.

Bias and Diversity in Synthetic-based Face Recognition

  • paper_url: http://arxiv.org/abs/2311.03970
  • repo_url: None
  • paper_authors: Marco Huber, Anh Thi Luu, Fadi Boutros, Arjan Kuijper, Naser Damer
  • for: This work investigates how the diversity of synthetic face recognition datasets compares to authentic datasets, how the training data distribution of the generative models affects the synthetic data, and how biased the resulting models are with respect to different attributes.
  • methods: Three recent generative models are analyzed together with their training data, looking at the distribution of the generated face data over gender, ethnicity, age, and head position, and comparing the bias of synthetic-based recognition models against a baseline trained on authentic data.
  • results: The generators produce distributions similar to their training data, and synthetic-based models share a bias behavior similar to authentic-based models, although the lower intra-identity attribute consistency observed in the synthetic data appears to help reduce bias.
    Abstract Synthetic data is emerging as a substitute for authentic data to solve ethical and legal challenges in handling authentic face data. The current models can create real-looking face images of people who do not exist. However, it is a known and sensitive problem that face recognition systems are susceptible to bias, i.e. performance differences between different demographic and non-demographic attributes, which can lead to unfair decisions. In this work, we investigate how the diversity of synthetic face recognition datasets compares to authentic datasets, and how the distribution of the training data of the generative models affects the distribution of the synthetic data. To do this, we looked at the distribution of gender, ethnicity, age, and head position. Furthermore, we investigated the concrete bias of three recent synthetic-based face recognition models on the studied attributes in comparison to a baseline model trained on authentic data. Our results show that the generators produce a distribution similar to that of the training data used, in terms of the different attributes. With regard to bias, it can be seen that the synthetic-based models share a similar bias behavior with the authentic-based models. However, the lower intra-identity attribute consistency uncovered in the synthetic data seems to be beneficial in reducing bias.

CeCNN: Copula-enhanced convolutional neural networks in joint prediction of refraction error and axial length based on ultra-widefield fundus images

  • paper_url: http://arxiv.org/abs/2311.03967
  • repo_url: None
  • paper_authors: Chong Zhong, Yang Li, Danjuan Yang, Meiyan Li, Xingyao Zhou, Bo Fu, Catherine C. Liu, A. H. Welsh
  • for: This study applies deep learning to ultra-widefield fundus images for the joint prediction of spherical equivalent and axial length, framing myopia assessment as a multivariate response task to improve prediction accuracy.
  • methods: A copula-enhanced convolutional neural network (CeCNN) is proposed, which models the dependence between the responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and a copula-likelihood loss on top of backbone CNNs.
  • results: Adding the dependency information to the backbone models improves prediction accuracy, and the approach applies beyond the UWF setting and beyond ResNet and LeNet backbones.
    Abstract Ultra-widefield (UWF) fundus images are replacing traditional fundus images in screening, detection, prediction, and treatment of complications related to myopia because their much broader visual range is advantageous for highly myopic eyes. Spherical equivalent (SE) is extensively used as the main myopia outcome measure, and axial length (AL) has drawn increasing interest as an important ocular component for assessing myopia. Cutting-edge studies show that SE and AL are strongly correlated. Using the joint information from SE and AL is potentially better than using either separately. In the deep learning community, though there is research on multiple-response tasks with a 3D image biomarker, dependence among responses is only sporadically taken into consideration. Inspired by the spirit that information extracted from the data by statistical methods can improve the prediction accuracy of deep learning models, we formulate a class of multivariate response regression models with a higher-order tensor biomarker, for the bivariate tasks of regression-classification and regression-regression. Specifically, we propose a copula-enhanced convolutional neural network (CeCNN) framework that incorporates the dependence between responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and uses the induced copula-likelihood loss with the backbone CNNs. We establish the statistical framework and algorithms for the aforementioned two bivariate tasks. We show that the CeCNN has better prediction accuracy after adding the dependency information to the backbone models. The modeling and the proposed CeCNN algorithm are applicable beyond the UWF scenario and can be effective with other backbones beyond ResNet and LeNet.
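
For the regression-regression case described above, a Gaussian copula with Gaussian margins reduces to a bivariate normal negative log-likelihood, which gives a rough sense of what a copula-likelihood loss looks like. The fixed sigmas and rho below are placeholders (the paper estimates the dependence from a warm-up CNN), so this is a sketch rather than the CeCNN loss itself.

```python
import math
import torch

def bivariate_gaussian_nll(pred, target, sigma1=1.0, sigma2=1.0, rho=0.5):
    """NLL of a bivariate normal on the two responses (e.g. SE and AL); pred, target: (B, 2)."""
    z1 = (target[:, 0] - pred[:, 0]) / sigma1
    z2 = (target[:, 1] - pred[:, 1]) / sigma2
    quad = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (1 - rho ** 2)
    log_norm = math.log(2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho ** 2))
    return (0.5 * quad + log_norm).mean()

pred = torch.randn(16, 2)                     # CNN outputs: (SE, AL)
target = pred + 0.1 * torch.randn(16, 2)      # toy targets close to the predictions
print(bivariate_gaussian_nll(pred, target).item())
```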

Fast Sun-aligned Outdoor Scene Relighting based on TensoRF

  • paper_url: http://arxiv.org/abs/2311.03965
  • repo_url: None
  • paper_authors: Yeonjin Chang, Yearim Kim, Seunghyeon Seo, Jung Yi, Nojun Kwak
  • for: This paper proposes Sun-aligned Relighting TensoRF (SR-TensoRF), an outdoor scene relighting method for Neural Radiance Fields (NeRF).
  • methods: SR-TensoRF uses the sun direction as a direct input to shadow generation, which simplifies the inference requirements and removes the need for environment maps, while leveraging the training efficiency of TensoRF through a proposed cubemap concept.
  • results: The method produces high-quality shadows and renderings, and is faster in both training and rendering than existing methods.
    Abstract In this work, we introduce our method of outdoor scene relighting for Neural Radiance Fields (NeRF) named Sun-aligned Relighting TensoRF (SR-TensoRF). SR-TensoRF offers a lightweight and rapid pipeline aligned with the sun, thereby achieving a simplified workflow that eliminates the need for environment maps. Our sun-alignment strategy is motivated by the insight that shadows, unlike viewpoint-dependent albedo, are determined by light direction. We directly use the sun direction as an input during shadow generation, simplifying the requirements of the inference process significantly. Moreover, SR-TensoRF leverages the training efficiency of TensoRF by incorporating our proposed cubemap concept, resulting in notable acceleration in both training and rendering processes compared to existing methods.

Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

  • paper_url: http://arxiv.org/abs/2311.03964
  • repo_url: https://github.com/ugorsahin/Generative-Negative-Mining
  • paper_authors: Ugur Sahin, Hang Li, Qadeer Khan, Daniel Cremers, Volker Tresp
  • for: Improving the performance of large-scale visual language models (VLMs) on multimodal compositional reasoning tasks.
  • methods: A framework is proposed that not only mines negatives in both directions but also generates challenging negative samples in both modalities, i.e., images and texts.
  • results: Leveraging these generated hard negative samples significantly enhances VLM performance on multimodal compositional reasoning tasks.
    Abstract Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation 2) But existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
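
The use of generated hard negatives in both modalities can be folded into a standard InfoNCE-style objective, as in the sketch below. The batch construction, temperature, and function name are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img, txt, hard_img, hard_txt, tau=0.07):
    """InfoNCE where generated hard negatives are appended to the candidate sets."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    hard_img, hard_txt = F.normalize(hard_img, dim=-1), F.normalize(hard_txt, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)
    # image -> text: in-batch texts plus generated hard negative texts
    logits_i2t = img @ torch.cat([txt, hard_txt], dim=0).t() / tau
    # text -> image: in-batch images plus generated hard negative images
    logits_t2i = txt @ torch.cat([img, hard_img], dim=0).t() / tau
    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))

B, D = 8, 256
loss = contrastive_with_hard_negatives(torch.randn(B, D), torch.randn(B, D),
                                        torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```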

Improving the Effectiveness of Deep Generative Data

  • paper_url: http://arxiv.org/abs/2311.03959
  • repo_url: None
  • paper_authors: Ruyu Wang, Sabrina Schmedding, Marco F. Huber
  • for: This study examines why synthetic images from deep generative models underperform in downstream image processing tasks and proposes a new taxonomy of the contributing factors.
  • methods: High-fidelity images generated by GANs and diffusion probabilistic models (DPMs) are analyzed, with extensive experiments on the CIFAR-10 dataset.
  • results: The Content Gap accounts for a large portion of the performance drop, and the proposed strategies for better utilizing synthetic images outperform baselines in both the Synthetic-to-Real and Data Augmentation settings, particularly when data are scarce.
    Abstract Recent deep generative models (DGMs) such as generative adversarial networks (GANs) and diffusion probabilistic models (DPMs) have shown their impressive ability in generating high-fidelity photorealistic images. Although looking appealing to human eyes, training a model on purely synthetic images for downstream image processing tasks like image classification often results in an undesired performance drop compared to training on real data. Previous works have demonstrated that enhancing a real dataset with synthetic images from DGMs can be beneficial. However, the improvements were subjected to certain circumstances and yet were not comparable to adding the same number of real images. In this work, we propose a new taxonomy to describe factors contributing to this commonly observed phenomenon and investigate it on the popular CIFAR-10 dataset. We hypothesize that the Content Gap accounts for a large portion of the performance drop when using synthetic images from DGM and propose strategies to better utilize them in downstream tasks. Extensive experiments on multiple datasets showcase that our method outperforms baselines on downstream classification tasks both in case of training on synthetic only (Synthetic-to-Real) and training on a mix of real and synthetic data (Data Augmentation), particularly in the data-scarce scenario.

CLIP Guided Image-perceptive Prompt Learning for Image Enhancement

  • paper_url: http://arxiv.org/abs/2311.03943
  • repo_url: None
  • paper_authors: Zinuo Li, Qiuhong Ke, Weiwen Chen
  • for: This paper proposes an image enhancement method based on Contrastive Language-Image Pre-Training (CLIP) guided prompt learning.
  • methods: Image-perceptive prompts are learned with the CLIP model and used like a loss function to steer an enhancement network that predicts the weights of three look-up tables (LUTs).
  • results: Incorporating CLIP's prior knowledge yields satisfactory results with a simpler and more efficient pipeline than conventional LUT methods.
    Abstract Image enhancement is a significant research area in the fields of computer vision and image processing. In recent years, many learning-based methods for image enhancement have been developed, where the Look-up-table (LUT) has proven to be an effective tool. In this paper, we delve into the potential of Contrastive Language-Image Pre-Training (CLIP) Guided Prompt Learning, proposing a simple structure called CLIP-LUT for image enhancement. We found that the prior knowledge of CLIP can effectively discern the quality of degraded images, which can provide reliable guidance. To be specific, we first learn image-perceptive prompts to distinguish between original and target images using the CLIP model; meanwhile, we introduce a very simple network, built on a simple baseline, that predicts the weights of three different LUTs as the enhancement network. The obtained prompts are used to steer the enhancement network like a loss function and improve the performance of the model. We demonstrate that by simply combining a straightforward method with CLIP, we can obtain satisfactory results.
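
One way to read "prompts used to steer the enhancement network like a loss function" is a CLIP-space classification loss between a "good" and a "degraded" prompt embedding, sketched below with embeddings assumed to come precomputed from a frozen CLIP text and image encoder. The names and the two-prompt setup are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_prompt_loss(image_emb, pos_prompt_emb, neg_prompt_emb, tau=0.07):
    """Push enhanced-image embeddings toward the positive prompt, away from the negative one."""
    image_emb = F.normalize(image_emb, dim=-1)                                    # (B, D)
    prompts = F.normalize(torch.stack([pos_prompt_emb, neg_prompt_emb]), dim=-1)  # (2, D)
    logits = image_emb @ prompts.t() / tau                                        # (B, 2)
    target = torch.zeros(image_emb.size(0), dtype=torch.long, device=image_emb.device)
    return F.cross_entropy(logits, target)    # class 0 = "enhanced / good"

img_emb = torch.randn(4, 512)                 # hypothetical CLIP embeddings of enhanced outputs
pos, neg = torch.randn(512), torch.randn(512) # hypothetical learned prompt embeddings
print(clip_prompt_loss(img_emb, pos, neg).item())
```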

Analysis of NaN Divergence in Training Monocular Depth Estimation Model

  • paper_url: http://arxiv.org/abs/2311.03938
  • repo_url: None
  • paper_authors: Bum Jun Kim, Hyeonah Jang, Sang Woo Kim
  • for: Improving the optimization stability and accuracy of monocular depth estimation models.
  • methods: An in-depth analysis of NaN loss during training of monocular depth estimation networks identifies three vulnerabilities: unstable gradients from square root losses, numerical stability issues in the log-sigmoid function, and incorrect computations in certain variance implementations.
  • results: Following the proposed guidelines improves both optimization stability and monocular depth estimation performance.
    Abstract The latest advances in deep learning have facilitated the development of highly accurate monocular depth estimation models. However, when training a monocular depth estimation network, practitioners and researchers have observed not a number (NaN) loss, which disrupts gradient descent optimization. Although several practitioners have reported the stochastic and mysterious occurrence of NaN loss that bothers training, its root cause is not discussed in the literature. This study conducted an in-depth analysis of NaN loss during training a monocular depth estimation network and identified three types of vulnerabilities that cause NaN loss: 1) the use of square root loss, which leads to an unstable gradient; 2) the log-sigmoid function, which exhibits numerical stability issues; and 3) certain variance implementations, which yield incorrect computations. Furthermore, for each vulnerability, the occurrence of NaN loss was demonstrated and practical guidelines to prevent NaN loss were presented. Experiments showed that both optimization stability and performance on monocular depth estimation could be improved by following our guidelines.
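
Two of the three vulnerabilities have simple, well-known remedies, sketched below: clamping the argument of a square-root loss away from zero and using a numerically stable log-sigmoid. These are generic fixes consistent with the abstract's guidelines, not a verbatim excerpt of them.

```python
import torch
import torch.nn.functional as F

def stable_sqrt_loss(pred, target, eps=1e-6):
    # d(sqrt(x))/dx = 1/(2*sqrt(x)) blows up as x -> 0; clamping keeps the gradient finite.
    return torch.sqrt(torch.clamp((pred - target) ** 2, min=eps)).mean()

def stable_log_sigmoid(x):
    # F.logsigmoid avoids the overflow/underflow of log(sigmoid(x)) for large |x|.
    return F.logsigmoid(x)

p = torch.zeros(4, requires_grad=True)
t = torch.zeros(4)
stable_sqrt_loss(p, t).backward()
print(p.grad)                                                # finite, no NaN
print(stable_log_sigmoid(torch.tensor([-100.0, 0.0, 100.0])))
```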

FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.03912
  • repo_url: https://github.com/shadowpa0327/flora
  • paper_authors: Chi-Chih Chang, Yuan-Yao Sung, Shixing Yu, Ning-Chi Huang, Diana Marculescu, Kai-Chiang Wu
  • for: This paper proposes an automated framework for searching low-rank configurations of Vision Transformers (ViT) to reduce their computational cost.
  • methods: Drawing on the similarity and alignment between rank selection and One-Shot NAS, the two are unified in a single framework (FLORA). It uses low-rank aware candidate filtering to eliminate underperforming candidates, alleviating undertraining and interference among subnetworks, together with a low-rank specific training paradigm.
  • results: The method automatically generates fine-grained rank configurations that yield up to 33% extra FLOPs reduction over a simple uniform configuration. Specifically, FLORA-DeiT-B/FLORA-Swin-B save up to 55%/42% FLOPs almost without performance degradation, and integration with mainstream compression techniques or compact structures brings further FLOPs savings.
    Abstract Vision Transformers (ViT) have recently demonstrated success across a myriad of computer vision tasks. However, their elevated computational demands pose significant challenges for real-world deployment. While low-rank approximation stands out as a renowned method to reduce computational loads, efficiently automating the target rank selection in ViT remains a challenge. Drawing from the notable similarity and alignment between the processes of rank selection and One-Shot NAS, we introduce FLORA, an end-to-end automatic framework based on NAS. To overcome the design challenge of supernet posed by vast search space, FLORA employs a low-rank aware candidate filtering strategy. This method adeptly identifies and eliminates underperforming candidates, effectively alleviating potential undertraining and interference among subnetworks. To further enhance the quality of low-rank supernets, we design a low-rank specific training paradigm. First, we propose weight inheritance to construct supernet and enable gradient sharing among low-rank modules. Secondly, we adopt low-rank aware sampling to strategically allocate training resources, taking into account inherited information from pre-trained models. Empirical results underscore FLORA's efficacy. With our method, a more fine-grained rank configuration can be generated automatically and yield up to 33% extra FLOPs reduction compared to a simple uniform configuration. More specific, FLORA-DeiT-B/FLORA-Swin-B can save up to 55%/42% FLOPs almost without performance degradtion. Importantly, FLORA boasts both versatility and orthogonality, offering an extra 21%-26% FLOPs reduction when integrated with leading compression techniques or compact hybrid structures. Our code is publicly available at https://github.com/shadowpa0327/FLORA.
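
The basic operation whose per-layer rank FLORA searches over, replacing a linear layer's weight with a rank-r factorization, can be sketched with a truncated SVD as below. The helper name and the fixed rank are illustrative; FLORA itself selects ranks through its supernet search rather than this direct decomposition.

```python
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace y = x W^T + b with y = B(A(x)) using a rank-r truncated SVD of W."""
    W = layer.weight.data                                   # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = nn.Linear(W.shape[1], rank, bias=False)
    B = nn.Linear(rank, W.shape[0], bias=layer.bias is not None)
    A.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()   # (rank, in)
    B.weight.data = U[:, :rank].contiguous()                          # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

fc = nn.Linear(768, 3072)                    # a ViT-sized MLP projection, for illustration
approx = low_rank_linear(fc, rank=128)
x = torch.randn(4, 768)
print((fc(x) - approx(x)).abs().max())       # approximation error; shrinks as rank grows
```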

RobustMat: Neural Diffusion for Street Landmark Patch Matching under Challenging Environments

  • paper_url: http://arxiv.org/abs/2311.03904
  • repo_url: https://github.com/ai-it-avs/robustmat
  • paper_authors: Rui She, Qiyu Kang, Sijie Wang, Yuan-Rui Yang, Kai Zhao, Yang Song, Wee Peng Tay
  • for: The paper is written for the task of matching landmark patches in street scenes for autonomous vehicles (AVs), under challenging driving environments caused by changing seasons, weather, and illumination.
  • methods: The paper proposes an approach called RobustMat, which uses a convolutional neural ODE diffusion module to learn the feature representation for landmark patches, and a graph neural PDE diffusion module to aggregate information from neighboring landmark patches.
  • results: The paper demonstrates state-of-the-art matching results under environmental perturbations, using several street scene datasets.
    Abstract For autonomous vehicles (AVs), visual perception techniques based on sensors like cameras play crucial roles in information acquisition and processing. In various computer perception tasks for AVs, it may be helpful to match landmark patches taken by an onboard camera with other landmark patches captured at a different time or saved in a street scene image database. To perform matching under challenging driving environments caused by changing seasons, weather, and illumination, we utilize the spatial neighborhood information of each patch. We propose an approach, named RobustMat, which derives its robustness to perturbations from neural differential equations. A convolutional neural ODE diffusion module is used to learn the feature representation for the landmark patches. A graph neural PDE diffusion module then aggregates information from neighboring landmark patches in the street scene. Finally, feature similarity learning outputs the final matching score. Our approach is evaluated on several street scene datasets and demonstrated to achieve state-of-the-art matching results under environmental perturbations.

MeVGAN: GAN-based Plugin Model for Video Generation with Applications in Colonoscopy

  • paper_url: http://arxiv.org/abs/2311.03884
  • repo_url: None
  • paper_authors: Łukasz Struski, Tomasz Urbańczyk, Krzysztof Bucki, Bartłomiej Cupiał, Aneta Kaczyńska, Przemysław Spurek, Jacek Tabor
  • for: This paper proposes an efficient video generation model that can produce high-resolution video data, with applications in medicine.
  • methods: The model uses a plugin-type architecture: a pre-trained 2D image GAN is kept fixed, and only a simple neural network is added to construct trajectories in the noise space, so that a trajectory forwarded through the GAN produces a realistic video.
  • results: The model generates good-quality synthetic colonoscopy videos that could be used in virtual simulators to help train young colonoscopists in this important medical procedure.
    Abstract Video generation is important, especially in medicine, as much data is given in this form. However, video generation of high-resolution data is a very demanding task for generative models, due to the large need for memory. In this paper, we propose Memory Efficient Video GAN (MeVGAN) - a Generative Adversarial Network (GAN) which uses plugin-type architecture. We use a pre-trained 2D-image GAN and only add a simple neural network to construct respective trajectories in the noise space, so that the trajectory forwarded through the GAN model constructs a real-life video. We apply MeVGAN in the task of generating colonoscopy videos. Colonoscopy is an important medical procedure, especially beneficial in screening and managing colorectal cancer. However, because colonoscopy is difficult and time-consuming to learn, colonoscopy simulators are widely used in educating young colonoscopists. We show that MeVGAN can produce good quality synthetic colonoscopy videos, which can be potentially used in virtual simulators.
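
The plugin idea, a small added network that traces a trajectory in the latent space of a frozen, pre-trained 2D image GAN so that decoding each point yields a video frame, is sketched below. The architecture, latent size, and the commented decoding step are assumptions, not MeVGAN's actual design.

```python
import torch
import torch.nn as nn

class NoiseTrajectory(nn.Module):
    """Maps a start code z0 and a frame index t to a latent trajectory (assumed design)."""
    def __init__(self, z_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_dim))

    def forward(self, z0: torch.Tensor, num_frames: int) -> torch.Tensor:
        t = torch.linspace(0, 1, num_frames, device=z0.device)          # normalized frame index
        z0_rep = z0.unsqueeze(1).expand(-1, num_frames, -1)             # (B, T, z_dim)
        inp = torch.cat([z0_rep, t.view(1, -1, 1).expand(z0.size(0), -1, 1)], dim=-1)
        return z0_rep + self.net(inp)                                   # offsets from the start code

z0 = torch.randn(2, 128)
traj = NoiseTrajectory()(z0, num_frames=16)
print(traj.shape)   # torch.Size([2, 16, 128])
# frames = frozen_image_gan(traj.reshape(-1, 128))  # hypothetical: decode each latent to a frame
```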

A Comparative Study of Knowledge Transfer Methods for Misaligned Urban Building Labels

  • paper_url: http://arxiv.org/abs/2311.03867
  • repo_url: None
  • paper_authors: Bipul Neupane, Jagannath Aryal, Abbas Rajabifard
  • for: Addressing the misalignment issue in Earth observation (EO) images and building labels to train accurate convolutional neural networks (CNNs) for semantic segmentation of building footprints.
  • methods: Comparative study of three Teacher-Student knowledge transfer methods: supervised domain adaptation (SDA), knowledge distillation (KD), and deep mutual learning (DML).
  • results: SDA is the most effective method to address the misalignment problem, while KD and DML can efficiently compress network size without significant loss in performance. The 158 experiments and datasets developed in this study will be valuable for minimising misaligned labels.
    Abstract Misalignment in Earth observation (EO) images and building labels impact the training of accurate convolutional neural networks (CNNs) for semantic segmentation of building footprints. Recently, three Teacher-Student knowledge transfer methods have been introduced to address this issue: supervised domain adaptation (SDA), knowledge distillation (KD), and deep mutual learning (DML). However, these methods are merely studied for different urban buildings (low-rise, mid-rise, high-rise, and skyscrapers), where misalignment increases with building height and spatial resolution. In this study, we present a workflow for the systematic comparative study of the three methods. The workflow first identifies the best (with the highest evaluation scores) hyperparameters, lightweight CNNs for the Student (among 43 CNNs from Computer Vision), and encoder-decoder networks (EDNs) for both Teachers and Students. Secondly, three building footprint datasets are developed to train and evaluate the identified Teachers and Students in the three transfer methods. The results show that U-Net with VGG19 (U-VGG19) is the best Teacher, and U-EfficientNetv2B3 and U-EfficientNet-lite0 are among the best Students. With these Teacher-Student pairs, SDA could yield upto 0.943, 0.868, 0.912, and 0.697 F1 scores in the low-rise, mid-rise, high-rise, and skyscrapers respectively. KD and DML provide model compression of upto 82%, despite marginal loss in performance. This new comparison concludes that SDA is the most effective method to address the misalignment problem, while KD and DML can efficiently compress network size without significant loss in performance. The 158 experiments and datasets developed in this study will be valuable to minimise the misaligned labels.

SCONE-GAN: Semantic Contrastive learning-based Generative Adversarial Network for an end-to-end image translation

  • paper_url: http://arxiv.org/abs/2311.03866
  • repo_url: None
  • paper_authors: Iman Abbasnejad, Fabio Zambetta, Flora Salim, Timothy Wiley, Jeffrey Chan, Russell Gallagher, Ehsan Abbasnejad
  • for: This paper studies end-to-end image translation for generating more realistic and diverse scenery images.
  • methods: A GAN-based image translation approach that uses graph convolutional networks to learn object dependencies, maintaining image structure and semantics during translation, and introduces a style reference image to increase the diversity of the generated images by maximizing mutual information between the style image and the output.
  • results: The method is validated on four datasets; qualitative and quantitative results show that it generates more realistic and diverse scenery images.
    Abstract SCONE-GAN presents an end-to-end image translation, which is shown to be effective for learning to generate realistic and diverse scenery images. Most current image-to-image translation approaches are devised as two mappings: a translation from the source to target domain and another to represent its inverse. While successful in many applications, these approaches may suffer from generating trivial solutions with limited diversity. That is because these methods learn more frequent associations rather than the scene structures. To mitigate the problem, we propose SCONE-GAN that utilises graph convolutional networks to learn the objects dependencies, maintain the image structure and preserve its semantics while transferring images into the target domain. For more realistic and diverse image generation we introduce style reference image. We enforce the model to maximize the mutual information between the style image and output. The proposed method explicitly maximizes the mutual information between the related patches, thus encouraging the generator to produce more diverse images. We validate the proposed algorithm for image-to-image translation and stylizing outdoor images. Both qualitative and quantitative results demonstrate the effectiveness of our approach on four dataset.

GC-VTON: Predicting Globally Consistent and Occlusion Aware Local Flows with Neighborhood Integrity Preservation for Virtual Try-on

  • paper_url: http://arxiv.org/abs/2311.04932
  • repo_url: None
  • paper_authors: Hamza Rawal, Muhammad Junaid Ahmad, Farooq Zaman
  • for: This work aims to improve garment warping in image-based virtual try-on networks.
  • methods: The global boundary alignment and local texture preservation tasks are disentangled into GlobalNet and LocalNet modules linked by a consistency loss. A predicted body-part visibility mask masks out occluded regions so that LocalNet does not distort texture to compensate for occlusions, and a novel regularization loss (NIPR) penalizes flows in regions where texture integrity is violated.
  • results: Experiments on a widely used virtual try-on dataset show strong performance compared to current state-of-the-art methods.
    Abstract Flow based garment warping is an integral part of image-based virtual try-on networks. However, optimizing a single flow predicting network for simultaneous global boundary alignment and local texture preservation results in sub-optimal flow fields. Moreover, dense flows are inherently not suited to handle intricate conditions like garment occlusion by body parts or by other garments. Forcing flows to handle the above issues results in various distortions like texture squeezing, and stretching. In this work, we propose a novel approach where we disentangle the global boundary alignment and local texture preserving tasks via our GlobalNet and LocalNet modules. A consistency loss is then employed between the two modules which harmonizes the local flows with the global boundary alignment. Additionally, we explicitly handle occlusions by predicting body-parts visibility mask, which is used to mask out the occluded regions in the warped garment. The masking prevents the LocalNet from predicting flows that distort texture to compensate for occlusions. We also introduce a novel regularization loss (NIPR), that defines a criteria to identify the regions in the warped garment where texture integrity is violated (squeezed or stretched). NIPR subsequently penalizes the flow in those regions to ensure regular and coherent warps that preserve the texture in local neighborhoods. Evaluation on a widely used virtual try-on dataset demonstrates strong performance of our network compared to the current SOTA methods.

Multi-view Information Integration and Propagation for Occluded Person Re-identification

  • paper_url: http://arxiv.org/abs/2311.03828
  • repo_url: https://github.com/nengdong96/mviip
  • paper_authors: Neng Dong, Shuanglin Yan, Hao Tang, Jinhui Tang, Liyan Zhang
  • for: Improving the accuracy and robustness of occluded person re-identification by using multi-view images to reduce the influence of occlusion noise.
  • methods: A Multi-view Information Integration and Propagation (MVI$^{2}$P) framework integrates feature maps of multi-view images into a comprehensive representation, using a CAMs-aware Localization module, a probability-aware Quantification module, and an Information Propagation mechanism that distills knowledge into the representation of a single occluded image.
  • results: Extensive experiments and analyses demonstrate the effectiveness and superiority of MVI$^{2}$P for occluded person re-identification.
    Abstract Occluded person re-identification (re-ID) presents a challenging task due to occlusion perturbations. Although great efforts have been made to prevent the model from being disturbed by occlusion noise, most current solutions only capture information from a single image, disregarding the rich complementary information available in multiple images depicting the same pedestrian. In this paper, we propose a novel framework called Multi-view Information Integration and Propagation (MVI$^{2}$P). Specifically, realizing the potential of multi-view images in effectively characterizing the occluded target pedestrian, we integrate feature maps of which to create a comprehensive representation. During this process, to avoid introducing occlusion noise, we develop a CAMs-aware Localization module that selectively integrates information contributing to the identification. Additionally, considering the divergence in the discriminative nature of different images, we design a probability-aware Quantification module to emphatically integrate highly reliable information. Moreover, as multiple images with the same identity are not accessible in the testing stage, we devise an Information Propagation (IP) mechanism to distill knowledge from the comprehensive representation to that of a single occluded image. Extensive experiments and analyses have unequivocally demonstrated the effectiveness and superiority of the proposed MVI$^{2}$P. The code will be released at \url{https://github.com/nengdong96/MVIIP}.

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

  • paper_url: http://arxiv.org/abs/2311.03799
  • repo_url: https://github.com/caoyichao/unihoi
  • paper_authors: Yichao Cao, Qingfei Tang, Xiu Su, Chen Song, Shan You, Xiaobo Lu, Chang Xu
  • for: This work targets open-world human-object interaction (HOI) detection, using vision-language foundation models and large language models (LLMs) to handle complex interactions.
  • methods: The method combines vision-language foundation models with an LLM and proposes HO prompt-based learning, including an HO Prompt-guided Decoder (HOPD), to associate high-level relation representations with different HO pairs.
  • results: UniHOI substantially improves open-world HOI detection and supports open-category interaction recognition from either interaction phrases or interpretive sentences.
    Abstract Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting $\langle$human, action, object$\rangle$ triplets, and serving as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open world context. This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed as \emph{\textbf{UniHOI}}. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize an LLM (\emph{i.e.} GPT) for interaction interpretation, generating a richer linguistic understanding for complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods with a substantial margin, under both supervised and zero-shot settings. The code and pre-trained weights are available at: \url{https://github.com/Caoyichao/UniHOI}.

Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization

  • paper_url: http://arxiv.org/abs/2311.03785
  • repo_url: None
  • paper_authors: Cam-Van Thi Nguyen, Ngoc-Hoa Thi Nguyen, Duc-Trong Le, Quang-Thuy Ha
  • for: This work aims to improve feature representation learning in multimodal learning so as to better capture the relationships between modalities.
  • methods: Self-MI trains in a self-supervised fashion, using Contrastive Predictive Coding (CPC) as an auxiliary technique to maximize the mutual information between unimodal inputs and the multimodal fusion result, together with a label generation module ($ULG_{MI}$) that creates informative labels for each modality.
  • results: Experiments on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and SIMS) show that Self-MI significantly improves the multimodal fusion task.
    Abstract Multimodal representation learning poses significant challenges in capturing informative and distinct features from multiple modalities. Existing methods often struggle to exploit the unique characteristics of each modality due to unified multimodal annotations. In this study, we propose Self-MI in the self-supervised learning fashion, which also leverage Contrastive Predictive Coding (CPC) as an auxiliary technique to maximize the Mutual Information (MI) between unimodal input pairs and the multimodal fusion result with unimodal inputs. Moreover, we design a label generation module, $ULG_{MI}$ for short, that enables us to create meaningful and informative labels for each modality in a self-supervised manner. By maximizing the Mutual Information, we encourage better alignment between the multimodal fusion and the individual modalities, facilitating improved multimodal fusion. Extensive experiments on three benchmark datasets including CMU-MOSI, CMU-MOSEI, and SIMS, demonstrate the effectiveness of Self-MI in enhancing the multimodal fusion task.
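
The CPC-style mutual-information term can be written as the usual InfoNCE lower bound between a unimodal representation and the fused multimodal representation, as sketched below. The temperature and normalization are illustrative choices rather than Self-MI's exact objective.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(unimodal, fused, tau=0.1):
    """InfoNCE bound: I(u; f) >= log(B) - CE, with positives on the diagonal."""
    u = F.normalize(unimodal, dim=-1)          # (B, D)
    f = F.normalize(fused, dim=-1)             # (B, D), same sample order as u
    logits = u @ f.t() / tau
    labels = torch.arange(u.size(0), device=u.device)
    return torch.log(torch.tensor(float(u.size(0)))) - F.cross_entropy(logits, labels)

u = torch.randn(32, 256)                       # a unimodal representation
f = u + 0.1 * torch.randn(32, 256)             # a fused representation correlated with u
print(infonce_mi_lower_bound(u, f).item())     # positive when u and f are well aligned
```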

UP-NeRF: Unconstrained Pose-Prior-Free Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.03784
  • repo_url: None
  • paper_authors: Injae Kim, Minhyuk Choi, Hyunwoo J. Kim
  • for: Handling unconstrained image collections without camera pose priors.
  • methods: Surrogate tasks optimize color-insensitive feature fields and a separate module handles transient occluders, while a candidate head and transient-aware depth supervision improve pose estimation and robustness to occluders.
  • results: The method outperforms baselines, including BARF and its variants, on a challenging internet photo collection, the Phototourism dataset.
    Abstract Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. So they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose $\textbf{UP-NeRF}$ ($\textbf{U}$nconstrained $\textbf{P}$ose-prior-free $\textbf{Ne}$ural $\textbf{R}$adiance $\textbf{F}$ields) to optimize NeRF with unconstrained image collections without camera pose prior. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of incorrect prior. Our experiments verify the superior performance of our method compared to the baselines including BARF and its variants in a challenging internet photo collection, $\textit{Phototourism}$ dataset.

CapST: An Enhanced and Lightweight Method for Deepfake Video Classification

  • paper_url: http://arxiv.org/abs/2311.03782
  • repo_url: None
  • paper_authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure, Sarwar Khan, Sahibzada Adil Shahzad
  • for: This work targets the classification of deepfake videos to support identification across domains such as politics, entertainment, and security.
  • methods: The proposed model uses part of VGG19bn as a backbone and combines a Capsule Network with a spatial-temporal attention mechanism, plus a video-level fusion technique based on temporal attention, to improve classification accuracy while conserving computational resources.
  • results: On the deepfake benchmark dataset DFDM, the model improves accuracy by up to 4% over baseline models while requiring fewer computational resources.
    Abstract The proliferation of deepfake videos, synthetic media produced through advanced Artificial Intelligence techniques has raised significant concerns across various sectors, encompassing realms such as politics, entertainment, and security. In response, this research introduces an innovative and streamlined model designed to classify deepfake videos generated by five distinct encoders adeptly. Our approach not only achieves state of the art performance but also optimizes computational resources. At its core, our solution employs part of a VGG19bn as a backbone to efficiently extract features, a strategy proven effective in image-related tasks. We integrate a Capsule Network coupled with a Spatial Temporal attention mechanism to bolster the model's classification capabilities while conserving resources. This combination captures intricate hierarchies among features, facilitating robust identification of deepfake attributes. Delving into the intricacies of our innovation, we introduce an existing video level fusion technique that artfully capitalizes on temporal attention mechanisms. This mechanism serves to handle concatenated feature vectors, capitalizing on the intrinsic temporal dependencies embedded within deepfake videos. By aggregating insights across frames, our model gains a holistic comprehension of video content, resulting in more precise predictions. Experimental results on an extensive benchmark dataset of deepfake videos called DFDM showcase the efficacy of our proposed method. Notably, our approach achieves up to a 4 percent improvement in accurately categorizing deepfake videos compared to baseline models, all while demanding fewer computational resources.

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

  • paper_url: http://arxiv.org/abs/2311.03774
  • repo_url: None
  • paper_authors: Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
  • for: Enabling online few-shot adaptation so that CLIP performs better in few-shot recognition.
  • methods: Meta-Adapter, a lightweight residual-style adapter that refines CLIP features online under the guidance of a handful of few-shot samples.
  • results: With only a few training samples, the method delivers effective few-shot learning and keeps competitive performance on unseen data and tasks without additional fine-tuning.
    Abstract The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of over-fitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6\% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.
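One plausible reading of a residual-style, online adapter is sketched below: the frozen CLIP image feature attends over the few-shot support features, and the result is added back through a small learnable gate. Layer sizes, the gating scheme, and all names are our assumptions, not the released Meta-Adapter.

```python
import torch
import torch.nn as nn

class ResidualFewShotAdapter(nn.Module):
    """Hypothetical residual adapter: refine query features with few-shot supports."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.tensor(0.1))  # small initial residual weight

    def forward(self, query_feat: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # query_feat: (B, dim) CLIP image features; support_feats: (B, K, dim) few-shot features
        q = query_feat.unsqueeze(1)                        # (B, 1, dim)
        refined, _ = self.attn(q, support_feats, support_feats)
        out = query_feat + self.gate * refined.squeeze(1)  # residual refinement
        return nn.functional.normalize(out, dim=-1)        # keep CLIP-style unit norm

# Classification would then proceed by cosine similarity with CLIP text embeddings (not shown).
adapter = ResidualFewShotAdapter()
print(adapter(torch.randn(4, 512), torch.randn(4, 16, 512)).shape)  # torch.Size([4, 512])
```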

Lightweight Portrait Matting via Regional Attention and Refinement

  • paper_url: http://arxiv.org/abs/2311.03770
  • repo_url: None
  • paper_authors: Yatao Zhong, Ilya Zharkov
  • for: High-resolution portrait matting.
  • methods: A lightweight model that uses a vision transformer (ViT) as the backbone of its low-resolution network and adds a novel cross-region attention (CRA) module to the refinement network to propagate contextual information across local regions.
  • results: Outperforms other baselines on three benchmark datasets while running in real time with far lower computational cost.
    Abstract We present a lightweight model for high resolution portrait matting. The model does not use any auxiliary inputs such as trimaps or background captures and achieves real time performance for HD videos and near real time for 4K. Our model is built upon a two-stage framework with a low resolution network for coarse alpha estimation followed by a refinement network for local region improvement. However, a naive implementation of the two-stage model suffers from poor matting quality if not utilizing any auxiliary inputs. We address the performance gap by leveraging the vision transformer (ViT) as the backbone of the low resolution network, motivated by the observation that the tokenization step of ViT can reduce spatial resolution while retain as much pixel information as possible. To inform local regions of the context, we propose a novel cross region attention (CRA) module in the refinement network to propagate the contextual information across the neighboring regions. We demonstrate that our method achieves superior results and outperforms other baselines on three benchmark datasets while only uses $1/20$ of the FLOPS compared to the existing state-of-the-art model.

Image change detection with only a few samples

  • paper_url: http://arxiv.org/abs/2311.03762
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Ke Liu, Zhaoyi Song, Haoyue Bai
  • for: Image change detection when only a small number of annotated samples are available.
  • methods: Simple image processing is used to generate synthetic yet informative training data, and an early-fusion network based on object detection is designed to improve the generalization ability of change detection models.
  • results: Training on the synthetic data yields strong generalization, and fine-tuning with only tens of samples achieves excellent results.
    Abstract This paper considers image change detection with only a small number of samples, which is a significant problem in terms of a few annotations available. A major impediment of image change detection task is the lack of large annotated datasets covering a wide variety of scenes. Change detection models trained on insufficient datasets have shown poor generalization capability. To address the poor generalization issue, we propose using simple image processing methods for generating synthetic but informative datasets, and design an early fusion network based on object detection which could outperform the siamese neural network. Our key insight is that the synthetic data enables the trained model to have good generalization ability for various scenarios. We compare the model trained on the synthetic data with that on the real-world data captured from a challenging dataset, CDNet, using six different test sets. The results demonstrate that the synthetic data is informative enough to achieve higher generalization ability than the insufficient real-world data. Besides, the experiment shows that utilizing a few (often tens of) samples to fine-tune the model trained on the synthetic data will achieve excellent results.
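The data strategy above — generating synthetic but informative change pairs with simple image processing — can be mimicked as follows: copy an image, paste a randomly placed patch into the copy, and record the pasted region as the change mask. A toy stand-in under our own assumptions; the paper's actual synthesis pipeline may differ.

```python
import numpy as np

def make_synthetic_change_pair(image: np.ndarray, rng: np.random.Generator):
    """Create (image_t0, image_t1, change_mask) by pasting a random patch.

    image: HxWx3 uint8 array. Returns a copy with one inserted rectangle and
    the corresponding binary mask. A toy stand-in for the paper's synthesis.
    """
    h, w = image.shape[:2]
    t1 = image.copy()
    mask = np.zeros((h, w), dtype=np.uint8)

    ph, pw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)

    patch = rng.integers(0, 256, size=(ph, pw, 3), dtype=np.uint8)  # the "new object"
    t1[y:y + ph, x:x + pw] = patch
    mask[y:y + ph, x:x + pw] = 1
    return image, t1, mask

rng = np.random.default_rng(0)
t0 = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
_, t1, mask = make_synthetic_change_pair(t0, rng)
print(t1.shape, mask.sum())
```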

Multiclass Segmentation using Teeth Attention Modules for Dental X-ray Images

  • paper_url: http://arxiv.org/abs/2311.03749
  • repo_url: None
  • paper_authors: Afnan Ghafoor, Seong-Yong Moon, Bumshik Lee
  • for: A novel multiclass teeth segmentation architecture that addresses the inaccurate and unreliable results of existing dental image segmentation methods.
  • methods: The architecture integrates an M-Net-like structure, Swin Transformers, and a Teeth Attention Block (TAB) whose dedicated attention mechanism focuses on the complex structures of teeth.
  • results: Accurately segments teeth in panoramic X-ray images and outperforms existing state-of-the-art methods on multiple benchmark dental image datasets.
    Abstract This paper proposed a cutting-edge multiclass teeth segmentation architecture that integrates an M-Net-like structure with Swin Transformers and a novel component named Teeth Attention Block (TAB). Existing teeth image segmentation methods have issues with less accurate and unreliable segmentation outcomes due to the complex and varying morphology of teeth, although teeth segmentation in dental panoramic images is essential for dental disease diagnosis. We propose a novel teeth segmentation model incorporating an M-Net-like structure with Swin Transformers and TAB. The proposed TAB utilizes a unique attention mechanism that focuses specifically on the complex structures of teeth. The attention mechanism in TAB precisely highlights key elements of teeth features in panoramic images, resulting in more accurate segmentation outcomes. The proposed architecture effectively captures local and global contextual information, accurately defining each tooth and its surrounding structures. Furthermore, we employ a multiscale supervision strategy, which leverages the left and right legs of the U-Net structure, boosting the performance of the segmentation with enhanced feature representation. The squared Dice loss is utilized to tackle the class imbalance issue, ensuring accurate segmentation across all classes. The proposed method was validated on a panoramic teeth X-ray dataset, which was taken in a real-world dental diagnosis. The experimental results demonstrate the efficacy of our proposed architecture for tooth segmentation on multiple benchmark dental image datasets, outperforming existing state-of-the-art methods in objective metrics and visual examinations. This study has the potential to significantly enhance dental image analysis and contribute to advances in dental applications.
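The squared Dice loss mentioned above, used to counter class imbalance, has a standard form; a minimal multi-class version is sketched below. The smoothing constant, reduction, and class count in the demo are assumptions.

```python
import torch

def squared_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Multi-class squared Dice loss.

    logits: (B, C, H, W) raw network outputs; target: (B, H, W) integer class labels.
    The denominator uses squared probabilities and squared one-hot targets,
    the variant commonly attributed to the V-Net paper.
    """
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = torch.nn.functional.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    dims = (0, 2, 3)                                  # sum over batch and spatial axes
    intersection = (probs * onehot).sum(dims)
    denominator = (probs ** 2).sum(dims) + (onehot ** 2).sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()                          # average over classes

# Demo with an assumed 33 classes (e.g., 32 teeth + background).
loss = squared_dice_loss(torch.randn(2, 33, 64, 64), torch.randint(0, 33, (2, 64, 64)))
print(float(loss))
```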

SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers

  • paper_url: http://arxiv.org/abs/2311.03747
  • repo_url: https://github.com/xyonglu/sbcformer
  • paper_authors: Xiangyong Lu, Masanori Suganuma, Takayuki Okatani
  • for: Deep learning on single board computers (SBCs), particularly for applications such as smart agriculture, fishery, and livestock management.
  • methods: SBCFormer, a CNN-ViT hybrid network that achieves high accuracy and fast computation on low-end CPUs.
  • results: On a Raspberry Pi 4 Model B, SBCFormer reaches an ImageNet-1K top-1 accuracy of about 80% at 1.0 frame/sec, a first for an SBC.
    Abstract Computer vision has become increasingly prevalent in solving real-world problems across diverse domains, including smart agriculture, fishery, and livestock management. These applications may not require processing many image frames per second, leading practitioners to use single board computers (SBCs). Although many lightweight networks have been developed for mobile/edge devices, they primarily target smartphones with more powerful processors and not SBCs with the low-end CPUs. This paper introduces a CNN-ViT hybrid network called SBCFormer, which achieves high accuracy and fast computation on such low-end CPUs. The hardware constraints of these CPUs make the Transformer's attention mechanism preferable to convolution. However, using attention on low-end CPUs presents a challenge: high-resolution internal feature maps demand excessive computational resources, but reducing their resolution results in the loss of local image details. SBCFormer introduces an architectural design to address this issue. As a result, SBCFormer achieves the highest trade-off between accuracy and speed on a Raspberry Pi 4 Model B with an ARM-Cortex A72 CPU. For the first time, it achieves an ImageNet-1K top-1 accuracy of around 80% at a speed of 1.0 frame/sec on the SBC. Code is available at https://github.com/xyongLu/SBCFormer.

Unsupervised Video Summarization

  • paper_url: http://arxiv.org/abs/2311.03745
  • repo_url: https://github.com/KaiyangZhou/pytorch-vsumm-reinforce
  • paper_authors: Hanqing Li, Diego Klabjan, Jean Utke
  • for: A new unsupervised method for automatic video summarization that borrows ideas from generative adversarial networks but removes the discriminator, giving the model a simple loss function and allowing its parts to be trained separately.
  • methods: An iterative training strategy alternately trains the reconstructor and the frame selector, a trainable mask vector is added during summary generation in training and evaluation, and an unsupervised model selection algorithm is included.
  • results: Experiments on two public datasets (SumMe and TVSum) and four self-created datasets (Soccer, LoL, MLB, and ShortMLB) show the contribution of each component, particularly the iterative training strategy, and comparisons with state-of-the-art methods highlight advantages in performance, stability, and training efficiency.
    Abstract This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.

3DifFusionDet: Diffusion Model for 3D Object Detection with Robust LiDAR-Camera Fusion

  • paper_url: http://arxiv.org/abs/2311.03742
  • repo_url: None
  • paper_authors: Xinhao Xiang, Simon Dräger, Jiawei Zhang
  • for: Improving 3D object detection performance with LiDAR-camera sensors.
  • methods: The 3DifFusionDet framework casts 3D object detection as a denoising diffusion process from noisy 3D boxes to target boxes; during training the model learns to reverse the noising process, and at inference it progressively refines a set of randomly generated boxes into the final results.
  • results: Extensive experiments on the KITTI benchmark show that 3DifFusionDet performs favorably against earlier, well-established detectors for real-world traffic object recognition.
    Abstract Good 3D object detection performance from LiDAR-Camera sensors demands seamless feature alignment and fusion strategies. We propose the 3DifFusionDet framework in this paper, which structures 3D object detection as a denoising diffusion process from noisy 3D boxes to target boxes. In this framework, ground truth boxes diffuse in a random distribution for training, and the model learns to reverse the noising process. During inference, the model gradually refines a set of boxes that were generated at random to the outcomes. Under the feature align strategy, the progressive refinement method could make a significant contribution to robust LiDAR-Camera fusion. The iterative refinement process could also demonstrate great adaptability by applying the framework to various detecting circumstances where varying levels of accuracy and speed are required. Extensive experiments on KITTI, a benchmark for real-world traffic object identification, revealed that 3DifFusionDet is able to perform favorably in comparison to earlier, well-respected detectors.
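To make the "noisy boxes to target boxes" formulation concrete, the sketch below shows a forward noising step on normalized 3D box parameters together with the training target, in the spirit of DiffusionDet-style detectors. The 7-parameter box encoding, cosine schedule, and scaling are assumptions rather than the authors' exact choices.

```python
import torch

def cosine_alphas_cumprod(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Cumulative alpha-bar values of the standard cosine noise schedule."""
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(1e-5, 1.0)

def q_sample_boxes(gt_boxes: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Diffuse ground-truth boxes (N, 7): (x, y, z, l, w, h, yaw), assumed pre-normalized to [-1, 1].

    Returns the noisy boxes the detector sees at step t and the injected noise
    it is trained to reverse.
    """
    noise = torch.randn_like(gt_boxes)
    a_bar = alphas_cumprod[t].view(-1, 1)             # (N, 1)
    noisy = a_bar.sqrt() * gt_boxes + (1 - a_bar).sqrt() * noise
    return noisy, noise

alphas = cosine_alphas_cumprod()
gt = torch.rand(16, 7) * 2 - 1                        # 16 fake normalized 3D boxes
t = torch.randint(0, 1000, (16,))
noisy, noise = q_sample_boxes(gt, t, alphas)
print(noisy.shape)                                    # at inference, boxes start as pure noise
```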

ADFactory: Automated Data Factory for Optical Flow Tasks

  • paper_url: http://arxiv.org/abs/2311.04246
  • repo_url: None
  • paper_authors: Han Ling
  • for: Improving the real-world generalization ability of optical flow methods.
  • methods: Scenes are reconstructed with NeRF from monocular photo collections, optical flow labels are computed between camera pose pairs from the rendered results, and the generated training data are filtered from multiple aspects such as reconstruction quality and depth consistency.
  • results: Surpasses existing self-supervised optical flow and monocular scene flow algorithms on KITTI, and often exceeds the best supervised methods in real-world zero-point generalization evaluation.
    Abstract A major challenge faced by current optical flow methods is the difficulty in generalizing them well into the real world, mainly due to the high production cost of datasets, which currently do not have a large real-world optical flow dataset. To address this challenge, we introduce a novel optical flow training framework that can efficiently train optical flow networks on the target data domain without manual annotation. Specifically, we use advanced Nerf technology to reconstruct scenes from photo groups collected by monocular cameras, and calculate the optical flow results between camera pose pairs from the rendered results. On this basis, we screen the generated training data from various aspects such as Nerf's reconstruction quality, visual consistency of optical flow labels, reconstruction depth consistency, etc. The filtered training data can be directly used for network supervision. Experimentally, the generalization ability of our scheme on KITTI surpasses existing self-supervised optical flow and monocular scene flow algorithms. Moreover, it can always surpass most supervised methods in real-world zero-point generalization evaluation.
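The core label-generation step — optical flow between two posed renders — follows from standard multi-view geometry: back-project pixels with the rendered depth, transform them with the relative camera pose, and re-project into the second view. The self-contained NumPy sketch below assumes shared pinhole intrinsics; the NeRF rendering and the subsequent filtering stages are not shown.

```python
import numpy as np

def flow_from_depth_and_pose(depth: np.ndarray, K: np.ndarray, T_1_from_0: np.ndarray) -> np.ndarray:
    """Optical flow (H, W, 2) from view 0 to view 1 given view-0 depth and relative pose.

    depth: (H, W) metric depth rendered for view 0; K: (3, 3) intrinsics shared by both views;
    T_1_from_0: (4, 4) rigid transform taking view-0 camera coordinates to view-1.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T      # (3, HW)

    cam0 = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                   # back-project
    cam0_h = np.vstack([cam0, np.ones((1, cam0.shape[1]))])                # homogeneous
    cam1 = (T_1_from_0 @ cam0_h)[:3]                                       # move to view 1
    proj = K @ cam1
    uv1 = proj[:2] / np.clip(proj[2:], 1e-6, None)                         # re-project

    flow = (uv1 - pix[:2]).T.reshape(h, w, 2)
    return flow

K = np.array([[200.0, 0, 160], [0, 200.0, 120], [0, 0, 1]])
T = np.eye(4); T[0, 3] = 0.1                                               # small x-translation
print(flow_from_depth_and_pose(np.full((240, 320), 2.0), K, T)[120, 160])  # ~[10, 0]
```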

DeepInspect: An AI-Powered Defect Detection for Manufacturing Industries

  • paper_url: http://arxiv.org/abs/2311.03725
  • repo_url: None
  • paper_authors: Arti Kumbhar, Amruta Chougule, Priya Lokhande, Saloni Navaghane, Aditi Burud, Saee Nimbalkar
  • for: Improving the accuracy and efficiency of defect detection in manufacturing.
  • methods: The system combines Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs) in an innovative approach to detecting defects in the manufacturing process.
  • results: Detects manufacturing defects with high precision while keeping the model robust and adaptable across varied defect scenarios.
    Abstract Utilizing Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs), our system introduces an innovative approach to defect detection in manufacturing. This technology excels in precisely identifying faults by extracting intricate details from product photographs, utilizing RNNs to detect evolving errors and generating synthetic defect data to bolster the model's robustness and adaptability across various defect scenarios. The project leverages a deep learning framework to automate real-time flaw detection in the manufacturing process. It harnesses extensive datasets of annotated images to discern complex defect patterns. This integrated system seamlessly fits into production workflows, thereby boosting efficiency and elevating product quality. As a result, it reduces waste and operational costs, ultimately enhancing market competitiveness.

Inertial Guided Uncertainty Estimation of Feature Correspondence in Visual-Inertial Odometry/SLAM

  • paper_url: http://arxiv.org/abs/2311.03722
  • repo_url: None
  • paper_authors: Seongwook Yoon, Jaehyun Kim, Sanghoon Sull
  • for: Improving the robustness and accuracy of autonomous navigation and augmented reality systems.
  • methods: Feature-correspondence uncertainty is estimated with inertial guidance that is robust to image degradation from motion blur, illumination change, and occlusion; a guidance distribution for sampling possible correspondences is fitted to an energy function based on image error.
  • results: The approach yields more robust uncertainty than conventional methods, and its feasibility is demonstrated by incorporating it into a recent visual-inertial odometry/SLAM algorithm on public datasets.
    Abstract Visual odometry and Simultaneous Localization And Mapping (SLAM) has been studied as one of the most important tasks in the areas of computer vision and robotics, to contribute to autonomous navigation and augmented reality systems. In case of feature-based odometry/SLAM, a moving visual sensor observes a set of 3D points from different viewpoints, correspondences between the projected 2D points in each image are usually established by feature tracking and matching. However, since the corresponding point could be erroneous and noisy, reliable uncertainty estimation can improve the accuracy of odometry/SLAM methods. In addition, inertial measurement unit is utilized to aid the visual sensor in terms of Visual-Inertial fusion. In this paper, we propose a method to estimate the uncertainty of feature correspondence using an inertial guidance robust to image degradation caused by motion blur, illumination change and occlusion. Modeling a guidance distribution to sample possible correspondence, we fit the distribution to an energy function based on image error, yielding more robust uncertainty than conventional methods. We also demonstrate the feasibility of our approach by incorporating it into one of recent visual-inertial odometry/SLAM algorithms for public datasets.
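One way to read the proposed scheme: the inertial prediction defines a guidance distribution over where a feature should reappear, candidate locations are sampled from it, and each candidate is weighted by an image-error energy, so the weighted spread becomes the correspondence uncertainty. The sketch below implements that reading with plain patch SSD as the energy and an isotropic Gaussian guidance — illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def correspondence_uncertainty(img0, img1, pt0, pt1_pred, sigma_guess=4.0,
                               n_samples=256, patch=5, beta=0.05, rng=None):
    """Weighted mean and 2x2 covariance of the matched location of pt0 in img1.

    pt1_pred is the inertially predicted location in img1; candidates are drawn from an
    isotropic Gaussian guidance around it and weighted by exp(-beta * patch SSD).
    """
    rng = rng or np.random.default_rng(0)
    r = patch // 2
    x0, y0 = int(pt0[0]), int(pt0[1])
    ref = img0[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1].astype(np.float64)

    cand = rng.normal(pt1_pred, sigma_guess, size=(n_samples, 2))
    cand = np.clip(cand, r, [img1.shape[1] - r - 1, img1.shape[0] - r - 1])

    w = np.empty(n_samples)
    for i, (cx, cy) in enumerate(cand.astype(int)):
        tgt = img1[cy - r:cy + r + 1, cx - r:cx + r + 1].astype(np.float64)
        w[i] = np.exp(-beta * np.mean((ref - tgt) ** 2))      # image-error energy
    w /= w.sum()

    mean = w @ cand
    centered = cand - mean
    cov = (centered * w[:, None]).T @ centered                 # 2x2 uncertainty
    return mean, cov

img = np.random.default_rng(1).random((240, 320))
mean, cov = correspondence_uncertainty(img, img, (160, 120), (163, 118))
print(mean.round(1), np.linalg.eigvalsh(cov).round(2))
```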

Multimodal deep representation learning for quantum cross-platform verification

  • paper_url: http://arxiv.org/abs/2311.03713
  • repo_url: None
  • paper_authors: Yang Qian, Yuxuan Du, Zhenliang He, Min-hsiu Hsieh, Dacheng Tao
  • for: Cross-platform verification in early-stage quantum computing: characterizing, with minimal measurements, the similarity of two imperfect quantum devices executing identical algorithms, especially at large qubit counts.
  • methods: A multimodal learning approach that treats measurement outcomes and the classical description of circuits compiled to the explored quantum devices as two distinct modalities; a multimodal neural network extracts knowledge from each modality independently and then fuses them into a comprehensive data representation.
  • results: On platforms with diverse noise models and system sizes up to 50 qubits, the approach improves prediction accuracy by three orders of magnitude over random measurements, demonstrates the complementary roles of the two modalities, and points toward multimodal learning for broader quantum system learning tasks.
    Abstract Cross-platform verification, a critical undertaking in the realm of early-stage quantum computing, endeavors to characterize the similarity of two imperfect quantum devices executing identical algorithms, utilizing minimal measurements. While the random measurement approach has been instrumental in this context, the quasi-exponential computational demand with increasing qubit count hurdles its feasibility in large-qubit scenarios. To bridge this knowledge gap, here we introduce an innovative multimodal learning approach, recognizing that the formalism of data in this task embodies two distinct modalities: measurement outcomes and classical description of compiled circuits on explored quantum devices, both enriched with unique information. Building upon this insight, we devise a multimodal neural network to independently extract knowledge from these modalities, followed by a fusion operation to create a comprehensive data representation. The learned representation can effectively characterize the similarity between the explored quantum devices when executing new quantum algorithms not present in the training data. We evaluate our proposal on platforms featuring diverse noise models, encompassing system sizes up to 50 qubits. The achieved results demonstrate a three-orders-of-magnitude improvement in prediction accuracy compared to the random measurements and offer compelling evidence of the complementary roles played by each modality in cross-platform verification. These findings pave the way for harnessing the power of multimodal learning to overcome challenges in wider quantum system learning tasks.

Unsupervised convolutional neural network fusion approach for change detection in remote sensing images

  • paper_url: http://arxiv.org/abs/2311.03679
  • repo_url: None
  • paper_authors: Weidong Yan, Pei Yan, Li Cao
  • for: A completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection in remote sensing images.
  • methods: The bi-temporal images are mapped into different feature spaces using convolution kernels of different sizes to extract multi-scale information; features produced by the same kernel are subtracted to obtain difference feature maps, the difference features at each scale are fused with a 1x1 convolution, and the multi-scale outputs are concatenated and fused by another 1x1 convolution.
  • results: Experiments on four real remote sensing datasets demonstrate the feasibility and effectiveness of the proposed approach.
    Abstract With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, so it is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection. Firstly, the bi-temporal images are transformed into different feature spaces by using convolution kernels of different sizes to extract multi-scale information of the images. Secondly, the output features of bi-temporal images at the same convolution kernels are subtracted to obtain the corresponding difference images, and the difference feature images at the same scale are fused into one feature image by using 1 * 1 convolution layer. Finally, the output features of different scales are concatenated and a 1 * 1 convolution layer is used to fuse the multi-scale information of the image. The model parameters are obtained by a redesigned sparse function. Our model has three features: the entire training process is conducted in an unsupervised manner, the network architecture is shallow, and the objective function is sparse. Thus, it can be seen as a kind of lightweight network model. Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach.
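The described fusion maps directly onto a very small PyTorch module: parallel convolutions of different kernel sizes applied to both dates, per-scale feature differencing, per-scale 1x1 fusion, then concatenation and a final 1x1 convolution. Channel counts and kernel sizes below are assumptions, and the redesigned sparse objective used to fit the weights is omitted.

```python
import torch
import torch.nn as nn

class USCNNFusion(nn.Module):
    """Sketch of the unsupervised shallow CNN fusion for bi-temporal change detection."""

    def __init__(self, in_ch=3, feat_ch=16, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, feat_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.scale_fuse = nn.ModuleList(
            nn.Conv2d(feat_ch, feat_ch, 1) for _ in kernel_sizes
        )
        self.final_fuse = nn.Conv2d(feat_ch * len(kernel_sizes), 1, 1)

    def forward(self, img_t0: torch.Tensor, img_t1: torch.Tensor) -> torch.Tensor:
        per_scale = []
        for conv, fuse in zip(self.branches, self.scale_fuse):
            diff = conv(img_t0) - conv(img_t1)      # same kernel applied to both dates
            per_scale.append(fuse(diff))            # per-scale 1x1 fusion
        x = torch.cat(per_scale, dim=1)             # concatenate scales
        return self.final_fuse(x)                   # 1x1 fusion into a change map

model = USCNNFusion()
print(model(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)).shape)  # (1, 1, 128, 128)
```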

Image Generation and Learning Strategy for Deep Document Forgery Detection

  • paper_url: http://arxiv.org/abs/2311.03650
  • repo_url: None
  • paper_authors: Yamato Okamoto, Osada Genki, Iu Yahiro, Rintaro Hasegawa, Peifei Zhu, Hirokatsu Kataoka
  • for: Countering the document forgery threat amplified by deep neural network (DNN) based generative methods.
  • methods: A document forgery image dataset (FD-VIED) is built by emulating attacks such as text addition, removal, and replacement, and the model is pre-trained with self-supervised learning on both natural images and document images.
  • results: Experiments show that the approach improves detection performance.
    Abstract In recent years, document processing has flourished and brought numerous benefits. However, there has been a significant rise in reported cases of forged document images. Specifically, recent advancements in deep neural network (DNN) methods for generative tasks may amplify the threat of document forgery. Traditional approaches for forged document images created by prevalent copy-move methods are unsuitable against those created by DNN-based methods, as we have verified. To address this issue, we construct a training dataset of document forgery images, named FD-VIED, by emulating possible attacks, such as text addition, removal, and replacement with recent DNN-methods. Additionally, we introduce an effective pre-training approach through self-supervised learning with both natural images and document images. In our experiments, we demonstrate that our approach enhances detection performance.
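The FD-VIED construction idea — emulating text addition, removal, and replacement on document images — can be mimicked with basic Pillow operations, as in the toy generator below. The font, colours, and box-selection logic are placeholders, and the DNN-based forgeries used in the paper are not reproduced here.

```python
from PIL import Image, ImageDraw, ImageFont

def forge_document(doc: Image.Image, box, new_text=None, mode="replace") -> Image.Image:
    """Apply a toy forgery inside `box` = (x0, y0, x1, y1).

    mode="remove": paint the region with the page background colour.
    mode="add"/"replace": (optionally erase, then) draw `new_text` into the region.
    """
    forged = doc.copy()
    draw = ImageDraw.Draw(forged)
    background = doc.getpixel((1, 1))                  # crude background estimate
    if mode in ("remove", "replace"):
        draw.rectangle(box, fill=background)           # text removal / erasure
    if mode in ("add", "replace") and new_text:
        font = ImageFont.load_default()
        draw.text((box[0] + 2, box[1] + 2), new_text, fill=(0, 0, 0), font=font)
    return forged

# Toy demo on a synthetic "document" page.
page = Image.new("RGB", (400, 200), (250, 250, 245))
d = ImageDraw.Draw(page)
d.text((20, 40), "Invoice total: 120 USD", fill=(0, 0, 0))
forged = forge_document(page, (20, 35, 220, 60), new_text="Invoice total: 990 USD")
print(page.size, forged.size)
```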

Instruct Me More! Random Prompting for Visual In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.03648
  • repo_url: None
  • paper_authors: Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
  • for: Improving the performance of visual in-context learning (ICL) in computer vision.
  • methods: A learnable perturbation (prompt) augments the input-output image pair (the in-context pair) supplied to the model, a method named Instruct Me More (InMeMo).
  • results: Compared with a baseline without the learnable prompt, InMeMo improves mainstream tasks, raising mIoU by 7.35 for foreground segmentation and 15.13 for single object detection.
    Abstract Large-scale models trained on extensive datasets, have emerged as the preferred approach due to their high generalizability across various tasks. In-context learning (ICL), a popular strategy in natural language processing, uses such models for different tasks by providing instructive prompts but without updating model parameters. This idea is now being explored in computer vision, where an input-output image pair (called an in-context pair) is supplied to the model with a query image as a prompt to exemplify the desired output. The efficacy of visual ICL often depends on the quality of the prompts. We thus introduce a method coined Instruct Me More (InMeMo), which augments in-context pairs with a learnable perturbation (prompt), to explore its potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance. Specifically, compared to the baseline without learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for foreground segmentation and single object detection tasks, respectively. Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training. Code is available at https://github.com/Jackieam/InMeMo.
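The central idea — a learnable perturbation applied to the in-context pair while the large vision model stays frozen — can be sketched as a padding-style pixel prompt, similar in spirit to visual prompting. The frame width, the decision to prompt only the in-context pair, and the module names below are assumptions.

```python
import torch
import torch.nn as nn

class LearnablePixelPrompt(nn.Module):
    """A trainable border ('frame') added to an image, leaving the centre untouched."""

    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.ones(1, image_size, image_size)
        mask[:, pad:-pad, pad:-pad] = 0                 # only the border is learnable
        self.register_buffer("mask", mask)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return (img + self.delta * self.mask).clamp(0, 1)

# The prompt is trained on the downstream loss while the ICL backbone stays frozen;
# here it is simply applied to the in-context input/output images before stitching.
prompt = LearnablePixelPrompt()
in_context_img = torch.rand(1, 3, 224, 224)
in_context_label = torch.rand(1, 3, 224, 224)
prompted = prompt(in_context_img), prompt(in_context_label)
print(prompted[0].shape)
```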

Random Field Augmentations for Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2311.03629
  • repo_url: None
  • paper_authors: Philip Andrew Mansfield, Arash Afkanpour, Warren Richard Morningstar, Karan Singhal
  • for: Self-supervised representation learning, improving image recognition and generalization.
  • methods: A new family of local transformations based on Gaussian random fields generates image augmentations; they generalize affine and color transformations (translation, rotation, color jitter) by allowing transformation parameters to vary from pixel to pixel.
  • results: The new transformations improve self-supervised representation learning, giving a 1.7% top-1 accuracy gain on ImageNet downstream classification and a 3.6% gain on out-of-distribution iNaturalist classification; however, because of their flexibility, the learned representations are sensitive to hyperparameters, so the diversity and strength of augmentations must be balanced.
    Abstract Self-supervised representation learning is heavily dependent on data augmentations to specify the invariances encoded in representations. Previous work has shown that applying diverse data augmentations is crucial to downstream performance, but augmentation techniques remain under-explored. In this work, we propose a new family of local transformations based on Gaussian random fields to generate image augmentations for self-supervised representation learning. These transformations generalize the well-established affine and color transformations (translation, rotation, color jitter, etc.) and greatly increase the space of augmentations by allowing transformation parameter values to vary from pixel to pixel. The parameters are treated as continuous functions of spatial coordinates, and modeled as independent Gaussian random fields. Empirical results show the effectiveness of the new transformations for self-supervised representation learning. Specifically, we achieve a 1.7% top-1 accuracy improvement over baseline on ImageNet downstream classification, and a 3.6% improvement on out-of-distribution iNaturalist downstream classification. However, due to the flexibility of the new transformations, learned representations are sensitive to hyperparameters. While mild transformations improve representations, we observe that strong transformations can degrade the structure of an image, indicating that balancing the diversity and strength of augmentations is important for improving generalization of learned representations.
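A concrete instance of a local transformation with pixel-wise parameters: sample two smooth random fields (approximated here by Gaussian-filtered white noise), treat them as per-pixel x/y displacements, and warp the image. The length scale and amplitude are assumptions, and the paper's family is broader (it also covers local colour transformations).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_field_warp(image: np.ndarray, amplitude=6.0, length_scale=25.0, rng=None) -> np.ndarray:
    """Warp an HxW (grayscale) image with a smooth random displacement field.

    Each displacement component is an (approximate) Gaussian random field obtained by
    low-pass filtering white noise, so nearby pixels move coherently.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape
    dx = gaussian_filter(rng.standard_normal((h, w)), length_scale)
    dy = gaussian_filter(rng.standard_normal((h, w)), length_scale)
    dx *= amplitude / (np.abs(dx).max() + 1e-8)       # normalize field magnitude
    dy *= amplitude / (np.abs(dy).max() + 1e-8)

    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([yy + dy, xx + dx])             # per-pixel sampling locations
    return map_coordinates(image, coords, order=1, mode="reflect")

rng = np.random.default_rng(0)
img = rng.random((128, 128))
aug = random_field_warp(img, rng=rng)
print(aug.shape, float(np.abs(aug - img).mean()))
```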

FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

  • paper_url: http://arxiv.org/abs/2311.03620
  • repo_url: None
  • paper_authors: Xinhao Xiang, Jiawei Zhang
  • for: A vision-transformer-based 3D object detection model that improves 3D detection performance.
  • methods: A hierarchical, pure-ViT framework embeds both camera images and LiDAR point clouds, and a fusion vision transformer merges the learned multi-modal representations before the detection head.
  • results: On the real-world traffic benchmarks KITTI and Waymo Open, FusionViT achieves state-of-the-art performance, outperforming both single-modality (camera-only or LiDAR-only) baselines and the latest multi-modal image-point cloud deep fusion approaches.
    Abstract For 3D object detection, both camera and lidar have been demonstrated to be useful sensory devices for providing complementary information about the same scenery with data representations in different modalities, e.g., 2D RGB image vs 3D point cloud. An effective representation learning and fusion of such multi-modal sensor data is necessary and critical for better 3D object detection performance. To solve the problem, in this paper, we will introduce a novel vision transformer-based 3D object detection model, namely FusionViT. Different from the existing 3D object detection approaches, FusionViT is a pure-ViT based framework, which adopts a hierarchical architecture by extending the transformer model to embed both images and point clouds for effective representation learning. Such multi-modal data embedding representations will be further fused together via a fusion vision transformer model prior to feeding the learned features to the object detection head for both detection and localization of the 3D objects in the input scenery. To demonstrate the effectiveness of FusionViT, extensive experiments have been done on real-world traffic object detection benchmark datasets KITTI and Waymo Open. Notably, our FusionViT model can achieve state-of-the-art performance and outperforms not only the existing baseline methods that merely rely on camera images or lidar point clouds, but also the latest multi-modal image-point cloud deep fusion approaches.
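A highly simplified view of transformer-based LiDAR-camera fusion: encode image patches and point features into tokens separately, add modality embeddings, concatenate the two token streams, and run a fusion transformer before a detection head. The sketch below uses stock PyTorch layers with made-up sizes and omits FusionViT's hierarchical staging and its actual detection head.

```python
import torch
import torch.nn as nn

class ToyFusionViT(nn.Module):
    """Minimal two-stream token encoder with a fusion transformer (illustrative only)."""

    def __init__(self, dim=128, heads=4, depth=2, patch=16, point_feats=4):
        super().__init__()
        self.img_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        self.pts_embed = nn.Linear(point_feats, dim)                          # per-point MLP
        self.modality = nn.Parameter(torch.zeros(2, dim))                     # modality embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.image_enc = nn.TransformerEncoder(layer, depth)
        self.point_enc = nn.TransformerEncoder(layer, depth)
        self.fusion = nn.TransformerEncoder(layer, depth)

    def forward(self, image: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        img_tok = self.img_embed(image).flatten(2).transpose(1, 2)            # (B, N_img, dim)
        pts_tok = self.pts_embed(points)                                      # (B, N_pts, dim)
        img_tok = self.image_enc(img_tok) + self.modality[0]
        pts_tok = self.point_enc(pts_tok) + self.modality[1]
        fused = self.fusion(torch.cat([img_tok, pts_tok], dim=1))             # joint attention
        return fused                                                          # feed to a 3D detection head

model = ToyFusionViT()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1024, 4))
print(out.shape)  # (2, 196 + 1024, 128)
```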