cs.CV - 2023-10-01

Sharingan: A Transformer-based Architecture for Gaze Following

  • paper_url: http://arxiv.org/abs/2310.00816
  • repo_url: None
  • paper_authors: Samy Tafasca, Anshul Gupta, Jean-Marc Odobez
  • for: This paper studies models of human gaze following so that they can be used across a broad range of application domains.
  • methods: The paper introduces a novel transformer-based architecture for 2D gaze prediction, with one variant that predicts a per-person gaze heatmap and a second that casts the problem as 2D point regression, enabling multi-person prediction in a single forward pass.
  • results: The paper achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets.
    Abstract Gaze is a powerful form of non-verbal communication and social interaction that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, Gaze Following is defined as the prediction of the pixel-wise 2D location where a person in the image is looking. Prior efforts in this direction have focused primarily on CNN-based architectures to perform the task. In this paper, we introduce a novel transformer-based architecture for 2D gaze prediction. We experiment with 2 variants: the first one retains the same task formulation of predicting a gaze heatmap for one person at a time, while the second one casts the problem as a 2D point regression and allows us to perform multi-person gaze prediction with a single forward pass. This new architecture achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets. The code for this paper will be made publicly available.

Completing Visual Objects via Bridging Generation and Segmentation

  • paper_url: http://arxiv.org/abs/2310.00808
  • repo_url: None
  • paper_authors: Xiang Li, Yinpeng Chen, Chung-Ching Lin, Rita Singh, Bhiksha Raj, Zicheng Liu
  • for: reconstruction of a complete object from its partially visible components
  • methods: iterative stages of generation and segmentation, with the object mask provided as an additional condition
  • results: superior object completion results compared to existing approaches such as ControlNet and Stable Diffusion
    Abstract This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the generated images can lead to a more accurate mask by fusing the segmentation of images. We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser. Through alternation between the generation and segmentation stages, the partial object mask is progressively refined, providing precise shape guidance and yielding superior object completion results. Our experiments demonstrate the superiority of MaskComp over existing approaches, e.g., ControlNet and Stable Diffusion, establishing it as an effective solution for object completion.
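Below is a minimal sketch of the generate/segment alternation MaskComp describes, written with NumPy; `generator` and `segmenter` are hypothetical stand-ins for a mask-conditioned image generator and a segmentation model, and the iteration and sample counts are illustrative, not the paper's settings.

```python
# Illustrative sketch of a MaskComp-style mask denoiser: alternate a mask-conditioned
# generation stage and a segmentation stage to progressively refine a partial object mask.
import numpy as np

def refine_mask(image, partial_mask, generator, segmenter, n_iters=3, n_samples=4):
    """Alternate generation and segmentation to denoise a partial object mask."""
    mask = partial_mask.astype(np.float32)
    for _ in range(n_iters):
        # Generation stage: sample several completions conditioned on the current mask.
        samples = [generator(image, mask) for _ in range(n_samples)]
        # Segmentation stage: segment each sample and fuse the results into a new mask.
        seg_maps = np.stack([segmenter(s) for s in samples], axis=0)
        mask = (seg_maps.mean(axis=0) > 0.5).astype(np.float32)
    return mask
```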

Propagating Semantic Labels in Video Data

  • paper_url: http://arxiv.org/abs/2310.00783
  • repo_url: None
  • paper_authors: David Balaban, Justin Medich, Pranay Gosar, Justin Hart
  • for: This paper proposes a Foundation Model-based video segmentation method that reduces manual annotation effort.
  • methods: The method combines the Segment Anything Model (SAM) with Structure from Motion (SfM). The video input is first reconstructed into 3D geometry using SfM, a frame is then segmented with SAM, and the resulting segments are projected onto the 3D geometry so they can be reprojected and tracked in new views.
  • results: The method greatly reduces manual annotation effort, although its accuracy falls short of manual labeling. Performance is evaluated over three main metrics: computation time, mask IoU with manual labels, and the number of tracking losses. The system shows substantial computation-time improvements over human labeling when tracking objects across video frames, at the cost of some loss in accuracy.
    Abstract Semantic Segmentation combines two sub-tasks: the identification of pixel-level image masks and the application of semantic labels to those masks. Recently, so-called Foundation Models have been introduced; general models trained on very large datasets which can be specialized and applied to more specific tasks. One such model, the Segment Anything Model (SAM), performs image segmentation. Semantic segmentation systems such as CLIPSeg and MaskRCNN are trained on datasets of paired segments and semantic labels. Manual labeling of custom data, however, is time-consuming. This work presents a method for performing segmentation for objects in video. Once an object has been found in a frame of video, the segment can then be propagated to future frames; thus reducing manual annotation effort. The method works by combining SAM with Structure from Motion (SfM). The video input to the system is first reconstructed into 3D geometry using SfM. A frame of video is then segmented using SAM. Segments identified by SAM are then projected onto the reconstructed 3D geometry. In subsequent video frames, the labeled 3D geometry is reprojected into the new perspective, allowing SAM to be invoked fewer times. System performance is evaluated, including the contributions of the SAM and SfM components. Performance is evaluated over three main metrics: computation time, mask IOU with manual labels, and the number of tracking losses. Results demonstrate that the system has substantial computation time improvements over human performance for tracking objects over video frames, but suffers in performance.
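A minimal sketch of the reprojection step that lets SAM be invoked fewer times: labeled 3D points from a previously segmented frame are projected into a new camera view. It assumes a simple pinhole camera model and NumPy arrays; function and parameter names are illustrative, not the authors' code.

```python
# Reproject labeled 3D points (from SfM + a SAM-segmented frame) into a new view.
import numpy as np

def reproject_labels(points_3d, labels, K, R, t, height, width):
    """Project labeled 3D points into a new view and rasterize a sparse label map."""
    cam = (R @ points_3d.T + t.reshape(3, 1)).T            # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6
    cam, lab = cam[in_front], labels[in_front]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                            # perspective divide
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    label_map = np.zeros((height, width), dtype=np.int32)   # 0 = unlabeled
    label_map[v[valid], u[valid]] = lab[valid]
    return label_map
```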

SMOOT: Saliency Guided Mask Optimized Online Training

  • paper_url: http://arxiv.org/abs/2310.00772
  • repo_url: None
  • paper_authors: Ali Karkehabadi, Houman Homayoun, Avesta Sasan
  • for: This paper aims to improve the interpretability of deep neural networks by extending Saliency-Guided Training (SGT).
  • methods: The method uses back-propagation and modified gradients to guide the model toward the most relevant features, and proposes determining the optimal number of masked inputs based on the input, accuracy, and model loss during training.
  • results: Experimental results demonstrate a substantial improvement in both model accuracy and the prominence of saliency.
    Abstract Deep Neural Networks are powerful tools for understanding complex patterns and making decisions. However, their black-box nature impedes a complete understanding of their inner workings. Saliency-Guided Training (SGT) methods try to highlight the prominent features in the model's training based on the output to alleviate this problem. These methods use back-propagation and modified gradients to guide the model toward the most relevant features while keeping the impact on the prediction accuracy negligible. SGT makes the model's final result more interpretable by masking input partially. In this way, considering the model's output, we can infer how each segment of the input affects the output. In the particular case of image as the input, masking is applied to the input pixels. However, the masking strategy and number of pixels which we mask, are considered as a hyperparameter. Appropriate setting of masking strategy can directly affect the model's training. In this paper, we focus on this issue and present our contribution. We propose a novel method to determine the optimal number of masked images based on input, accuracy, and model loss during the training. The strategy prevents information loss which leads to better accuracy values. Also, by integrating the model's performance in the strategy formula, we show that our model represents the salient features more meaningful. Our experimental results demonstrate a substantial improvement in both model accuracy and the prominence of saliency, thereby affirming the effectiveness of our proposed solution.
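A hedged sketch of the saliency-guided masking step that SGT-style methods apply to image inputs, written in PyTorch: pixels with the lowest gradient-based saliency are masked out, and the masked fraction `k` is the hyperparameter that SMOOT proposes to set adaptively. This illustrates the general idea rather than the paper's exact procedure.

```python
# Mask the least-salient input pixels, where saliency is the input-gradient magnitude.
import torch

def saliency_mask(model, images, labels, k=0.3):
    """Mask the k least-salient pixels of each image, based on input gradients."""
    images = images.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    grads = torch.autograd.grad(loss, images)[0]
    saliency = grads.abs().sum(dim=1, keepdim=True)          # (B, 1, H, W)
    flat = saliency.flatten(1)
    threshold = flat.quantile(k, dim=1).view(-1, 1, 1, 1)     # per-image cutoff
    keep = (saliency >= threshold).float()
    return images.detach() * keep                             # masked-out pixels set to 0
```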

Counterfactual Image Generation for adversarially robust and interpretable Classifiers

  • paper_url: http://arxiv.org/abs/2310.00761
  • repo_url: None
  • paper_authors: Rafael Bischof, Florian Scheidegger, Michael A. Kraus, A. Cristiano I. Malossi
  • for: This method aims to improve both the interpretability and the adversarial robustness of neural image classifiers.
  • methods: The method uses image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples that serve both as explanations and as adversarial samples for data augmentation, by combining the classifier and discriminator into a single model.
  • results: The method produces highly descriptive saliency maps with competitive IoU values, improves robustness to adversarial (PGD) attacks, and the discriminator's "fakeness" value can be used as an uncertainty measure of the predictions.
    Abstract Neural Image Classifiers are effective but inherently hard to interpret and susceptible to adversarial attacks. Solutions to both problems exist, among others, in the form of counterfactual examples generation to enhance explainability or adversarially augment training datasets for improved robustness. However, existing methods exclusively address only one of the issues. We propose a unified framework leveraging image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples that highlight salient regions for interpretability and act as adversarial samples to augment the dataset for more robustness. This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake". We assess the method's effectiveness by evaluating (i) the produced explainability masks on a semantic segmentation task for concrete cracks and (ii) the model's resilience against the Projected Gradient Descent (PGD) attack on a fruit defects detection problem. Our produced saliency maps are highly descriptive, achieving competitive IoU values compared to classical segmentation models despite being trained exclusively on classification labels. Furthermore, the model exhibits improved robustness to adversarial attacks, and we show how the discriminator's "fakeness" value serves as an uncertainty measure of the predictions.
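A small sketch of the combined classifier/discriminator idea: a single (K + 1)-way head whose extra class flags generated images as "fake", with its probability reusable as the "fakeness" uncertainty measure. The backbone, dimensions, and class layout are assumptions for illustration only, not the paper's exact design.

```python
# Single model that both classifies real images and flags generated ones as fake.
import torch.nn as nn

class ClassifierDiscriminator(nn.Module):
    """Backbone + (K + 1)-way head: indices 0..K-1 are real classes, index K means 'fake'."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, x):
        logits = self.head(self.backbone(x))
        fakeness = logits.softmax(dim=-1)[..., -1]   # probability of the "fake" class,
        return logits, fakeness                       # usable as an uncertainty measure
```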

Top-down Green-ups: Satellite Sensing and Deep Models to Predict Buffelgrass Phenology

  • paper_url: http://arxiv.org/abs/2310.00740
  • repo_url: https://github.com/lurosenb/phenology_projects
  • paper_authors: Lucas Rosenblatt, Bin Han, Erin Posthumus, Theresa Crimmins, Bill Howe
  • for: Predicting buffelgrass "green-ups" (i.e., readiness for herbicidal treatment) to help prevent severe wildfires and biodiversity loss in the Southwest United States.
  • methods: Satellite sensing combined with deep learning, exploring temporal, visual, and multi-modal models to improve the accuracy of green-up prediction.
  • results: All neural-based approaches improve over conventional buffelgrass green-up models, and the paper discusses how deploying the neural models promises significant resource savings.
    Abstract An invasive species of grass known as "buffelgrass" contributes to severe wildfires and biodiversity loss in the Southwest United States. We tackle the problem of predicting buffelgrass "green-ups" (i.e. readiness for herbicidal treatment). To make our predictions, we explore temporal, visual and multi-modal models that combine satellite sensing and deep learning. We find that all of our neural-based approaches improve over conventional buffelgrass green-up models, and discuss how neural model deployment promises significant resource savings.

HOH: Markerless Multimodal Human-Object-Human Handover Dataset with Large Object Count

  • paper_url: http://arxiv.org/abs/2310.00723
  • repo_url: None
  • paper_authors: Noah Wiederhold, Ava Megyeri, DiMaggio Paris, Sean Banerjee, Natasha Kholgade Banerjee
  • for: A dataset intended to accelerate data-driven research on handover studies, human-robot handover implementation, and AI-based handover parameter estimation.
  • methods: The dataset provides multi-view RGB and depth data, skeletons, fused point clouds, grasp type and handedness labels, 2D and 3D segmentations of the object, giver hand, and receiver hand, giver and receiver comfort ratings, and paired object metadata with aligned 3D models, covering human handover interactions over 136 objects.
  • results: Neural networks trained on HOH perform grasp, orientation, and trajectory prediction. As the only fully markerless handover capture dataset, HOH captures natural human-human handovers without the special suiting required by markered datasets and retains high-resolution hand tracking. To date, HOH is the largest handover dataset in number of objects, participants, pairs with role reversal, and total interactions captured.
    Abstract We present the HOH (Human-Object-Human) Handover Dataset, a large object count dataset with 136 objects, to accelerate data-driven research on handover studies, human-robot handover implementation, and artificial intelligence (AI) on handover parameter estimation from 2D and 3D data of person interactions. HOH contains multi-view RGB and depth data, skeletons, fused point clouds, grasp type and handedness labels, object, giver hand, and receiver hand 2D and 3D segmentations, giver and receiver comfort ratings, and paired object metadata and aligned 3D models for 2,720 handover interactions spanning 136 objects and 20 giver-receiver pairs (40 with role-reversal) organized from 40 participants. We also show experimental results of neural networks trained using HOH to perform grasp, orientation, and trajectory prediction. As the only fully markerless handover capture dataset, HOH represents natural human-human handover interactions, overcoming challenges with markered datasets that require specific suiting for body tracking, and lack high-resolution hand tracking. To date, HOH is the largest handover dataset in number of objects, participants, pairs with role reversal accounted for, and total interactions captured.

Logical Bias Learning for Object Relation Prediction

  • paper_url: http://arxiv.org/abs/2310.00712
  • repo_url: None
  • paper_authors: Xinyu Zhou, Zihan Ji, Anna Zhu
  • for: To improve the accuracy and reliability of scene graph generation (SGG) for better scene understanding and downstream graph reasoning.
  • methods: An object relation prediction strategy based on causal inference, together with an object enhancement module used for ablation studies.
  • results: Experiments on the Visual Genome 150 (VG-150) dataset demonstrate the effectiveness of the proposed method.
    Abstract Scene graph generation (SGG) aims to automatically map an image into a semantic structural graph for better scene understanding. It has attracted significant attention for its ability to provide object and relation information, enabling graph reasoning for downstream tasks. However, it faces severe limitations in practice due to the biased data and training method. In this paper, we present a more rational and effective strategy based on causal inference for object relation prediction. To further evaluate the superiority of our strategy, we propose an object enhancement module to conduct ablation studies. Experimental results on the Visual Gnome 150 (VG-150) dataset demonstrate the effectiveness of our proposed method. These contributions can provide great potential for foundation models for decision-making.

You Do Not Need Additional Priors in Camouflage Object Detection

  • paper_url: http://arxiv.org/abs/2310.00702
  • repo_url: None
  • paper_authors: Yuchen Dong, Heng Zhou, Chengyang Li, Junjie Xie, Yongqiang Xie, Zhongbo Li
  • for: To develop a camouflaged object detection network that does not depend on additional prior knowledge, addressing the reliance of existing methods on such priors.
  • methods: A novel adaptive feature aggregation method that combines multi-layer feature information to generate guidance information; unlike previous approaches that rely on edge or ranking priors, guidance is extracted directly from image features to steer model training.
  • results: Extensive experiments show the proposed method achieves comparable or superior performance to state-of-the-art approaches.
    Abstract Camouflage object detection (COD) poses a significant challenge due to the high resemblance between camouflaged objects and their surroundings. Although current deep learning methods have made significant progress in detecting camouflaged objects, many of them heavily rely on additional prior information. However, acquiring such additional prior information is both expensive and impractical in real-world scenarios. Therefore, there is a need to develop a network for camouflage object detection that does not depend on additional priors. In this paper, we propose a novel adaptive feature aggregation method that effectively combines multi-layer feature information to generate guidance information. In contrast to previous approaches that rely on edge or ranking priors, our method directly leverages information extracted from image features to guide model training. Through extensive experimental results, we demonstrate that our proposed method achieves comparable or superior performance when compared to state-of-the-art approaches.

A quantum moving target segmentation algorithm for grayscale video

  • paper_url: http://arxiv.org/abs/2310.03038
  • repo_url: None
  • paper_authors: Wenjie Liu, Lu Wang, Qingshan Wu
  • for: Real-time segmentation of moving targets in video.
  • methods: A quantum mechanism computes the differences between all pixels in all adjacent frames simultaneously and then quickly segments the moving target; a feasible quantum comparator is designed to compare grayscale values against a threshold, and quantum circuit units for three-frame difference, binarization, and AND operations are combined into the complete circuit.
  • results: Experiments on IBM Q confirm the feasibility of the algorithm in the noisy intermediate-scale quantum (NISQ) era. For a quantum video with $2^m$ frames (each frame a $2^n \times 2^n$ image with $q$ grayscale levels), the complexity can be reduced to $O(n^2 + q)$, an exponential speedup over the classical counterpart and lower than existing quantum algorithms.
    Abstract The moving target segmentation (MTS) aims to segment out moving targets in the video, however, the classical algorithm faces the huge challenge of real-time processing in the current video era. Some scholars have successfully demonstrated the quantum advantages in some video processing tasks, but not concerning moving target segmentation. In this paper, a quantum moving target segmentation algorithm for grayscale video is proposed, which can use quantum mechanism to simultaneously calculate the difference of all pixels in all adjacent frames and then quickly segment out the moving target. In addition, a feasible quantum comparator is designed to distinguish the grayscale values with the threshold. Then several quantum circuit units, including three-frame difference, binarization and AND operation, are designed in detail, and then are combined together to construct the complete quantum circuits for segmenting the moving target. For a quantum video with $2^m$ frames (every frame is a $2^n\times 2^n$ image with $q$ grayscale levels), the complexity of our algorithm can be reduced to O$(n^2 + q)$. Compared with the classic counterpart, it is an exponential speedup, while its complexity is also superior to the existing quantum algorithms. Finally, the experiment is conducted on IBM Q to show the feasibility of our algorithm in the noisy intermediate-scale quantum (NISQ) era.
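As a classical reference for the pipeline the quantum circuit implements (three-frame difference, binarization, AND), here is a NumPy sketch for grayscale frames; the threshold value is an illustrative assumption.

```python
# Classical three-frame difference analogue of the quantum moving-target segmentation.
import numpy as np

def three_frame_difference(prev_frame, curr_frame, next_frame, threshold=25):
    """Segment a moving target from three consecutive grayscale frames."""
    d1 = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    d2 = np.abs(next_frame.astype(np.int16) - curr_frame.astype(np.int16))
    b1 = d1 > threshold                 # binarization of the first difference
    b2 = d2 > threshold                 # binarization of the second difference
    return (b1 & b2).astype(np.uint8)   # AND keeps pixels that changed in both differences
```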

Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips

  • paper_url: http://arxiv.org/abs/2310.00698
  • repo_url: None
  • paper_authors: Reshma Ramaprasad
  • for: To make comic strips accessible to the blind or low-vision (BLV) community by providing natural language descriptions.
  • methods: Computer vision techniques extract information about the panels, characters, and text of the comic images; this information is then used as additional context when prompting a multimodal large language model (MLLM) to produce the descriptions.
  • results: Evaluated on a collection of comics annotated by human experts, the method performs encouragingly on both quantitative and qualitative metrics.
    Abstract Comic strips are a popular and expressive form of visual storytelling that can convey humor, emotion, and information. However, they are inaccessible to the BLV (Blind or Low Vision) community, who cannot perceive the images, layouts, and text of comics. Our goal in this paper is to create natural language descriptions of comic strips that are accessible to the visually impaired community. Our method consists of two steps: first, we use computer vision techniques to extract information about the panels, characters, and text of the comic images; second, we use this information as additional context to prompt a multimodal large language model (MLLM) to produce the descriptions. We test our method on a collection of comics that have been annotated by human experts and measure its performance using both quantitative and qualitative metrics. The outcomes of our experiments are encouraging and promising.
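A rough sketch of how the extracted panel, character, and text information might be assembled into an MLLM prompt; `detect_panels`, `detect_characters`, `ocr_text`, and `mllm_generate` are hypothetical placeholders, since the paper's exact pipeline and prompt format are not given in the abstract.

```python
# Pack per-panel extractions into a prompt for a multimodal LLM.
def describe_comic(image, detect_panels, detect_characters, ocr_text, mllm_generate):
    """Build an MLLM prompt from per-panel extractions and return a description."""
    lines = []
    for i, panel in enumerate(detect_panels(image), start=1):
        chars = ", ".join(detect_characters(panel)) or "none detected"
        text = ocr_text(panel) or "no text"
        lines.append(f'Panel {i}: characters: {chars}; text: "{text}"')
    prompt = (
        "Describe this comic strip for a blind reader, panel by panel.\n"
        + "\n".join(lines)
    )
    return mllm_generate(image=image, prompt=prompt)
```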

A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos

  • paper_url: http://arxiv.org/abs/2310.00670
  • repo_url: None
  • paper_authors: Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter
  • for: This work aims to improve the precision and comprehensiveness of descriptions of (bimanual) manipulation actions in videos, as needed in robotics, human-computer interaction, and video content analysis.
  • methods: A novel method integrating scene-graph-based modeling with a layered hierarchical attention mechanism: spatio-temporal interdependencies between objects and actions are first encoded with scene graphs, and a 3-level Graph Attention Network (GAT) architecture then recognizes both local and global contextual elements, so that several descriptions of different semantic complexity can be generated in parallel for the same video clip.
  • results: Experiments on several 2D and 3D datasets show consistently better accuracy, precision, and contextual relevance than the state of the art for both action recognition and description generation, and a large set of ablation experiments assesses the role of the different components. The method produces descriptions of different semantic depths, much as different people would, and its finer understanding of bimanual hand-object interactions may enable robots to emulate intricate human actions with heightened precision.
    Abstract Nuanced understanding and the generation of detailed descriptive content for (bimanual) manipulation actions in videos is important for disciplines such as robotics, human-computer interaction, and video content analysis. This study describes a novel method, integrating graph based modeling with layered hierarchical attention mechanisms, resulting in higher precision and better comprehensiveness of video descriptions. To achieve this, we encode, first, the spatio-temporal inter dependencies between objects and actions with scene graphs and we combine this, in a second step, with a novel 3-level architecture creating a hierarchical attention mechanism using Graph Attention Networks (GATs). The 3-level GAT architecture allows recognizing local, but also global contextual elements. This way several descriptions with different semantic complexity can be generated in parallel for the same video clip, enhancing the discriminative accuracy of action recognition and action description. The performance of our approach is empirically tested using several 2D and 3D datasets. By comparing our method to the state of the art we consistently obtain better performance concerning accuracy, precision, and contextual relevance when evaluating action recognition as well as description generation. In a large set of ablation experiments we also assess the role of the different components of our model. With our multi-level approach the system obtains different semantic description depths, often observed in descriptions made by different people, too. Furthermore, better insight into bimanual hand-object interactions as achieved by our model may portend advancements in the field of robotics, enabling the emulation of intricate human actions with heightened precision.

Liveness Detection Competition – Noncontact-based Fingerprint Algorithms and Systems (LivDet-2023 Noncontact Fingerprint)

  • paper_url: http://arxiv.org/abs/2310.00659
  • repo_url: None
  • paper_authors: Sandip Purnapatra, Humaira Rezaie, Bhavin Jawade, Yu Liu, Yue Pan, Luke Brosell, Mst Rumana Sumi, Lambert Igene, Alden Dimarco, Srirangaraj Setlur, Soumyabrata Dey, Stephanie Schuckers, Marco Huber, Jan Niklas Kolf, Meiling Fang, Naser Damer, Banafsheh Adami, Raul Chitic, Karsten Seelert, Vishesh Mistry, Rahul Parthe, Umit Kacar
  • for: The assessment and reporting of the state of the art in Presentation Attack Detection (PAD) using noncontact fingerprint-based methods.
  • methods: A noncontact fingerprint-based PAD competition for algorithms and systems, with a common evaluation protocol that includes finger photos of various Presentation Attack Instruments (PAIs) and live fingers.
  • results: The winning algorithm achieved an APCER of 11.35% and a BPCER of 0.62%, while the winning system achieved an APCER of 13.04% and a BPCER of 1.68%. Four-finger systems that make individual finger-based PAD decisions were also tested.
    Abstract Liveness Detection (LivDet) is an international competition series open to academia and industry with the objective to assess and report state-of-the-art in Presentation Attack Detection (PAD). LivDet-2023 Noncontact Fingerprint is the first edition of the noncontact fingerprint-based PAD competition for algorithms and systems. The competition serves as an important benchmark in noncontact-based fingerprint PAD, offering (a) independent assessment of the state-of-the-art in noncontact-based fingerprint PAD for algorithms and systems, (b) a common evaluation protocol, which includes finger photos of a variety of Presentation Attack Instruments (PAIs) and live fingers, made available to the biometric research community, and (c) standard algorithm and system evaluation protocols, along with the comparative analysis of state-of-the-art algorithms from academia and industry with both old and new android smartphones. The winning algorithm achieved an APCER of 11.35% averaged over all PAIs and a BPCER of 0.62%. The winning system achieved an APCER of 13.04%, averaged over all PAIs tested over all the smartphones, and a BPCER of 1.68% over all smartphones tested. Four-finger systems that make individual finger-based PAD decisions were also tested. The dataset used for the competition will be made available to all researchers as per the data sharing protocol.

Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.00647
  • repo_url: https://github.com/mshukor/EvALign-ICL
  • paper_authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord
  • for: This paper examines the flaws and limitations of large multimodal models (LMMs) and explores in-context learning (ICL) as a way to reduce them.
  • methods: Eight recent open-source LMMs (based on the Flamingo architecture, such as OpenFlamingo and IDEFICS) are evaluated along five axes: hallucinations, abstention, compositionality, explainability, and instruction following; the effect of ICL on these flaws is then studied, and new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL are proposed.
  • results: Despite strong task performance, LMMs show major flaws that scaling alone does not solve. The effect of ICL is nuanced: it improves explainability, abstention, and instruction following, but it does not improve compositionality and even amplifies hallucinations; the proposed ICL variants are promising post-hoc remedies for some of these flaws.
    Abstract Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to which extent such models are aligned to human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and propose the EvALign-ICL framework, in which we (1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture such as OpenFlamingo and IDEFICS) on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. To efficiently address these problems, and inspired by the success of in-context learning (ICL) in LLMs, (2) we explore ICL as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL approaches such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows; (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs flaws is nuanced; despite its effectiveness for improved explainability, abstention, and instruction following, ICL does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://evalign-icl.github.io/

RegBN: Batch Normalization of Multimodal Data with Regularization

  • paper_url: http://arxiv.org/abs/2310.00641
  • repo_url: https://github.com/mogvision/regbn
  • paper_authors: Morteza Ghahremani, Christian Wachinger
  • for: This paper proposes a new normalization method for multimodal data, so that heterogeneous data modalities can be fused effectively and model performance improves.
  • methods: RegBN incorporates regularization, using the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources; it generalizes across modalities and requires no learnable parameters, simplifying training and inference.
  • results: RegBN is validated on eight databases from five research areas, covering modalities such as language, audio, image, video, depth, tabular data, and 3D MRI, and across architectures including multilayer perceptrons, convolutional neural networks, and vision transformers, demonstrating its broad applicability and effectiveness.
    Abstract Recent years have witnessed a surge of interest in integrating high-dimensional data captured by multisource sensors, driven by the impressive success of neural networks in the integration of multimodal data. However, the integration of heterogeneous multimodal data poses a significant challenge, as confounding effects and dependencies among such heterogeneous data sources introduce unwanted variability and bias, leading to suboptimal performance of multimodal models. Therefore, it becomes crucial to normalize the low- or high-level features extracted from data modalities before their fusion takes place. This paper introduces a novel approach for the normalization of multimodal data, called RegBN, that incorporates regularization. RegBN uses the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources. The proposed method generalizes well across multiple modalities and eliminates the need for learnable parameters, simplifying training and inference. We validate the effectiveness of RegBN on eight databases from five research areas, encompassing diverse modalities such as language, audio, image, video, depth, tabular, and 3D MRI. The proposed method demonstrates broad applicability across different architectures such as multilayer perceptrons, convolutional neural networks, and vision transformers, enabling effective normalization of both low- and high-level features in multimodal neural networks. RegBN is available at \url{https://github.com/mogvision/regbn}.

Segmentation-based Assessment of Tumor-Vessel Involvement for Surgical Resectability Prediction of Pancreatic Ductal Adenocarcinoma

  • paper_url: http://arxiv.org/abs/2310.00639
  • repo_url: None
  • paper_authors: Christiaan Viviers, Mark Ramaekers, Amaan Valiuddin, Terese Hellström, Nick Tasios, John van der Ven, Igor Jacobs, Lotte Ewals, Joost Nederend, Peter de With, Misha Luyer, Fons van der Sommen
  • for: This research aims to provide a workflow and deep learning-based segmentation models to automatically assess tumor-vessel involvement in Pancreatic ductal adenocarcinoma (PDAC) patients, which is crucial for determining treatment options and improving patient outcomes.
  • methods: The proposed workflow involves processing CT scans to segment the tumor and vascular structures, analyzing spatial relationships and the extent of vascular involvement, using three different deep learning-based segmentation architectures (nnU-Net, 3D U-Net, and Probabilistic 3D U-Net).
  • results: The segmentations achieved a high accuracy in segmenting veins, arteries, and the tumor, and enabled automated detection of tumor involvement with high accuracy (0.88 sensitivity and 0.86 specificity). Additionally, the models captured uncertainty in the predicted involvement, providing clinicians with a clear indication of tumor-vessel involvement and facilitating more informed decision-making for surgical interventions.
    Abstract Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive cancer with limited treatment options. This research proposes a workflow and deep learning-based segmentation models to automatically assess tumor-vessel involvement, a key factor in determining tumor resectability. Correct assessment of resectability is vital to determine treatment options. The proposed workflow involves processing CT scans to segment the tumor and vascular structures, analyzing spatial relationships and the extent of vascular involvement, which follows a similar way of working as expert radiologists in PDAC assessment. Three segmentation architectures (nnU-Net, 3D U-Net, and Probabilistic 3D U-Net) achieve a high accuracy in segmenting veins, arteries, and the tumor. The segmentations enable automated detection of tumor involvement with high accuracy (0.88 sensitivity and 0.86 specificity) and automated computation of the degree of tumor-vessel contact. Additionally, due to significant inter-observer variability in these important structures, we present the uncertainty captured by each of the models to further increase insights into the predicted involvement. This result provides clinicians with a clear indication of tumor-vessel involvement and may be used to facilitate more informed decision-making for surgical interventions. The proposed method offers a valuable tool for improving patient outcomes, personalized treatment strategies and survival rates in pancreatic cancer.

Win-Win: Training High-Resolution Vision Transformers from Two Windows

  • paper_url: http://arxiv.org/abs/2310.00632
  • repo_url: None
  • paper_authors: Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel
  • for: To make training and inference of high-resolution vision transformers efficient.
  • methods: A random window masking technique: during training, most of the high-resolution input is masked out and only N random windows are kept, so the model learns local interactions between tokens inside each window and global interactions between tokens from different windows.
  • results: At test time the model directly processes the high-resolution input without any special handling, and it reaches state-of-the-art performance on semantic segmentation and optical flow tasks.
    Abstract Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers: the key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to the dense monocular task of semantic segmentation, and find that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. To demonstrate the generality of our contribution, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an inference time an order of magnitude faster than the best competitor.
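A minimal PyTorch sketch of the training-time masking described above, which keeps only N random windows of the patch-token grid; the window count, window size, and the choice to share windows across the batch are illustrative assumptions rather than the paper's implementation.

```python
# Keep only a few random windows of ViT patch tokens during training.
import torch

def sample_window_tokens(tokens, grid_h, grid_w, n_windows=2, win=16):
    """Keep patch tokens from `n_windows` random (win x win) windows of the token grid.

    tokens: (B, grid_h * grid_w, C) patch tokens of a high-resolution image.
    Returns the concatenated tokens of the selected windows, (B, n_windows * win * win, C).
    """
    B, _, C = tokens.shape
    grid = tokens.reshape(B, grid_h, grid_w, C)
    kept = []
    for _ in range(n_windows):
        top = torch.randint(0, grid_h - win + 1, (1,)).item()
        left = torch.randint(0, grid_w - win + 1, (1,)).item()
        kept.append(grid[:, top:top + win, left:left + win, :].reshape(B, -1, C))
    return torch.cat(kept, dim=1)   # local attention inside windows, global across them
```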

Finger-UNet: A U-Net based Multi-Task Architecture for Deep Fingerprint Enhancement

  • paper_url: http://arxiv.org/abs/2310.00629
  • repo_url: None
  • paper_authors: Ekta Gavas, Anoop Namboodiri
  • for: To improve the recognition of low-quality fingerprints through fingerprint enhancement.
  • methods: A U-Net-based multi-task architecture that uses the Discrete Wavelet Transform (DWT) for fingerprint enhancement, a wavelet attention module in place of max pooling, depthwise separable convolutions to reduce the memory footprint, and multi-task learning with minutiae prediction and ridge orientation estimation to aid fingerprint reconstruction.
  • results: Experiments on the FVC 2002 and NIST SD302 databases show the effectiveness of the approach compared to previous works.
    Abstract For decades, fingerprint recognition has been prevalent for security, forensics, and other biometric applications. However, the availability of good-quality fingerprints is challenging, making recognition difficult. Fingerprint images might be degraded with a poor ridge structure and noisy or less contrasting backgrounds. Hence, fingerprint enhancement plays a vital role in the early stages of the fingerprint recognition/verification pipeline. In this paper, we investigate and improvise the encoder-decoder style architecture and suggest intuitive modifications to U-Net to enhance low-quality fingerprints effectively. We investigate the use of Discrete Wavelet Transform (DWT) for fingerprint enhancement and use a wavelet attention module instead of max pooling which proves advantageous for our task. Moreover, we replace regular convolutions with depthwise separable convolutions, which significantly reduces the memory footprint of the model without degrading the performance. We also demonstrate that incorporating domain knowledge with fingerprint minutiae prediction task can improve fingerprint reconstruction through multi-task learning. Furthermore, we also integrate the orientation estimation task to propagate the knowledge of ridge orientations to enhance the performance further. We present the experimental results and evaluate our model on FVC 2002 and NIST SD302 databases to show the effectiveness of our approach compared to previous works.
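A small PyTorch sketch of the depthwise separable convolution block that replaces regular convolutions to cut the memory footprint; the kernel size, channel handling, and activation are illustrative choices.

```python
# Depthwise separable convolution: per-channel 3x3 conv followed by a 1x1 pointwise conv.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```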

GhostEncoder: Stealthy Backdoor Attacks with Dynamic Triggers to Pre-trained Encoders in Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2310.00626
  • repo_url: None
  • paper_authors: Qiannan Wang, Changchun Yin, Zhe Liu, Liming Fang, Run Wang, Chenhao Lin
  • for: This work proposes a stealthy, dynamic backdoor attack against image encoders pre-trained with self-supervised learning.
  • methods: Image steganography techniques encode hidden information into benign images to generate backdoor samples; the pre-trained image encoder is then fine-tuned on a manipulation dataset to inject the backdoor, so that downstream classifiers built on the backdoored encoder inherit the backdoor behavior.
  • results: Experiments on three downstream tasks show that GhostEncoder provides practical stealthiness on images and deceives the victim model with a high attack success rate without compromising its utility, and it withstands state-of-the-art defenses such as STRIP, STRIP-Cl, and SSL-Cleanse.
    Abstract Within the realm of computer vision, self-supervised learning (SSL) pertains to training pre-trained image encoders utilizing a substantial quantity of unlabeled images. Pre-trained image encoders can serve as feature extractors, facilitating the construction of downstream classifiers for various tasks. However, the use of SSL has led to an increase in security research related to various backdoor attacks. Currently, the trigger patterns used in backdoor attacks on SSL are mostly visible or static (sample-agnostic), making backdoors less covert and significantly affecting the attack performance. In this work, we propose GhostEncoder, the first dynamic invisible backdoor attack on SSL. Unlike existing backdoor attacks on SSL, which use visible or static trigger patterns, GhostEncoder utilizes image steganography techniques to encode hidden information into benign images and generate backdoor samples. We then fine-tune the pre-trained image encoder on a manipulation dataset to inject the backdoor, enabling downstream classifiers built upon the backdoored encoder to inherit the backdoor behavior for target downstream tasks. We evaluate GhostEncoder on three downstream tasks and results demonstrate that GhostEncoder provides practical stealthiness on images and deceives the victim model with a high attack success rate without compromising its utility. Furthermore, GhostEncoder withstands state-of-the-art defenses, including STRIP, STRIP-Cl, and SSL-Cleanse.
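GhostEncoder builds its backdoor samples via image steganography; the abstract does not specify the scheme, so the sketch below shows the simplest variant, least-significant-bit (LSB) embedding, purely to illustrate hiding information in a benign image.

```python
# Toy LSB steganography: hide and recover a bit string in a uint8 image.
import numpy as np

def embed_lsb(image, bits):
    """Hide a sequence of 0/1 bits in the least significant bits of a uint8 image."""
    flat = image.flatten()
    if len(bits) > flat.size:
        raise ValueError("message too long for this carrier image")
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | np.asarray(bits, dtype=np.uint8)
    return flat.reshape(image.shape)

def extract_lsb(image, n_bits):
    """Recover the first n_bits hidden by embed_lsb."""
    return (image.flatten()[:n_bits] & 1).astype(np.uint8)
```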

Understanding Adversarial Transferability in Federated Learning

  • paper_url: http://arxiv.org/abs/2310.00616
  • repo_url: None
  • paper_authors: Yijiang Li, Ying Gao, Haohan Wang
  • for: This paper investigates the robustness and security issues of federated learning (FL) systems in a practical setting where malicious clients disguise their identities and launch transferable adversarial attacks.
  • methods: The paper uses empirical experiments and theoretical analysis to study the robustness of FL systems against such attacks, and hypothesizes that the decentralized training on distributed data and the averaging operation contribute to the system's robustness.
  • results: The paper finds that the federated model is more robust compared to its centralized counterpart when the accuracy on clean images is comparable, and provides evidence from both empirical experiments and theoretical analysis to support this conclusion.
    Abstract We investigate the robustness and security issues from a novel and practical setting: a group of malicious clients has impacted the model during training by disguising their identities and acting as benign clients, and only revealing their adversary position after the training to conduct transferable adversarial attacks with their data, which is usually a subset of the data that FL system is trained with. Our aim is to offer a full understanding of the challenges the FL system faces in this practical setting across a spectrum of configurations. We notice that such an attack is possible, but the federated model is more robust compared with its centralized counterpart when the accuracy on clean images is comparable. Through our study, we hypothesized the robustness is from two factors: the decentralized training on distributed data and the averaging operation. We provide evidence from both the perspective of empirical experiments and theoretical analysis. Our work has implications for understanding the robustness of federated learning systems and poses a practical question for federated learning applications.
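A short sketch of the weight-averaging (FedAvg-style) aggregation step the paper points to as one source of the federated model's robustness; the sample-count weighting shown here is a standard choice, not necessarily the paper's exact setup.

```python
# Weighted averaging of client model weights into a global model.
import torch

def federated_average(client_state_dicts, client_sizes):
    """Average client model weights, weighted by each client's number of samples."""
    total = float(sum(client_sizes))
    avg = {}
    for key in client_state_dicts[0]:
        avg[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(client_state_dicts, client_sizes)
        )
    return avg   # load into the global model with model.load_state_dict(avg)
```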

Scene-aware Human Motion Forecasting via Mutual Distance Prediction

  • paper_url: http://arxiv.org/abs/2310.00615
  • repo_url: None
  • paper_authors: Chaoyue Xing, Wei Mao, Miaomiao Liu
  • for: This work tackles scene-aware 3D human motion forecasting by modeling human-scene interactions, so that predicted future motions remain consistent with the scene.
  • methods: The human-scene interaction is modeled with the mutual distance between the human body and the scene, consisting of the signed distance of each vertex on the human mesh to the scene surface and the distance of basis scene points to the human mesh. A two-step pipeline first predicts the future mutual distances from the past motion sequence and the scene, and then forecasts the future human motion conditioned on the predicted distances; consistency between predicted poses and distances is explicitly encouraged during training.
  • results: The approach outperforms state-of-the-art methods on both synthetic and real datasets.
    Abstract In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene, by modelling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they only provide constraints on partial human motion e.g., the global motion of the human or a few joints contacting the scene, leaving the rest motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, resulting in a whole-body motion constrained prediction. In particular, mutual distance constraints consist of two components, the signed distance of each vertex on the human mesh to the scene surface, and the distance of basis scene points to the human mesh. We develop a pipeline with two prediction steps that first predicts the future mutual distances from the past human motion sequence and the scene, and then forecasts the future human motion conditioning on the predicted mutual distances. During training, we explicitly encourage consistency between the predicted poses and the mutual distances. Our approach outperforms the state-of-the-art methods on both synthetic and real datasets.
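A hedged NumPy sketch of one component of the mutual distance, the signed distance from each human-mesh vertex to the scene surface, approximated with nearest sampled scene points and their normals; the scene representation and sign convention are illustrative assumptions.

```python
# Per-vertex signed distance between a human mesh and a point-sampled scene surface.
import numpy as np

def signed_vertex_to_scene_distance(human_vertices, scene_points, scene_normals):
    """Per-vertex signed distance: positive outside the scene surface, negative inside."""
    diff = human_vertices[:, None, :] - scene_points[None, :, :]     # (V, S, 3)
    dists = np.linalg.norm(diff, axis=-1)                            # (V, S)
    nearest = dists.argmin(axis=1)                                   # closest scene point per vertex
    offset = human_vertices - scene_points[nearest]                  # (V, 3)
    sign = np.sign(np.sum(offset * scene_normals[nearest], axis=-1))
    return sign * dists[np.arange(len(human_vertices)), nearest]
```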

Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

  • paper_url: http://arxiv.org/abs/2310.00608
  • repo_url: None
  • paper_authors: Zhiheng Li, Wenjia Geng, Muheng Li, Lei Chen, Yansong Tang, Jiwen Lu, Jie Zhou
  • for: This paper proposes a condensed action space learning method for procedure planning in instructional videos, addressing the difficulties existing methods have with high-dimensional state supervision and error accumulation over action sequences.
  • methods: The procedure planning problem is abstracted as a mathematical chain model; by skipping uncertain nodes and edges in action chains, long and complex sequence functions are transferred into short but reliable ones in two ways: skipping all intermediate state supervision, and decomposing long chains into multiple short sub-chains by skipping unreliable intermediate actions.
  • results: Extensive experiments on the CrossTask and COIN benchmarks show state-of-the-art performance.
    Abstract In this paper, we propose Skip-Plan, a condensed action space learning method for procedure planning in instructional videos. Current procedure planning methods all stick to the state-action pair prediction at every timestep and generate actions adjacently. Although it coincides with human intuition, such a methodology consistently struggles with high-dimensional state supervision and error accumulation on action sequences. In this work, we abstract the procedure planning problem as a mathematical chain model. By skipping uncertain nodes and edges in action chains, we transfer long and complex sequence functions into short but reliable ones in two ways. First, we skip all the intermediate state supervision and only focus on action predictions. Second, we decompose relatively long chains into multiple short sub-chains by skipping unreliable intermediate actions. By this means, our model explores all sorts of reliable sub-relations within an action sequence in the condensed action space. Extensive experiments show Skip-Plan achieves state-of-the-art performance on the CrossTask and COIN benchmarks for procedure planning.

Quantum image edge detection based on eight-direction Sobel operator for NEQR

  • paper_url: http://arxiv.org/abs/2310.03037
  • repo_url: None
  • paper_authors: Wenjie Liu, Lu Wang
  • for: This paper proposes a quantum image edge detection algorithm (QSED) to address the real-time processing problem faced by classical algorithms.
  • methods: An eight-direction Sobel operator is used, which reduces the loss of edge detail information compared with two- or four-direction operators and simultaneously computes the gradient values of all pixels in all eight directions; concrete quantum circuits for gradient calculation, non-maximum suppression, double threshold detection, and edge tracking are designed in detail.
  • results: For a $2^n \times 2^n$ image with $q$ grayscale levels, the complexity can be reduced to $O(n^2 + q^2)$, lower than other existing classical or quantum algorithms. Simulation experiments show the algorithm detects more edge information, especially diagonal edges, than the two- and four-direction QSED algorithms.
    Abstract Quantum Sobel edge detection (QSED) is a kind of algorithm for image edge detection using quantum mechanism, which can solve the real-time problem encountered by classical algorithms. However, the existing QSED algorithms only consider two- or four-direction Sobel operator, which leads to a certain loss of edge detail information in some high-definition images. In this paper, a novel QSED algorithm based on eight-direction Sobel operator is proposed, which not only reduces the loss of edge information, but also simultaneously calculates eight directions' gradient values of all pixel in a quantum image. In addition, the concrete quantum circuits, which consist of gradient calculation, non-maximum suppression, double threshold detection and edge tracking units, are designed in details. For a 2^n x 2^n image with q gray scale, the complexity of our algorithm can be reduced to O(n^2 + q^2), which is lower than other existing classical or quantum algorithms. And the simulation experiment demonstrates that our algorithm can detect more edge information, especially diagonal edges, than the two- and four-direction QSED algorithms.
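As a classical reference for what the quantum circuit computes in parallel, the sketch below evaluates the eight-direction Sobel response with NumPy/SciPy; kernels for directions 180 degrees apart are negatives of each other, so only four are listed, and the threshold is illustrative.

```python
# Classical eight-direction Sobel response: max over all eight directional gradients.
import numpy as np
from scipy.ndimage import convolve

K0 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])         # 0 degrees
K45 = np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]])         # 45 degrees
K90 = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])         # 90 degrees
K135 = np.array([[2, 1, 0], [1, 0, -1], [0, -1, -2]])        # 135 degrees

def eight_direction_sobel(image, threshold=128):
    """Edge map from the maximum response over all eight Sobel directions."""
    img = image.astype(np.float32)
    responses = []
    for k in (K0, K45, K90, K135):
        g = convolve(img, k, mode="reflect")
        responses.extend([g, -g])                             # opposite-direction pair
    magnitude = np.max(np.stack(responses), axis=0)
    return (magnitude > threshold).astype(np.uint8)
```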

Image Data Hiding in Neural Compressed Latent Representations

  • paper_url: http://arxiv.org/abs/2310.00568
  • repo_url: None
  • paper_authors: Chen-Hsiu Huang, Ja-Ling Wu
  • for: This paper develops an end-to-end learned image data hiding framework that embeds and extracts secret messages in the latent representations of a neural compressor.
  • methods: A generic learned neural compressor is combined with the proposed message encoder and decoder and trained with a perceptual loss function, achieving both high image quality and high bit accuracy.
  • results: The framework offers superior image secrecy and competitive watermarking robustness in the compressed domain while accelerating the embedding speed by over 50 times compared with existing techniques, demonstrating the potential of combining data hiding with neural compression.
    Abstract We propose an end-to-end learned image data hiding framework that embeds and extracts secrets in the latent representations of a generic neural compressor. By leveraging a perceptual loss function in conjunction with our proposed message encoder and decoder, our approach simultaneously achieves high image quality and high bit accuracy. Compared to existing techniques, our framework offers superior image secrecy and competitive watermarking robustness in the compressed domain while accelerating the embedding speed by over 50 times. These results demonstrate the potential of combining data hiding techniques and neural compression and offer new insights into developing neural compression techniques and their applications.

CPIPS: Learning to Preserve Perceptual Distances in End-to-End Image Compression

  • paper_url: http://arxiv.org/abs/2310.00559
  • repo_url: None
  • paper_authors: Chen-Hsiu Huang, Ja-Ling Wu
  • for: This paper proposes a compressed-domain image similarity metric, inspired by the efficient coding hypothesis in biological systems and models of the sensory cortex, so that compressed representations serve image processing and machine vision tasks as well as human vision.
  • methods: The compressed latent representation of a learned neural codec is repurposed to prioritize semantic relevance while preserving perceptual distance.
  • results: CPIPS can be derived at minimal cost from a learned neural codec and is computed significantly faster than DNN-based perceptual metrics such as LPIPS and DISTS.
    Abstract Lossy image coding standards such as JPEG and MPEG have successfully achieved high compression rates for human consumption of multimedia data. However, with the increasing prevalence of IoT devices, drones, and self-driving cars, machines rather than humans are processing a greater portion of captured visual content. Consequently, it is crucial to pursue an efficient compressed representation that caters not only to human vision but also to image processing and machine vision tasks. Drawing inspiration from the efficient coding hypothesis in biological systems and the modeling of the sensory cortex in neural science, we repurpose the compressed latent representation to prioritize semantic relevance while preserving perceptual distance. Our proposed method, Compressed Perceptual Image Patch Similarity (CPIPS), can be derived at a minimal cost from a learned neural codec and computed significantly faster than DNN-based perceptual metrics such as LPIPS and DISTS.
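The general recipe of reusing a codec's analysis transform as a perceptual feature extractor can be sketched as follows. The channel normalization and plain L2 distance are LPIPS-style assumptions, and `analysis_transform` is a placeholder for any pretrained learned-codec encoder; the actual CPIPS weighting and calibration are not reproduced here.

```python
import torch
import torch.nn.functional as F

def latent_perceptual_distance(x: torch.Tensor, y: torch.Tensor,
                               analysis_transform) -> torch.Tensor:
    """Toy CPIPS-style distance: channel-normalized L2 between codec latents.

    `analysis_transform` stands in for the encoder (analysis transform) of a
    pretrained learned image codec; any nn.Module mapping images to latents works.
    """
    with torch.no_grad():
        fx, fy = analysis_transform(x), analysis_transform(y)
    fx = F.normalize(fx, dim=1)   # unit-normalize each channel vector,
    fy = F.normalize(fy, dim=1)   # as LPIPS-style metrics do
    return ((fx - fy) ** 2).sum(dim=1).mean(dim=(1, 2))

# Example with a stand-in encoder (replace with a real codec's analysis transform).
encoder = torch.nn.Conv2d(3, 192, kernel_size=5, stride=2, padding=2)
a, b = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(latent_perceptual_distance(a, b, encoder))
```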

Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

  • paper_url: http://arxiv.org/abs/2310.00558
  • repo_url: None
  • paper_authors: Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós
  • for: This work aims to develop a domain-agnostic scene text spotting system that generalizes across multiple domains in real-world noisy environments.
  • methods: A model is trained on multi-domain source data so it can be applied directly to target domains, rather than being specialized for a specific domain or scenario through fine-tuning (a sketch of the mixed-domain training setup follows the abstract below).
  • results: The paper proposes a super-resolution based end-to-end transformer baseline, DA-TextSpotter, which achieves comparable or superior performance over existing text spotting architectures on regular and arbitrary-shaped scene text spotting benchmarks in both accuracy and model efficiency.
    Abstract When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.
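A minimal sketch of the domain-agnostic training setup, i.e., training on a mixture of source domains and then evaluating directly on an unseen target domain, is shown below. The datasets and the classifier are placeholders; the real pipeline uses scene-text datasets and the DA-TextSpotter architecture.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder "domains": in practice these would be scene-text datasets from
# different source domains (natural scenes, documents, and so on).
def fake_domain(n: int) -> TensorDataset:
    return TensorDataset(torch.rand(n, 3, 64, 256), torch.randint(0, 2, (n,)))

source_domains = [fake_domain(100), fake_domain(80), fake_domain(120)]
mixed = ConcatDataset(source_domains)          # train on all source domains jointly
loader = DataLoader(mixed, batch_size=16, shuffle=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 256, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for imgs, labels in loader:                    # one epoch of mixed-domain training
    loss = torch.nn.functional.cross_entropy(model(imgs), labels)
    opt.zero_grad(); loss.backward(); opt.step()
# The trained model is then evaluated directly on a held-out target domain
# (e.g., the UWT underwater benchmark) without fine-tuning.
```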

Seal2Real: Prompt Prior Learning on Diffusion Model for Unsupervised Document Seal Data Generation and Realisation

  • paper_url: http://arxiv.org/abs/2310.00546
  • repo_url: None
  • paper_authors: Jiancheng Huang, Yifan Liu, Yi Huang, Shifeng Chen
  • for: Provides a method for generating large amounts of labelled document-seal data, improving the performance of seal-related tasks in document processing.
  • methods: A prompt prior learning architecture built on a pre-trained Stable Diffusion model transfers the model's generative prior to the seal generation task through unsupervised training (a generic sketch of learning prompt embeddings against a frozen denoiser follows the abstract below).
  • results: Experiments on the Seal-DB dataset show that Seal2Real generates highly realistic seal images, which boosts downstream seal-related tasks on real data.
    Abstract In document processing, seal-related tasks have very large commercial applications, such as seal segmentation, seal authenticity discrimination, seal removal, and text recognition under seals. However, these seal-related tasks are highly dependent on labelled document seal datasets, resulting in very little work on these tasks. To address the lack of labelled datasets for these seal-related tasks, we propose Seal2Real, a generative method that generates a large amount of labelled document seal data, and construct a Seal-DB dataset containing 20K images with labels. In Seal2Real, we propose a prompt prior learning architecture based on a pre-trained Stable Diffusion Model that migrates the prior generative power of the model to our seal generation task with unsupervised training. The realistic seal generation capability greatly facilitates the performance of downstream seal-related tasks on real data. Experimental results on the Seal-DB dataset demonstrate the effectiveness of Seal2Real.
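The following is a generic, heavily simplified sketch of "prompt prior learning": only a small set of prompt embeddings is trainable, optimized against the noise-prediction loss of a frozen pretrained denoiser. The `FrozenDenoiser` here is a stand-in module, not Stable Diffusion, and the noising step is schematic; the paper's actual architecture and training schedule are not reproduced.

```python
import torch
import torch.nn as nn

class FrozenDenoiser(nn.Module):
    """Placeholder for a frozen pretrained diffusion denoiser (e.g. a UNet)."""
    def __init__(self, ch: int = 4, ctx_dim: int = 32):
        super().__init__()
        self.net = nn.Conv2d(ch, ch, 3, padding=1)
        self.ctx_proj = nn.Linear(ctx_dim, ch)
        for p in self.parameters():
            p.requires_grad_(False)              # the prior stays frozen

    def forward(self, noisy, t, prompt_emb):
        cond = self.ctx_proj(prompt_emb.mean(dim=1))[:, :, None, None]
        return self.net(noisy) + cond

denoiser = FrozenDenoiser()
prompt_emb = nn.Parameter(torch.randn(1, 8, 32))  # the only trainable weights
opt = torch.optim.Adam([prompt_emb], lr=1e-3)

for step in range(100):
    x0 = torch.randn(4, 4, 16, 16)                # latents of (unlabeled) seal images
    noise = torch.randn_like(x0)
    t = torch.randint(0, 1000, (4,))
    noisy = x0 + noise                            # schematic forward noising
    pred = denoiser(noisy, t, prompt_emb.expand(4, -1, -1))
    loss = ((pred - noise) ** 2).mean()           # standard eps-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
```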

Implicit Neural Representations and the Algebra of Complex Wavelets

  • paper_url: http://arxiv.org/abs/2310.00545
  • repo_url: None
  • paper_authors: T. Mitchell Roddenberry, Vishwanath Saragadam, Maarten V. de Hoop, Richard G. Baraniuk
  • for: This paper studies how implicit neural representations (INRs) can be used for signal processing and machine learning on Euclidean domains.
  • methods: Signals are parameterized as multilayer perceptrons (MLPs), and the paper analyzes INRs that use sinusoidal or wavelet activation functions.
  • results: Wavelet activations localize in both frequency and space, and high-frequency features are resolved from coarse approximations formed in the first MLP layer; this yields several design prescriptions for INR architectures, including complex wavelets, decoupling of low-pass and band-pass approximations, and initialization schemes based on the singularities of the target signal (a minimal wavelet-activated INR sketch follows the abstract below).
    Abstract Implicit neural representations (INRs) have arisen as useful methods for representing signals on Euclidean domains. By parameterizing an image as a multilayer perceptron (MLP) on Euclidean space, INRs effectively represent signals in a way that couples spatial and spectral features of the signal that is not obvious in the usual discrete representation, paving the way for continuous signal processing and machine learning approaches that were not previously possible. Although INRs using sinusoidal activation functions have been studied in terms of Fourier theory, recent works have shown the advantage of using wavelets instead of sinusoids as activation functions, due to their ability to simultaneously localize in both frequency and space. In this work, we approach such INRs and demonstrate how they resolve high-frequency features of signals from coarse approximations done in the first layer of the MLP. This leads to multiple prescriptions for the design of INR architectures, including the use of complex wavelets, decoupling of low and band-pass approximations, and initialization schemes based on the singularities of the desired signal.
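A minimal sketch of a wavelet-activated INR is given below, using a complex Gabor activation exp(i*omega*z - |s*z|^2) so that each unit is localized in both frequency and space (in the spirit of wavelet-activated INRs such as WIRE). The layer sizes and the omega/s values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaborWaveletLayer(nn.Module):
    """Linear layer followed by a complex Gabor wavelet activation:
    psi(z) = exp(i * omega * z - |s * z|^2), localized in frequency (omega)
    and in space (Gaussian window s)."""
    def __init__(self, in_dim, out_dim, omega=10.0, s=10.0, complex_input=False):
        super().__init__()
        dtype = torch.cfloat if complex_input else torch.float32
        self.linear = nn.Linear(in_dim, out_dim, dtype=dtype)
        self.omega, self.s = omega, s

    def forward(self, x):
        z = self.linear(x)
        return torch.exp(1j * self.omega * z - (self.s * torch.abs(z)) ** 2)

class WaveletINR(nn.Module):
    """Tiny INR: coordinates (x, y) in [-1, 1]^2 -> signal value."""
    def __init__(self, hidden=64, n_hidden=2):
        super().__init__()
        layers = [GaborWaveletLayer(2, hidden, complex_input=False)]
        layers += [GaborWaveletLayer(hidden, hidden, complex_input=True)
                   for _ in range(n_hidden)]
        self.body = nn.ModuleList(layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, coords):
        h = coords
        for layer in self.body:
            h = layer(h)
        return self.head(h.real)          # keep the real part for a real-valued signal

coords = torch.rand(1024, 2) * 2 - 1
print(WaveletINR()(coords).shape)         # torch.Size([1024, 1])
```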

Enabling Neural Radiance Fields (NeRF) for Large-scale Aerial Images – A Multi-tiling Approach and the Geometry Assessment of NeRF

  • paper_url: http://arxiv.org/abs/2310.00530
  • repo_url: None
  • paper_authors: Ningli Xu, Rongjun Qin, Debao Huang, Fabio Remondino
  • for: This paper aims to scale NeRF to large-scale aerial image datasets and to assess the geometric quality of the resulting reconstructions.
  • methods: The authors introduce a location-specific sampling technique and a multi-camera tiling (MCT) strategy to reduce RAM usage during image loading and GPU memory usage during representation training, and to increase the convergence rate within tiles (a sketch of tile-wise camera intrinsics follows the abstract below).
  • results: Compared against three photogrammetric MVS pipelines on two typical aerial datasets with LiDAR reference data, the proposed NeRF approach yields better completeness and object detail, although it still falls short in accuracy.
    Abstract Neural Radiance Fields (NeRF) offer the potential to benefit 3D reconstruction tasks, including aerial photogrammetry. However, the scalability and accuracy of the inferred geometry are not well-documented for large-scale aerial assets, since such datasets usually result in very high memory consumption and slow convergence. In this paper, we aim to scale NeRF on large-scale aerial datasets and provide a thorough geometry assessment of NeRF. Specifically, we introduce a location-specific sampling technique as well as a multi-camera tiling (MCT) strategy to reduce memory consumption during image loading for RAM, representation training for GPU memory, and increase the convergence rate within tiles. MCT decomposes a large-frame image into multiple tiled images with different camera models, allowing these small-frame images to be fed into the training process as needed for specific locations without a loss of accuracy. We implement our method on a representative approach, Mip-NeRF, and compare its geometry performance with three photogrammetric MVS pipelines on two typical aerial datasets against LiDAR reference data. Both qualitative and quantitative results suggest that the proposed NeRF approach produces better completeness and object details than traditional approaches, although as of now, it still falls short in terms of accuracy.
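The core geometric trick behind tiling a large aerial frame into many small-frame cameras is that cropping keeps the focal length and only shifts the principal point. A numpy sketch of this decomposition is given below; the per-tile camera models, tile size, and overlap handling in the paper's MCT implementation may differ.

```python
import numpy as np

def tile_camera(image: np.ndarray, K: np.ndarray, tile: int):
    """Split `image` into tile x tile crops, each with adjusted intrinsics.

    K is the 3x3 pinhole intrinsics of the full frame. Cropping at (x0, y0)
    keeps focal lengths but shifts the principal point, so each crop behaves
    as its own small-frame camera.
    """
    h, w = image.shape[:2]
    tiles = []
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            K_crop = K.copy()
            K_crop[0, 2] -= x0        # shift cx
            K_crop[1, 2] -= y0        # shift cy
            tiles.append((crop, K_crop))
    return tiles

# Example: a 4096x4096 aerial frame split into 1024px tiles -> 16 small cameras.
K = np.array([[3000.0, 0.0, 2048.0], [0.0, 3000.0, 2048.0], [0.0, 0.0, 1.0]])
frame = np.zeros((4096, 4096, 3), dtype=np.uint8)
print(len(tile_camera(frame, K, 1024)))   # 16
```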

Self-supervised Learning of Contextualized Local Visual Embeddings

  • paper_url: http://arxiv.org/abs/2310.00527
  • repo_url: https://github.com/sthalles/clove
  • paper_authors: Thalles Santos Silva, Helio Pedrini, Adín Ramírez Rivera
  • for: This paper presents a self-supervised, CNN-based method for learning representations suited to dense prediction tasks.
  • methods: It introduces a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity, yielding contextualized local embeddings optimized with a single loss function (a sketch of such a layer follows the abstract below).
  • results: Extensive benchmarking on multiple datasets shows state-of-the-art performance among CNN-based architectures on four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation.
    Abstract We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolutional-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from output feature maps of convolution neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks, including object detection, instance segmentation, keypoint detection, and dense pose estimation.
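A sketch of a normalized multi-head self-attention layer over CNN feature-map locations is given below: queries and keys are L2-normalized so the attention weights follow cosine similarity between local features. The head count, temperature, and projection layers are illustrative assumptions; see the linked repository for the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedSelfAttention(nn.Module):
    """Multi-head self-attention over feature-map locations with L2-normalized
    queries and keys, so attention weights depend on cosine similarity."""
    def __init__(self, dim: int, heads: int = 4, temperature: float = 0.07):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dh, self.t = heads, dim // heads, temperature
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        b, c, hh, ww = fmap.shape
        x = fmap.flatten(2).transpose(1, 2)                     # (B, N, C), N = H*W
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, hh * ww, self.h, self.dh)
        q, k, v = (t.reshape(shape).transpose(1, 2) for t in (q, k, v))  # (B, heads, N, dh)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)   # cosine-similarity attention
        attn = (q @ k.transpose(-2, -1) / self.t).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, hh * ww, c)
        return self.proj(out).transpose(1, 2).reshape(b, c, hh, ww)

# Example: contextualize a 7x7 feature map from a CNN backbone.
fmap = torch.randn(2, 256, 7, 7)
print(NormalizedSelfAttention(256)(fmap).shape)                 # torch.Size([2, 256, 7, 7])
```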