cs.CV - 2023-09-21

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

  • paper_url: http://arxiv.org/abs/2309.12530
  • repo_url: https://github.com/oodbag/rise
  • paper_authors: Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee
  • for: This paper aims to use a large vision-language model (a CLIP teacher model) to train a smaller model that generalizes to unseen domains.
  • methods: The paper proposes a new method, RISE (Regularized Invariance with Semantic Embeddings), which regularizes the student's training so that its learned image representations stay close to the CLIP teacher's text representations.
  • results: Experiments show that RISE achieves better domain generalization on multiple benchmark datasets, outperforming previous domain generalization methods.
    Abstract Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
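To make the regularization concrete, here is a minimal sketch of the two distance terms described in the abstract, assuming cosine-similarity geometry in the shared CLIP embedding space; the function names, the use of cosine distance, and the MSE over pairwise similarities are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rise_absolute_loss(student_img_emb, teacher_txt_emb):
    """Absolute distance: pull each student image embedding toward the CLIP
    teacher's text embedding of the corresponding image description."""
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_txt_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()           # mean cosine distance

def rise_relative_loss(student_img_emb, teacher_txt_emb):
    """Relative distance: match the pairwise similarity structure of the batch
    rather than the absolute embedding positions."""
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_txt_emb, dim=-1)
    return F.mse_loss(s @ s.t(), t @ t.t())              # compare batch affinity matrices

# Training would combine a task loss with one of these regularizers,
# e.g. total = cross_entropy + lam * rise_absolute_loss(img_emb, txt_emb).
```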

License Plate Super-Resolution Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.12506
  • repo_url: None
  • paper_authors: Sawsan AlHalawani, Bilel Benjdira, Adel Ammar, Anis Koubaa, Anas M. Ali
  • for: This study aims to improve the accuracy of license plate recognition, with direct applications in surveillance systems.
  • methods: It uses a cutting-edge diffusion model, trained on a curated dataset of Saudi license plates in low and high resolutions, to restore license plate images.
  • results: The diffusion model excels at license plate restoration, improving PSNR by 12.55% and 37.32% and SSIM by 4.89% and 17.66% over SwinIR and ESRGAN respectively, and 92% of human evaluators preferred its images.
    Abstract In surveillance, accurately recognizing license plates is hindered by their often low quality and small dimensions, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model using a curated dataset of Saudi license plates, both in low and high resolutions, we discovered the diffusion model's superior efficacy. The method achieves a 12.55\% and 37.32% improvement in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of Structural Similarity Index (SSIM), registering a 4.89% and 17.66% improvement over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.

Impact of architecture on robustness and interpretability of multispectral deep neural networks

  • paper_url: http://arxiv.org/abs/2309.12463
  • repo_url: https://github.com/hendrycks/robustness
  • paper_authors: Charles Godfrey, Elise Bishoff, Myles McKay, Eleanor Byler
  • for: To investigate how different fusion strategies affect the performance of multispectral deep learning models on vision tasks.
  • methods: The models use different fusion approaches, including early fusion, which stacks additional spectral bands with the RGB channels to form an input image with more than three channels, and late fusion, which passes RGB and non-RGB bands through separate branches and merges them just before the final classification or segmentation layer.
  • results: The models' performance is evaluated and their robustness to naturalistic image corruptions is analyzed. The study finds notable performance differences between early and late fusion, and the way input bands are fused affects model performance in different ways.
    Abstract Including information from additional spectral bands (e.g., near-infrared) can improve deep learning model performance for many vision-oriented tasks. There are many possible ways to incorporate this additional information into a deep learning model, but the optimal fusion strategy has not yet been determined and can vary between applications. At one extreme, known as "early fusion," additional bands are stacked as extra channels to obtain an input image with more than three channels. At the other extreme, known as "late fusion," RGB and non-RGB bands are passed through separate branches of a deep learning model and merged immediately before a final classification or segmentation layer. In this work, we characterize the performance of a suite of multispectral deep learning models with different fusion approaches, quantify their relative reliance on different input bands and evaluate their robustness to naturalistic image corruptions affecting one or more input channels.
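The two fusion extremes described in the abstract can be sketched as follows; the toy backbones, channel counts, and class count are placeholders, not the architectures studied in the paper.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Early fusion: stack extra spectral bands (e.g. NIR) as additional input channels."""
    def __init__(self, n_bands=4, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, x):                      # x: (B, n_bands, H, W)
        return self.backbone(x)

class LateFusionNet(nn.Module):
    """Late fusion: separate branches for RGB and non-RGB bands, merged just before the head."""
    def __init__(self, n_extra=1, n_classes=10):
        super().__init__()
        def branch(c_in):
            return nn.Sequential(nn.Conv2d(c_in, 32, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb_branch, self.extra_branch = branch(3), branch(n_extra)
        self.head = nn.Linear(64, n_classes)

    def forward(self, rgb, extra):             # rgb: (B, 3, H, W), extra: (B, n_extra, H, W)
        feats = torch.cat([self.rgb_branch(rgb), self.extra_branch(extra)], dim=1)
        return self.head(feats)
```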

DIOR: Dataset for Indoor-Outdoor Reidentification – Long Range 3D/2D Skeleton Gait Collection Pipeline, Semi-Automated Gait Keypoint Labeling and Baseline Evaluation Methods

  • paper_url: http://arxiv.org/abs/2309.12429
  • repo_url: None
  • paper_authors: Yuyang Chen, Praveen Raj Masilamani, Bhavin Jawade, Srirangaraj Setlur, Karthik Dantu
  • for: To provide a data-collection framework, a semi-automated annotation method, and a dataset of 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels for person identification and re-identification.
  • methods: Advanced 3D computer vision techniques attain pixel-level accuracy in indoor settings, where motion capture systems are used for annotation. For outdoor long-range settings, a low-cost hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras achieves precise skeleton labeling, even for far-away subjects whose height in the RGB frame is limited to a mere 20-25 pixels.
  • results: The pipeline yields precise skeleton labels, including 200 thousand frames from a long-range camera.
    Abstract In recent times, there is an increased interest in the identification and re-identification of people at long distances, such as from rooftop cameras, UAV cameras, street cams, and others. Such recognition needs to go beyond face and use whole-body markers such as gait. However, datasets to train and test such recognition algorithms are not widely prevalent, and fewer are labeled. This paper introduces DIOR -- a framework for data collection, semi-automated annotation, and also provides a dataset with 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200 thousands frames from a long range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. Additionally, for outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras, successfully achieving precise skeleton labeling on far-away subjects, even when their height is limited to a mere 20-25 pixels within an RGB frame. On publication, we will make our pipeline open for others to use.

Synthetic Image Detection: Highlights from the IEEE Video and Image Processing Cup 2022 Student Competition

  • paper_url: http://arxiv.org/abs/2309.12428
  • repo_url: None
  • paper_authors: Davide Cozzolino, Koki Nagano, Lucas Thomaz, Angshul Majumdar, Luisa Verdoliva
  • for: To develop a system capable of distinguishing pristine images from generated ones, motivated by the rapid progress of AI-based image generation and the resulting concerns about the trustworthiness of media content.
  • methods: Detection of synthetic images produced by modern generators, including recent diffusion models, by analyzing the characteristic traces such generators leave in images.
  • results: The resulting systems can distinguish pristine images from generated ones and cope with large volumes of synthetic content, with broad applications in verifying the trustworthiness of media and identifying generated images.
    Abstract The Video and Image Processing (VIP) Cup is a student competition that takes place each year at the IEEE International Conference on Image Processing. The 2022 IEEE VIP Cup asked undergraduate students to develop a system capable of distinguishing pristine images from generated ones. The interest in this topic stems from the incredible advances in the AI-based generation of visual data, with tools that allows the synthesis of highly realistic images and videos. While this opens up a large number of new opportunities, it also undermines the trustworthiness of media content and fosters the spread of disinformation on the internet. Recently there was strong concern about the generation of extremely realistic images by means of editing software that includes the recent technology on diffusion models. In this context, there is a need to develop robust and automatic tools for synthetic image detection.

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

  • paper_url: http://arxiv.org/abs/2309.12424
  • repo_url: None
  • paper_authors: Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
  • for: To propose a lightweight and efficient vision transformer (ViT) model that performs better on computer vision tasks.
  • methods: The model, DualToken-ViT, fuses a token carrying local information obtained from a convolution-based structure with a token carrying global information obtained from a self-attention-based structure, and uses position-aware global tokens throughout all stages to enrich the global information.
  • results: On ImageNet-1K, models of different scales achieve 75.4% and 79.4% accuracy with only 0.5G and 1.0G FLOPs respectively, and the 1.0G-FLOPs model outperforms LightViT-T, which uses global tokens, by 0.7%.
    Abstract Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various structures of ViTs, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of inductive biases of locality and translation equivariance demands larger model sizes compared to CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by convolution-based structure and the token with global information obtained by self-attention-based structure to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthening the effect of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better for vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T using global tokens by 0.7%.
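As a rough, non-authoritative reading of the dual-token idea, the block below fuses a depthwise-convolution branch (local information) with attention over a small set of learnable position-aware global tokens (global information); the layer sizes and the exact fusion are assumptions and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    """Toy fusion of a local (convolutional) branch with a global (attention) branch."""
    def __init__(self, dim=64, n_global=8, n_heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)    # depthwise conv: local info
        self.global_tokens = nn.Parameter(torch.randn(1, n_global, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                                 # x: (B, dim, H, W)
        b, c, h, w = x.shape
        local = self.local(x).flatten(2).transpose(1, 2)  # (B, HW, dim)
        pixels = x.flatten(2).transpose(1, 2)             # (B, HW, dim)
        g = self.global_tokens.expand(b, -1, -1)
        global_ctx, _ = self.attn(pixels, g, g)           # each pixel attends to the global tokens
        fused = self.proj(torch.cat([local, global_ctx], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)
```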

Speeding up Resnet Architecture with Layers Targeted Low Rank Decomposition

  • paper_url: http://arxiv.org/abs/2309.12412
  • repo_url: None
  • paper_authors: Walid Ahmed, Habib Hajimolahoseini, Austin Wen, Yang Liu
  • for: To speed up both the training and the inference of neural networks.
  • methods: Compressing network layers using low-rank decomposition, with a hardware-aware analysis to choose which layers to compress.
  • results: With hardware-targeted compression, the method achieves a 5.36% training speedup on Huawei Ascend910 and a 15.79% inference speedup on Ascend310, with only a 1% drop in accuracy compared to the original uncompressed model; experiments were run on Nvidia V100 and Huawei Ascend910 systems.
    Abstract Compression of a neural network can help in speeding up both the training and the inference of the network. In this research, we study applying compression using low rank decomposition on network layers. Our research demonstrates that to acquire a speed up, the compression methodology should be aware of the underlying hardware as analysis should be done to choose which layers to compress. The advantage of our approach is demonstrated via a case study of compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two different hardware systems Nvidia V100 and Huawei Ascend910. With hardware targeted compression, results on Ascend910 showed 5.36% training speedup and 15.79% inference speed on Ascend310 with only 1% drop in accuracy compared to the original uncompressed model
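The core idea, replacing a layer's weight matrix with a truncated low-rank factorization, can be sketched on a linear layer as below; the paper applies it to selected ResNet50 layers chosen with hardware-aware analysis, so this is only an illustration of the decomposition itself.

```python
import torch
import torch.nn as nn

def low_rank_decompose(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an (out x in) linear layer with two layers of rank r, cutting
    multiply-adds from out*in to r*(in + out) when r is small."""
    W = linear.weight.data                               # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(W.shape[1], rank, bias=False)      # x -> V_r x
    second = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    first.weight.data.copy_(Vh[:rank, :])
    second.weight.data.copy_(U[:, :rank] * S[:rank])
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)
```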

POLAR3D: Augmenting NASA’s POLAR Dataset for Data-Driven Lunar Perception and Rover Simulation

  • paper_url: http://arxiv.org/abs/2309.12397
  • repo_url: https://github.com/uwsbel/polar-digital
  • paper_authors: Bo-Hsun Chen, Peter Negrut, Thomas Liang, Nevindu Batagoda, Harry Zhang, Dan Negrut
  • for: To provide a set of 3D digital assets built on NASA's POLAR dataset for lunar perception and for synthesizing high-quality images.
  • methods: Two contributions: first, every photo in the POLAR dataset was annotated, yielding roughly 23,000 labels for rocks and their shadows; second, several lunar terrain scenarios in POLAR were digitized by constructing detailed obj files for all identifiable assets using both the lunar photos and POLAR's LiDAR point clouds.
  • results: The result is POLAR3D, a collection of digital assets comprising rock/shadow labels and digitized lunar terrain scenarios, which can be used to train perception algorithms, synthesize photorealistic images, and simulate lunar environments.
    Abstract We report on an effort that led to POLAR3D, a set of digital assets that enhance the POLAR dataset of stereo images generated by NASA to mimic lunar lighting conditions. Our contributions are twofold. First, we have annotated each photo in the POLAR dataset, providing approximately 23 000 labels for rocks and their shadows. Second, we digitized several lunar terrain scenarios available in the POLAR dataset. Specifically, by utilizing both the lunar photos and the POLAR's LiDAR point clouds, we constructed detailed obj files for all identifiable assets. POLAR3D is the set of digital assets comprising of rock/shadow labels and obj files associated with the digital twins of lunar terrain scenarios. This new dataset can be used for training perception algorithms for lunar exploration and synthesizing photorealistic images beyond the original POLAR collection. Likewise, the obj assets can be integrated into simulation environments to facilitate realistic rover operations in a digital twin of a POLAR scenario. POLAR3D is publicly available to aid perception algorithm development, camera simulation efforts, and lunar simulation exercises.POLAR3D is publicly available at https://github.com/uwsbel/POLAR-digital.

Active Stereo Without Pattern Projector

  • paper_url: http://arxiv.org/abs/2309.12315
  • repo_url: https://github.com/bartn8/vppstereo
  • paper_authors: Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti, Stefano Mattoccia
  • for: To bring the benefits of active stereo to standard passive camera systems without a physical pattern projector.
  • methods: A pattern is virtually projected onto the left and right images according to sparse measurements obtained from a depth sensor. Any such device can be seamlessly plugged into the framework, enabling a virtual active stereo setup in any environment and overcoming limitations of physical pattern projectors, such as restricted working range or environmental conditions.
  • results: Experiments on indoor/outdoor datasets, featuring both long and close range, demonstrate the seamless effectiveness of the approach, boosting the accuracy of both stereo algorithms and deep networks.
    Abstract This paper proposes a novel framework integrating the principles of active stereo in standard passive camera systems without a physical pattern projector. We virtually project a pattern over the left and right images according to the sparse measurements obtained from a depth sensor. Any such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environment, overcoming the limitation of pattern projectors, such as limited working range or environmental conditions. Experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks.
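A simplified sketch of the virtual projection idea follows: for every sparse depth measurement, the same random patch is painted at the corresponding pixel in the left image and at its disparity-shifted location in the right image, so a stereo matcher sees consistent "active" texture. The in-place painting, patch size, and uint8 grayscale assumption are simplifications; the paper's actual pattern design and blending may differ.

```python
import numpy as np

def virtually_project_pattern(left, right, sparse_depth, fx, baseline, patch=3, rng=None):
    """Paint consistent random patches into a rectified grayscale stereo pair
    at the locations of sparse depth measurements (in place)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = left.shape
    r = patch // 2
    ys, xs = np.nonzero(sparse_depth > 0)
    for y, x in zip(ys, xs):
        d = int(round(fx * baseline / sparse_depth[y, x]))     # disparity in pixels
        xr = x - d                                             # matching column in the right image
        if y < r or y >= h - r or x < r or x >= w - r or xr < r or xr >= w - r:
            continue
        pat = rng.integers(0, 256, size=(patch, patch), dtype=np.uint8)
        left[y - r:y + r + 1, x - r:x + r + 1] = pat
        right[y - r:y + r + 1, xr - r:xr + r + 1] = pat
    return left, right
```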

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

  • paper_url: http://arxiv.org/abs/2309.12314
  • repo_url: https://github.com/microsoft/Cream/tree/main/TinyCLIP
  • paper_authors: Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi, Chen, Xinggang Wang, Hongyang Chao, Han Hu
  • for: This work proposes a new cross-modal distillation method, TinyCLIP, for large-scale language-image pre-trained models.
  • methods: TinyCLIP introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking exploits the interaction between modalities during distillation, letting the student mimic the teacher's cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the teacher's pre-trained weights to the student to improve distillation efficiency.
  • results: Experiments show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50% while maintaining comparable zero-shot performance, and distillation with weight inheritance speeds up training by 1.4-7.8x compared to training from scratch. TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves 41.1% zero-shot top-1 accuracy on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while using only 8.9% of the parameters, and transfers well to various downstream tasks. Code and models will be released at https://aka.ms/tinyclip.
    Abstract In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
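A minimal sketch of an affinity-mimicking term is given below, assuming both teacher and student produce image and text embeddings for the same batch; the KL-divergence form, temperature, and symmetric averaging are plausible choices rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(img_s, txt_s, img_t, txt_t, tau=0.07):
    """Make the student's image-text affinity distribution over the batch
    match the teacher's (and symmetrically for text-to-image)."""
    def affinity(img, txt):
        return F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t() / tau   # (B, B)

    a_s, a_t = affinity(img_s, txt_s), affinity(img_t, txt_t)
    i2t = F.kl_div(F.log_softmax(a_s, dim=-1), F.softmax(a_t, dim=-1), reduction="batchmean")
    t2i = F.kl_div(F.log_softmax(a_s.t(), dim=-1), F.softmax(a_t.t(), dim=-1), reduction="batchmean")
    return 0.5 * (i2t + t2i)
```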

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.12306
  • repo_url: None
  • paper_authors: Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung
  • for: The goal is Active Speaker Detection (ASD), determining whether a person is speaking in a series of video frames. Previous work has focused on network architectures, while learning effective representations has received less attention.
  • methods: A novel talk-aware contrastive loss, TalkNCE, applied only to the segments where the person on screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. The loss can be jointly optimized with existing ASD training objectives without additional supervision or training data.
  • results: The method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
    Abstract The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
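As a hedged illustration, a talk-aware contrastive term could look like the InfoNCE sketch below, applied only to frames flagged as actively speaking; the frame-level audio/visual embeddings, the boolean mask, and the symmetric cross-entropy form are assumptions rather than the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def talk_aware_nce(audio_emb, visual_emb, speaking_mask, tau=0.07):
    """InfoNCE over per-frame audio/visual embeddings (T, D), restricted by a
    boolean mask (T,) to the frames where the on-screen person is speaking."""
    a = F.normalize(audio_emb[speaking_mask], dim=-1)
    v = F.normalize(visual_emb[speaking_mask], dim=-1)
    logits = a @ v.t() / tau                              # matching frames lie on the diagonal
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```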

SlowFast Network for Continuous Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2309.12304
  • repo_url: None
  • paper_authors: Junseok Ahn, Youngjoon Jang, Joon Son Chung
  • for: To improve the extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR).
  • methods: A two-pathway SlowFast network in which each pathway operates at a different temporal resolution to separately capture spatial information (hand shapes, facial expressions) and dynamic information (movements). Two fusion methods tailored to CSLR are also proposed: Bi-directional Feature Fusion (BFF), which transfers dynamic semantics into spatial semantics and vice versa, and Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks without extra inference time.
  • results: The model achieves state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
    Abstract The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks, while avoiding the need for extra inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.

PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.12303
  • repo_url: https://github.com/shilinyan99/panovos
  • paper_authors: Shilin Yan, Xiaohao Xu, Lingyi Hong, Wenchao Chen, Wenqiang Zhang, Wei Zhang
  • for: This paper introduces a new panoramic video object segmentation dataset and a new segmentation method built on it.
  • methods: Fifteen off-the-shelf video object segmentation models are evaluated on the dataset; error analysis shows that all of them fail to handle the pixel-level content discontinuities of panoramic videos. To address this, the paper proposes pixel-level matching that exploits the semantic boundary information of the previous frame.
  • results: Compared with the previous state-of-the-art models, the proposed PSCFormer network shows a clear advantage in segmentation results under the panoramic setting.
    Abstract Panoramic videos contain richer spatial information and have attracted tremendous amounts of attention due to their exceptional experience in some fields such as autonomous driving and virtual reality. However, existing datasets for video segmentation only focus on conventional planar images. To address the challenge, in this paper, we present a panoramic video dataset, PanoVOS. The dataset provides 150 videos with high video resolutions and diverse motions. To quantify the domain gap between 2D planar videos and panoramic videos, we evaluate 15 off-the-shelf video object segmentation (VOS) models on PanoVOS. Through error analysis, we found that all of them fail to tackle pixel-level content discontinues of panoramic videos. Thus, we present a Panoramic Space Consistency Transformer (PSCFormer), which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame. Extensive experiments demonstrate that compared with the previous SOTA models, our PSCFormer network exhibits a great advantage in terms of segmentation results under the panoramic setting. Our dataset poses new challenges in panoramic VOS and we hope that our PanoVOS can advance the development of panoramic segmentation/tracking.

Text-Guided Vector Graphics Customization

  • paper_url: http://arxiv.org/abs/2309.12302
  • repo_url: None
  • paper_authors: Peiying Zhang, Nanxuan Zhao, Jing Liao
  • for: Generating high-quality customized vector graphics from text prompts.
  • methods: Large pre-trained text-to-image models are leveraged (by fine-tuning their cross-attention layers) to generate customized raster images guided by text prompts, and the SVG is initialized with a semantic-based path alignment method.
  • results: The pipeline generates diverse, high-quality customized vector graphics and achieves strong results under evaluations from vector-level, image-level, and text-level perspectives.
    Abstract Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while preserving the properties and layer-wise information of a given exemplar SVG. Our method harnesses the capabilities of large pre-trained text-to-image models. By fine-tuning the cross-attention layers of the model, we generate customized raster images guided by textual prompts. To initialize the SVG, we introduce a semantic-based path alignment method that preserves and transforms crucial paths from the exemplar SVG. Additionally, we optimize path parameters using both image-level and vector-level losses, ensuring smooth shape deformation while aligning with the customized raster image. We extensively evaluate our method using multiple metrics from vector-level, image-level, and text-level perspectives. The evaluation results demonstrate the effectiveness of our pipeline in generating diverse customizations of vector graphics with exceptional quality. The project page is https://intchous.github.io/SVGCustomization.

Adaptive Input-image Normalization for Solving the Mode Collapse Problem in GAN-based X-ray Images

  • paper_url: http://arxiv.org/abs/2309.12245
  • repo_url: None
  • paper_authors: Muhammad Muneeb Saad, Mubashir Husain Rehmani, Ruairi O’Reilly
  • for: To increase the diversity of synthetic X-ray image datasets used for augmentation, improving the performance of machine learning classifiers.
  • methods: Generative adversarial networks (DCGAN and ACGAN) are used to generate augmented X-ray image datasets, with adaptive input-image normalization to alleviate the mode collapse problem.
  • results: DCGAN and ACGAN with adaptive input-image normalization outperform their counterparts trained on un-normalized X-ray images in both diversity scores and classification scores.
    Abstract Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment datasets. It is important to generate synthetic images that incorporate a diverse range of features to accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem impacts Generative Adversarial Networks' capacity to generate diversified images. Mode collapse comes in two varieties: intra-class and inter-class. In this paper, both varieties of the mode collapse problem are investigated, and their subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization with the Deep Convolutional GAN and Auxiliary Classifier GAN to alleviate the mode collapse problems. Synthetically generated images are utilized for data augmentation and training a Vision Transformer model. The classification performance of the model is evaluated using accuracy, recall, and precision scores. Results demonstrate that the DCGAN and the ACGAN with adaptive input-image normalization outperform the DCGAN and ACGAN with un-normalized X-ray images as evidenced by the superior diversity scores and classification scores.
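The abstract does not spell out the adaptive input-image normalization, so the snippet below only illustrates one plausible reading: rescaling each X-ray by its own intensity statistics before it enters the GAN, instead of using a single dataset-wide range. The percentile choice and output range are assumptions.

```python
import numpy as np

def adaptive_normalize(xray, eps=1e-8):
    """Per-image intensity normalization: clip to the image's own 1st-99th
    percentile range and map to [-1, 1], the usual DCGAN input range."""
    lo, hi = np.percentile(xray, 1), np.percentile(xray, 99)
    scaled = np.clip((xray - lo) / (hi - lo + eps), 0.0, 1.0)
    return scaled * 2.0 - 1.0
```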

Can We Reliably Improve the Robustness to Image Acquisition of Remote Sensing of PV Systems?

  • paper_url: http://arxiv.org/abs/2309.12214
  • repo_url: None
  • paper_authors: Gabriel Kasmi, Laurent Dubus, Yves-Marie Saint-Drenan, Philippe Blanc
  • for: Monitoring the evolution of the rooftop PV installed fleet at a regional scale.
  • methods: The wavelet scale attribution method (WCAM) is used to assess the robustness and reliability of deep learning models.
  • results: Improved reliability and robustness of deep learning systems, encouraging their use for the safe integration of clean energy into electric systems.
    Abstract Photovoltaic (PV) energy is crucial for the decarbonization of energy systems. Due to the lack of centralized data, remote sensing of rooftop PV installations is the best option to monitor the evolution of the rooftop PV installed fleet at a regional scale. However, current techniques lack reliability and are notably sensitive to shifts in the acquisition conditions. To overcome this, we leverage the wavelet scale attribution method (WCAM), which decomposes a model's prediction in the space-scale domain. The WCAM enables us to assess on which scales the representation of a PV model rests and provides insights to derive methods that improve the robustness to acquisition conditions, thus increasing trust in deep learning systems to encourage their use for the safe integration of clean energy in electric systems.

Brain Tumor Detection Using Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2309.12193
  • repo_url: https://github.com/Arminsbss/tumor-classification
  • paper_authors: Razia Sultana Misu
  • for: This study aims to automate brain tumor detection using deep learning techniques.
  • methods: Five transfer learning models were used: VGG16, VGG19, DenseNet121, ResNet50, and YOLO V4, with ResNet50 providing the best accuracy of 99.54%.
  • results: The study shows that deep learning techniques can detect brain tumors accurately, with the ResNet50 model achieving the highest accuracy.
    Abstract Brain tumors are collections of abnormal cells that can develop into masses or clusters. Because they have the potential to infiltrate other tissues, they pose a risk to the patient. The main imaging technique used, MRI, may be able to identify a brain tumor with accuracy. The fast development of Deep Learning methods for use in computer vision applications has been facilitated by a vast amount of training data and improvements in model construction that offer better approximations in a supervised setting. The need for these approaches has been the main driver of this expansion. Deep learning methods have shown promise in improving the precision of brain tumor detection and classification using magnetic resonance imaging (MRI). The study on the use of deep learning techniques, especially ResNet50, for brain tumor identification is presented in this abstract. As a result, this study investigates the possibility of automating the detection procedure using deep learning techniques. In this study, I utilized five transfer learning models which are VGG16, VGG19, DenseNet121, ResNet50 and YOLO V4 where ResNet50 provide the best or highest accuracy 99.54%. The goal of the study is to guide researchers and medical professionals toward powerful brain tumor detecting systems by employing deep learning approaches by way of this evaluation and analysis.
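A minimal transfer-learning setup of the kind the study describes might look like the sketch below (PyTorch/torchvision, assuming torchvision >= 0.13 for the weights enum); the two-class head, freezing policy, and preprocessing are illustrative, not the paper's exact training recipe.

```python
import torch.nn as nn
from torchvision import models

def build_brain_tumor_classifier(n_classes=2, freeze_backbone=True):
    """ImageNet-pretrained ResNet50 with a new classification head for brain MRI."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False                       # train only the new head
    model.fc = nn.Linear(model.fc.in_features, n_classes)
    return model
```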

SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs

  • paper_url: http://arxiv.org/abs/2309.12188
  • repo_url: None
  • paper_authors: Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di, Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam
  • for: To provide a lightweight, real-time, user-controllable object rearrangement framework for environment interaction in embodied AI.
  • methods: A coarse-to-fine scheme that uses a scene graph as the scene representation and follows a three-fold procedure of observation, imagination, and execution.
  • results: Experimental results show that SG-Bot outperforms competitors by a large margin.
    Abstract Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure--observation, imagination, and execution--to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph subsequently informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.

ORTexME: Occlusion-Robust Human Shape and Pose via Temporal Average Texture and Mesh Encoding

  • paper_url: http://arxiv.org/abs/2309.12183
  • repo_url: None
  • paper_authors: Yu Cheng, Bo Wang, Robby T. Tan
  • for: Addresses the problem of occlusion in 3D human shape and pose estimation from monocular videos, which is common in real-world scenarios.
  • methods: proposed an occlusion-robust temporal method called ORTexME, which utilizes temporal information from the input video to better regularize occluded body parts. The method is based on NeRF, and uses a novel average texture learning approach and human body mesh to guide the opacity-field updates and suppress blur and noise.
  • results: achieved significant improvement on the challenging multi-person 3DPW dataset, with 1.8 P-MPJPE error reduction compared to the state-of-the-art rendering-based methods, which enlarged the error up to 5.6 on the same dataset.
    Abstract In 3D human shape and pose estimation from a monocular video, models trained with limited labeled data cannot generalize well to videos with occlusion, which is common in the wild videos. The recent human neural rendering approaches focusing on novel view synthesis initialized by the off-the-shelf human shape and pose methods have the potential to correct the initial human shape. However, the existing methods have some drawbacks such as, erroneous in handling occlusion, sensitive to inaccurate human segmentation, and ineffective loss computation due to the non-regularized opacity field. To address these problems, we introduce ORTexME, an occlusion-robust temporal method that utilizes temporal information from the input video to better regularize the occluded body parts. While our ORTexME is based on NeRF, to determine the reliable regions for the NeRF ray sampling, we utilize our novel average texture learning approach to learn the average appearance of a person, and to infer a mask based on the average texture. In addition, to guide the opacity-field updates in NeRF to suppress blur and noise, we propose the use of human body mesh. The quantitative evaluation demonstrates that our method achieves significant improvement on the challenging multi-person 3DPW dataset, where our method achieves 1.8 P-MPJPE error reduction. The SOTA rendering-based methods fail and enlarge the error up to 5.6 on the same dataset.

Autoregressive Sign Language Production: A Gloss-Free Approach with Discrete Representations

  • paper_url: http://arxiv.org/abs/2309.12179
  • repo_url: None
  • paper_authors: Eui Jun Hwang, Huije Lee, Jong C. Park
  • for: This paper presents an approach that translates spoken language sentences directly into sign language, without intermediate glosses.
  • methods: The paper proposes the Sign language Vector Quantization Network, which uses vector quantization to derive discrete representations from sign pose sequences.
  • results: The method outperforms prior SLP methods in comprehensive evaluations, and the paper highlights the reliability of Back-Translation and Fréchet Gesture Distance as evaluation metrics.
    Abstract Gloss-free Sign Language Production (SLP) offers a direct translation of spoken language sentences into sign language, bypassing the need for gloss intermediaries. This paper presents the Sign language Vector Quantization Network, a novel approach to SLP that leverages Vector Quantization to derive discrete representations from sign pose sequences. Our method, rooted in both manual and non-manual elements of signing, supports advanced decoding methods and integrates latent-level alignment for enhanced linguistic coherence. Through comprehensive evaluations, we demonstrate superior performance of our method over prior SLP methods and highlight the reliability of Back-Translation and Fr\'echet Gesture Distance as evaluation metrics.
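The vector-quantization step can be illustrated with a standard VQ-VAE-style layer: each continuous pose feature is replaced by its nearest codebook entry, with a straight-through estimator and commitment loss. Codebook size, feature dimension, and the loss weighting are assumptions; the paper's network wraps such a step inside an autoregressive SLP model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, n_codes=512, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / n_codes, 1.0 / n_codes)
        self.beta = beta

    def forward(self, z):                                  # z: (B, T, dim) pose features
        dists = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        idx = dists.argmin(dim=-1)                         # (B, T) discrete codes
        z_q = self.codebook(idx)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, idx, loss
```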

SANPO: A Scene Understanding, Accessibility, Navigation, Pathfinding, Obstacle Avoidance Dataset

  • paper_url: http://arxiv.org/abs/2309.12172
  • repo_url: None
  • paper_authors: Sagar M. Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko
  • for: The paper is aimed at researchers and developers working on video segmentation, depth estimation, multi-task visual modeling, and synthetic-to-real domain adaptation.
  • methods: The paper introduces SANPO, a large-scale egocentric video dataset containing stereo video sessions collected in diverse outdoor environments as well as rendered synthetic video sessions (synthetic data provided by Parallel Domain). The dataset includes dense depth and odometry labels, and temporally consistent dense panoptic segmentation labels for all synthetic sessions and a subset of real sessions.
  • results: The paper provides zero-shot baselines and SANPO benchmarks for future research, with the goal of advancing the state of the art in the above areas while enabling human navigation systems.
    Abstract We introduce SANPO, a large-scale egocentric video dataset focused on dense prediction in outdoor environments. It contains stereo video sessions collected across diverse outdoor environments, as well as rendered synthetic video sessions. (Synthetic data was provided by Parallel Domain.) All sessions have (dense) depth and odometry labels. All synthetic sessions and a subset of real sessions have temporally consistent dense panoptic segmentation labels. To our knowledge, this is the first human egocentric video dataset with both large scale dense panoptic segmentation and depth annotations. In addition to the dataset we also provide zero-shot baselines and SANPO benchmarks for future research. We hope that the challenging nature of SANPO will help advance the state-of-the-art in video segmentation, depth estimation, multi-task visual modeling, and synthetic-to-real domain adaptation, while enabling human navigation systems. SANPO is available here: https://google-research-datasets.github.io/sanpo_dataset/

Information Forensics and Security: A quarter-century-long journey

  • paper_url: http://arxiv.org/abs/2309.12159
  • repo_url: None
  • paper_authors: Mauro Barni, Patrizio Campisi, Edward J. Delp, Gwenael Doërr, Jessica Fridrich, Nasir Memon, Fernando Pérez-González, Anderson Rocha, Luisa Verdoliva, Min Wu
  • for: Ensuring that people use devices, data, and intellectual properties for authorized purposes, and facilitating the gathering of solid evidence to hold perpetrators accountable.
  • methods: Technological advances in various focus areas, including but not limited to signal processing, data analysis, and machine learning, to address the societal needs of the digital information era.
  • results: Landmark technical contributions and future trends in the field of Information Forensics and Security (IFS) over the last 25 years, as celebrated by the IEEE Signal Processing Society (SPS).
    Abstract Information Forensics and Security (IFS) is an active R&D area whose goal is to ensure that people use devices, data, and intellectual properties for authorized purposes and to facilitate the gathering of solid evidence to hold perpetrators accountable. For over a quarter century since the 1990s, the IFS research area has grown tremendously to address the societal needs of the digital information era. The IEEE Signal Processing Society (SPS) has emerged as an important hub and leader in this area, and the article below celebrates some landmark technical contributions. In particular, we highlight the major technological advances on some selected focus areas in the field developed in the last 25 years from the research community and present future trends.

Vulnerability of 3D Face Recognition Systems to Morphing Attacks

  • paper_url: http://arxiv.org/abs/2309.12118
  • repo_url: None
  • paper_authors: Sanjeet Vardam, Luuk Spreeuwers
  • for: This work investigates the robustness of 3D face recognition (3DFR) systems against 3D face morphing attacks.
  • methods: The paper describes methods for generating high-quality 3D face morphs and compares the generated morphs with the contributing faces to obtain similarity scores.
  • results: When the 3DFR systems are attacked with look-a-like morphs, the highest MMPMR obtained is around 40%, with an RMMR of 41.76%.
    Abstract In recent years face recognition systems have been brought to the mainstream due to development in hardware and software. Consistent efforts are being made to make them better and more secure. This has also brought developments in 3D face recognition systems at a rapid pace. These 3DFR systems are expected to overcome certain vulnerabilities of 2DFR systems. One such problem that the domain of 2DFR systems face is face image morphing. A substantial amount of research is being done for generation of high quality face morphs along with detection of attacks from these morphs. Comparatively the understanding of vulnerability of 3DFR systems against 3D face morphs is less. But at the same time an expectation is set from 3DFR systems to be more robust against such attacks. This paper attempts to research and gain more information on this matter. The paper describes a couple of methods that can be used to generate 3D face morphs. The face morphs that are generated using this method are then compared to the contributing faces to obtain similarity scores. The highest MMPMR is obtained around 40% with RMMR of 41.76% when 3DFRS are attacked with look-a-like morphs.

AutoPET Challenge 2023: Sliding Window-based Optimization of U-Net

  • paper_url: http://arxiv.org/abs/2309.12114
  • repo_url: https://github.com/matt3o/autopet2-submission
  • paper_authors: Matthias Hadlich, Zdravko Marinov, Rainer Stiefelhagen
  • for: This work aims to improve the accuracy of tumor segmentation in medical imaging by combining PET and CT, which provide metabolic and anatomical information respectively.
  • methods: FDG-PET/CT scans are used; the AutoPET challenge targets tumor-specific FDG uptake, and an automated segmentation method separates tumors from normal tissue.
  • results: The work builds on a dataset of 1014 FDG-PET/CT studies and supports accurate tumor segmentation and analysis within the FDG-PET/CT domain.
    Abstract Tumor segmentation in medical imaging is crucial and relies on precise delineation. Fluorodeoxyglucose Positron-Emission Tomography (FDG-PET) is widely used in clinical practice to detect metabolically active tumors. However, FDG-PET scans may misinterpret irregular glucose consumption in healthy or benign tissues as cancer. Combining PET with Computed Tomography (CT) can enhance tumor segmentation by integrating metabolic and anatomic information. FDG-PET/CT scans are pivotal for cancer staging and reassessment, utilizing radiolabeled fluorodeoxyglucose to highlight metabolically active regions. Accurately distinguishing tumor-specific uptake from physiological uptake in normal tissues is a challenging aspect of precise tumor segmentation. The AutoPET challenge addresses this by providing a dataset of 1014 FDG-PET/CT studies, encouraging advancements in accurate tumor segmentation and analysis within the FDG-PET/CT domain. Code: https://github.com/matt3o/AutoPET2-Submission/
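Given the submission's title, a generic sliding-window inference loop is sketched below: the PET/CT volume is tiled into overlapping patches, the network is run on each patch, and overlapping logits are averaged. The window/stride sizes and uniform averaging are assumptions; the actual submission may rely on dedicated tooling.

```python
import torch

def _starts(size, win, step):
    """Window start offsets, always including the last valid start so the volume is fully covered."""
    last = max(size - win, 0)
    s = list(range(0, last + 1, step))
    if s[-1] != last:
        s.append(last)
    return s

@torch.no_grad()
def sliding_window_predict(model, volume, window=(64, 128, 128), stride=(32, 64, 64), n_classes=2):
    """Average overlapping patch logits over a (1, C, D, H, W) volume.
    Assumes the model returns logits with the same spatial size as its input patch."""
    _, _, D, H, W = volume.shape
    logits = torch.zeros(1, n_classes, D, H, W)
    counts = torch.zeros(1, 1, D, H, W)
    for z in _starts(D, window[0], stride[0]):
        for y in _starts(H, window[1], stride[1]):
            for x in _starts(W, window[2], stride[2]):
                sl = (..., slice(z, z + window[0]), slice(y, y + window[1]), slice(x, x + window[2]))
                logits[sl] += model(volume[sl])
                counts[sl] += 1
    return logits / counts
```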

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

  • paper_url: http://arxiv.org/abs/2309.12110
  • repo_url: None
  • paper_authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
  • for: This paper investigates how the recent CLIP multimodal image pretraining model can be applied to several tasks in the artwork domain.
  • methods: It relies on visual models trained with semantically dense textual supervision, which tend to generalize better than models trained with categorical attributes or unsupervised techniques.
  • results: In extensive experiments on the NoisyArt dataset, CLIP achieves impressive results on zero-shot classification and promising results on both artwork-to-artwork and description-to-artwork retrieval.
    Abstract Given the recent advances in multimodal image pretraining where visual models trained with semantically dense textual supervision tend to have better generalization capabilities than those trained using categorical attributes or through unsupervised techniques, in this work we investigate how recent CLIP model can be applied in several tasks in artwork domain. We perform exhaustive experiments on the NoisyArt dataset which is a dataset of artwork images crawled from public resources on the web. On such dataset CLIP achieves impressive results on (zero-shot) classification and promising results in both artwork-to-artwork and description-to-artwork domain.
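Zero-shot classification with CLIP, as evaluated in the paper, follows the standard recipe below (using OpenAI's `clip` package); the class names, prompt template, and image path are placeholders, since NoisyArt defines its own label set.

```python
import torch
import clip                                   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["baroque painting", "impressionist painting", "marble sculpture"]   # hypothetical labels
text = clip.tokenize([f"an artwork of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)          # placeholder path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```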
    摘要
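The zero-shot classification results follow the standard CLIP recipe: encode a textual description of each candidate class, encode the query image, and pick the class with the highest cosine similarity. A minimal sketch with the openai `clip` package is shown below; the artwork class names and the prompt template are placeholders, not the paper's setup.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical artwork classes; in the paper these come from the NoisyArt metadata.
class_names = ["the Mona Lisa", "The Starry Night", "Girl with a Pearl Earring"]
prompts = clip.tokenize([f"a photo of the artwork {c}" for c in class_names]).to(device)

image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```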

FourierLoss: Shape-Aware Loss Function with Fourier Descriptors

  • paper_url: http://arxiv.org/abs/2309.12106
  • repo_url: None
  • paper_authors: Mehmet Bahadir Erden, Selahattin Cansiz, Onur Caki, Haya Khattak, Durmus Etiz, Melek Cosar Yakar, Kerem Duruer, Berke Barut, Cigdem Gunduz-Demir
  • for: 这个研究旨在提高医学影像分割任务中对物体形状的保持能力与分割准确性。
  • methods: 研究在编码器-解码器网络的基础上提出了一个新的形状感知损失函数FourierLoss,通过物体轮廓的傅里叶描述子来量化真实标注与预测分割图之间的形状差异,并在训练中对该差异进行惩罚。
  • results: 研究显示,借助所提出的自适应损失更新机制,网络可以在不同训练阶段将注意力从学习物体的大致轮廓转移到学习轮廓细节(或反之),从而提高分割准确性。在93名受试者的2879张计算机断层扫描影像上,该方法在肝脏分割上取得了统计学上显著优于对比方法的结果。
    Abstract Encoder-decoder networks become a popular choice for various medical image segmentation tasks. When they are trained with a standard loss function, these networks are not explicitly enforced to preserve the shape integrity of an object in an image. However, this ability of the network is important to obtain more accurate results, especially when there is a low-contrast difference between the object and its surroundings. In response to this issue, this work introduces a new shape-aware loss function, which we name FourierLoss. This loss function relies on quantifying the shape dissimilarity between the ground truth and the predicted segmentation maps through the Fourier descriptors calculated on their objects, and penalizing this dissimilarity in network training. Different than the previous studies, FourierLoss offers an adaptive loss function with trainable hyperparameters that control the importance of the level of the shape details that the network is enforced to learn in the training process. This control is achieved by the proposed adaptive loss update mechanism, which end-to-end learns the hyperparameters simultaneously with the network weights by backpropagation. As a result of using this mechanism, the network can dynamically change its attention from learning the general outline of an object to learning the details of its contour points, or vice versa, in different training epochs. Working on 2879 computed tomography images of 93 subjects, our experiments revealed that the proposed adaptive shape-aware loss function led to statistically significantly better results for liver segmentation, compared to its counterparts.
    摘要 编码器-解码器网络已成为各类医学影像分割任务的流行选择。当使用标准损失函数训练时,这些网络并没有被显式地要求保持图像中物体的形状完整性。然而,这一能力对于获得更准确的结果非常重要,尤其是当物体与其周围环境对比度较低时。针对这一问题,本研究提出了一种新的形状感知损失函数,称为FourierLoss。该损失函数通过在物体上计算傅里叶描述子来量化真实标注与预测分割图之间的形状差异,并在网络训练中惩罚这一差异。与以往研究不同,FourierLoss提供了带有可训练超参数的自适应损失函数,用于控制网络在训练过程中需要学习的形状细节层级的重要性。这一控制由所提出的自适应损失更新机制实现,该机制通过反向传播与网络权重同时端到端地学习这些超参数。借助该机制,网络可以在不同训练轮次中动态地将注意力从学习物体的大致轮廓转移到学习其轮廓点的细节,或者反之。在93名受试者的2879张计算机断层扫描图像上的实验表明,所提出的自适应形状感知损失函数在肝脏分割上取得了统计学上显著优于对比方法的结果。
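FourierLoss quantifies shape dissimilarity via Fourier descriptors computed on the objects of the ground-truth and predicted masks. The sketch below shows one plausible way to compute such a descriptor distance with OpenCV and NumPy; the normalization choices are assumptions, and the adaptive, trainable weighting of descriptor levels described in the abstract is omitted.

```python
import cv2
import numpy as np

def fourier_descriptors(mask, n_coeffs=16):
    """First `n_coeffs` Fourier descriptor magnitudes of the largest contour of a binary mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return np.zeros(n_coeffs)
    contour = max(contours, key=cv2.contourArea).squeeze(1)    # (N, 2) boundary points
    z = contour[:, 0].astype(np.float64) + 1j * contour[:, 1]  # boundary as a complex signal
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_coeffs + 1])                      # drop coeff 0 (translation)
    mags = np.pad(mags, (0, n_coeffs - len(mags)))             # very short contours: pad with zeros
    scale = np.abs(coeffs[1]) if len(coeffs) > 1 else 1.0
    return mags / (scale + 1e-8)                               # normalize for scale invariance

def fourier_shape_dissimilarity(pred_mask, gt_mask, n_coeffs=16):
    """Squared L2 distance between the Fourier descriptors of two binary masks."""
    return float(np.sum((fourier_descriptors(pred_mask, n_coeffs)
                         - fourier_descriptors(gt_mask, n_coeffs)) ** 2))
```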

Bayesian sparsification for deep neural networks with Bayesian model reduction

  • paper_url: http://arxiv.org/abs/2309.12095
  • repo_url: https://github.com/dimarkov/bmr4pml
  • paper_authors: Dimitrije Marković, Karl J. Friston, Stefan J. Kiebel
  • for: 这篇论文旨在探讨贝叶斯稀疏化技术在深度学习中的应用,以提高深度学习模型的计算效率和性能。
  • methods: 现有最优方法将模型权重上的结构收缩先验与基于随机变分推断的近似推断相结合;本研究提倡使用贝叶斯模型约简(BMR)作为更高效的权重剪枝替代方案。
  • results: 研究比较了不同的稀疏化方法,结果显示贝叶斯模型约简(BMR)方法在多种深度学习架构(从LeNet到Vision Transformer和MLP-Mixer)上都表现优越,并且更为简单和高效。
    Abstract Deep learning's immense capabilities are often constrained by the complexity of its models, leading to an increasing demand for effective sparsification techniques. Bayesian sparsification for deep learning emerges as a crucial approach, facilitating the design of models that are both computationally efficient and competitive in terms of performance across various deep learning applications. The state-of-the-art -- in Bayesian sparsification of deep neural networks -- combines structural shrinkage priors on model weights with an approximate inference scheme based on stochastic variational inference. However, model inversion of the full generative model is exceptionally computationally demanding, especially when compared to standard deep learning of point estimates. In this context, we advocate for the use of Bayesian model reduction (BMR) as a more efficient alternative for pruning of model weights. As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc elimination of redundant model weights based on the posterior estimates under a straightforward (non-hierarchical) generative model. Our comparative study highlights the advantages of the BMR method relative to established approaches based on hierarchical horseshoe priors over model weights. We illustrate the potential of BMR across various deep learning architectures, from classical networks like LeNet to modern frameworks such as Vision Transformers and MLP-Mixers.
    摘要
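Bayesian model reduction scores each weight post hoc by asking how the model evidence would change if that weight's prior were shrunk toward zero, using only the Gaussian posterior already estimated. The sketch below implements the closed-form log-evidence change for factorized Gaussian priors and posteriors (a generalization of the Savage-Dickey ratio, as the abstract notes) and prunes weights whose removal is favored; the prior variances, the reduced prior and the example posterior are assumptions, not values from the paper.

```python
import numpy as np

def bmr_delta_free_energy(post_mean, post_var, prior_var, reduced_prior_var=1e-8):
    """Change in log evidence when shrinking a Gaussian prior N(0, prior_var) to
    N(0, reduced_prior_var), given the factorized Gaussian posterior
    N(post_mean, post_var). Positive values mean the evidence favors pruning."""
    a = 1.0 / reduced_prior_var - 1.0 / prior_var
    return (0.5 * np.log(prior_var / reduced_prior_var)
            - 0.5 * np.log1p(a * post_var)
            - a * post_mean**2 / (2.0 * (1.0 + a * post_var)))

# Hypothetical posterior estimates for one layer's weights.
mu = np.random.randn(1000) * 0.5
var = np.full(1000, 0.01)
delta_f = bmr_delta_free_energy(mu, var, prior_var=1.0)
keep_mask = delta_f <= 0.0   # keep only weights whose removal would lower the evidence
print(f"pruned {np.mean(~keep_mask):.1%} of weights")
```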

Multi-Task Cooperative Learning via Searching for Flat Minima

  • paper_url: http://arxiv.org/abs/2309.12090
  • repo_url: None
  • paper_authors: Fuping Wu, Le Zhang, Yang Sun, Yuanhan Mo, Thomas Nichols, Bartlomiej W. Papiez
  • for: This paper is written for medical image analysis, specifically to improve the generalizability of learned features and performance in individual tasks using multi-task learning (MTL).
  • methods: The paper proposes a multi/bi-level optimization approach to MTL, where features are learned in a cooperative manner by updating the sub-model for each task alternatively, taking advantage of the learned sub-models of the other tasks. To alleviate negative transfer, the paper searches for flat minima with regard to features from other tasks.
  • results: The proposed method is validated on three publicly available datasets and shows promising results compared to state-of-the-art MTL approaches, demonstrating the effectiveness of cooperative learning in medical image analysis.
    Abstract Multi-task learning (MTL) has shown great potential in medical image analysis, improving the generalizability of the learned features and the performance in individual tasks. However, most of the work on MTL focuses on either architecture design or gradient manipulation, while in both scenarios, features are learned in a competitive manner. In this work, we propose to formulate MTL as a multi/bi-level optimization problem, and therefore force features to learn from each task in a cooperative approach. Specifically, we update the sub-model for each task alternatively taking advantage of the learned sub-models of the other tasks. To alleviate the negative transfer problem during the optimization, we search for flat minima for the current objective function with regard to features from other tasks. To demonstrate the effectiveness of the proposed approach, we validate our method on three publicly available datasets. The proposed method shows the advantage of cooperative learning, and yields promising results when compared with the state-of-the-art MTL approaches. The code will be available online.
    摘要

Self-Calibrating, Fully Differentiable NLOS Inverse Rendering

  • paper_url: http://arxiv.org/abs/2309.12047
  • repo_url: None
  • paper_authors: Kiseok Choi, Inchul Kim, Dongyoung Choi, Julio Marco, Diego Gutierrez, Min H. Kim
  • for: This paper aims to improve the accuracy and robustness of non-line-of-sight (NLOS) imaging methods for reconstructing hidden scenes.
  • methods: The proposed method uses a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates imaging parameters during the reconstruction process, using measured illumination in both the time and frequency domains.
  • results: The method is able to consistently reconstruct detailed geometry and albedo of hidden scenes, even under significant noise levels, by using a combination of diffraction-based volumetric NLOS reconstruction, path-space light transport, and a simple ray marching technique.
    Abstract Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.
    摘要 现有的时间分辨非视距(NLOS)成像方法通过反演在可见中继面上测得的间接光照的光学路径来重建隐藏场景。由于反演歧义和采集噪声,这些方法容易产生重建伪影,通常需要手动选择滤波函数和参数来缓解。我们提出了一个完全可微的端到端NLOS逆向渲染管线,它在重建隐藏场景的同时自校准成像参数,仅以测得的光照作为输入,并同时在时域和频域中工作。该管线从NLOS体积强度中提取隐藏场景的几何表示,并利用可微瞬态渲染估计该几何信息在中继墙上产生的时间分辨光照;随后通过梯度下降最小化模拟光照与测量光照之间的误差来优化成像参数。我们的端到端可微管线将基于衍射的体积NLOS重建与路径空间光传输以及简单的光线步进技术相结合,以提取隐藏场景中密集而细致的表面点和法向量。实验表明,即使在显著的噪声水平下,该方法也能稳定地重建几何与反照率。

Beyond Image Borders: Learning Feature Extrapolation for Unbounded Image Composition

  • paper_url: http://arxiv.org/abs/2309.12042
  • repo_url: https://github.com/liuxiaoyu1104/unic
  • paper_authors: Xiaoyu Liu, Ming Liu, Junyi Li, Shuai Liu, Xiaotao Wang, Lei Lei, Wangmeng Zuo
  • for: 提高图像组合和美观品质,大多数现有方法会修剪捕捉到的图像,但这些方法的修剪范围有限。
  • methods: 我们提出了一个联合框架,可以同时进行无限的摄像头视图建议和图像组合(i.e., UNIC),以确保生成的修剪图像是真实的和图像质量高。
  • results: 我们的方法可以在基于现有图像剪辑 datasets 的 dataset 上进行广泛的实验,并显示了我们的 UNIC 在无限的摄像头视图建议和图像组合方面的效果。
    Abstract For improving image composition and aesthetic quality, most existing methods modulate the captured images by striking out redundant content near the image borders. However, such image cropping methods are limited in the range of image views. Some methods have been suggested to extrapolate the images and predict cropping boxes from the extrapolated image. Nonetheless, the synthesized extrapolated regions may be included in the cropped image, making the image composition result not real and potentially with degraded image quality. In this paper, we circumvent this issue by presenting a joint framework for both unbounded recommendation of camera view and image composition (i.e., UNIC). In this way, the cropped image is a sub-image of the image acquired by the predicted camera view, and thus can be guaranteed to be real and consistent in image quality. Specifically, our framework takes the current camera preview frame as input and provides a recommendation for view adjustment, which contains operations unlimited by the image borders, such as zooming in or out and camera movement. To improve the prediction accuracy of view adjustment prediction, we further extend the field of view by feature extrapolation. After one or several times of view adjustments, our method converges and results in both a camera view and a bounding box showing the image composition recommendation. Extensive experiments are conducted on the datasets constructed upon existing image cropping datasets, showing the effectiveness of our UNIC in unbounded recommendation of camera view and image composition. The source code, dataset, and pretrained models is available at https://github.com/liuxiaoyu1104/UNIC.
    摘要 为提高图像组合和艺术质量,现有方法通常对捕捉到的图像进行剪辑,但这些图像剪辑方法有限制的视野范围。一些方法已经建议了从拟合图像中预测剪辑框。然而,生成的拟合区域可能包含在剪辑后的图像中,导致图像组合结果不真实并且可能受到质量下降的影响。在这篇论文中,我们解决了这个问题,提出了一个共同框架,即UNIC,以实现无限制的摄像头视野和图像组合。具体来说,我们的框架接受当前摄像头预览帧作为输入,并提供无限制的视野调整建议,包括图像边缘不受限制的缩放、摄像头移动等操作。为了提高视野调整预测精度,我们还进一步扩展了视野范围,通过特征拟合。经过一次或多次视野调整,我们的方法会 converges,并产生一个摄像头视野和图像组合建议。我们在基于现有图像剪辑数据集构建的数据集上进行了广泛的实验,证明了我们的UNIC在无限制的摄像头视野和图像组合方面的效果。源代码、数据集和预训练模型可以在https://github.com/liuxiaoyu1104/UNIC上下载。

BASE: Probably a Better Approach to Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2309.12035
  • repo_url: None
  • paper_authors: Martin Vonheim Larsen, Sigmund Rolfsjord, Daniel Gusland, Jörgen Ahlberg, Kim Mathiassen
  • for: 这篇论文是为了探讨可靠的视觉对象跟踪方法,以帮助解决现有的跟踪问题。
  • methods: 这篇论文使用了 bayesian 方法,并提出了一种简单、高效的视觉跟踪模型,称为 BASE( bayesian approximation single-hypothesis estimator),可以在 MOT17 和 MOT20 上达到 state-of-the-art 水平。
  • results: 该模型在 MOT17 和 MOT20 上实现了 state-of-the-art 的跟踪效果,而无需使用 Re-Id。
    Abstract The field of visual object tracking is dominated by methods that combine simple tracking algorithms and ad hoc schemes. Probabilistic tracking algorithms, which are leading in other fields, are surprisingly absent from the leaderboards. We found that accounting for distance in target kinematics, exploiting detector confidence and modelling non-uniform clutter characteristics is critical for a probabilistic tracker to work in visual tracking. Previous probabilistic methods fail to address most or all these aspects, which we believe is why they fall so far behind current state-of-the-art (SOTA) methods (there are no probabilistic trackers in the MOT17 top 100). To rekindle progress among probabilistic approaches, we propose a set of pragmatic models addressing these challenges, and demonstrate how they can be incorporated into a probabilistic framework. We present BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant and easily extendible visual tracker, achieving state-of-the-art (SOTA) on MOT17 and MOT20, without using Re-Id. Code will be made available at https://github.com/ffi-no
    摘要 视觉目标跟踪领域主要由简单跟踪算法与临时方案相结合的方法主导。在其他领域处于领先地位的概率跟踪算法,在视觉跟踪排行榜上却意外缺席。我们发现,要让概率跟踪器在视觉跟踪中发挥作用,关键在于考虑目标运动学中的距离、利用检测器置信度并建模非均匀的杂波特性。以往的概率方法大多没有处理这些方面,我们认为这正是它们远远落后于当前最先进(SOTA)方法的原因(MOT17前100名中没有概率跟踪器)。为重振概率方法的进展,我们提出了一组应对上述挑战的务实模型,并展示了如何将它们纳入概率框架。我们提出了BASE(Bayesian Approximation Single-hypothesis Estimator),一种简单、高性能且易于扩展的视觉跟踪器,在不使用Re-Id的情况下在MOT17和MOT20上达到了最先进水平。代码将在 https://github.com/ffi-no 提供。

Face Identity-Aware Disentanglement in StyleGAN

  • paper_url: http://arxiv.org/abs/2309.12033
  • repo_url: None
  • paper_authors: Adrian Suwała, Bartosz Wójcik, Magdalena Proszewska, Jacek Tabor, Przemysław Spurek, Marek Śmieja
  • for: 本研究旨在解决 Conditional GANs manipulate 人脸图像的特征(如表情、发型、姿势、年龄)时同时改变人脸图像的身份特征的问题。
  • methods: 我们提出了 PluGeN4Faces,一个 StyleGAN 插件,可以显著分离人脸图像的特征和人脸图像的身份特征。我们的关键想法是在 Movie Frames 中检索到人物出现在不同的姿势和特征下的图像,然后通过一种对比损失来鼓励模型将同一个人的图像分配到相似的 latent space 中。
  • results: 我们的实验结果表明,PluGeN4Faces 对人脸图像的特征进行修改时,对图像的其他特征的改变相对较少,与现有状态的模型相比。
    Abstract Conditional GANs are frequently used for manipulating the attributes of face images, such as expression, hairstyle, pose, or age. Even though the state-of-the-art models successfully modify the requested attributes, they simultaneously modify other important characteristics of the image, such as a person's identity. In this paper, we focus on solving this problem by introducing PluGeN4Faces, a plugin to StyleGAN, which explicitly disentangles face attributes from a person's identity. Our key idea is to perform training on images retrieved from movie frames, where a given person appears in various poses and with different attributes. By applying a type of contrastive loss, we encourage the model to group images of the same person in similar regions of latent space. Our experiments demonstrate that the modifications of face attributes performed by PluGeN4Faces are significantly less invasive on the remaining characteristics of the image than in the existing state-of-the-art models.
    摘要 条件GAN常被用于操纵人脸图像的属性,例如表情、发型、姿态或年龄。尽管最先进的模型能够成功修改所要求的属性,但它们同时也会改变图像的其他重要特征,例如人物身份。本文针对这一问题提出了PluGeN4Faces,一个StyleGAN插件,可显式地将人脸属性与人物身份解耦。我们的核心思想是在电影帧中检索同一人物在不同姿态和属性下出现的图像进行训练,并通过一种对比损失鼓励模型将同一人物的图像聚集在潜在空间的相近区域。实验表明,与现有最先进模型相比,PluGeN4Faces对人脸属性的修改对图像其余特征的影响明显更小。

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

  • paper_url: http://arxiv.org/abs/2309.12029
  • repo_url: https://github.com/cyfml/opstl
  • paper_authors: Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: 这篇论文旨在让自主机器人系统中基于骨架的动作识别能够应对目标被遮挡的情况。
  • methods: 论文先在遮挡骨架序列上预训练,利用KMeans聚类与KNN最近邻来填补缺失的骨架数据,并在PSTL基础上提出了带自适应空间掩码(ASM)的OPSTL框架。
  • results: 在NTURGB+D 60和NTURGB+D 120的遮挡版本上的实验表明,该填补方法能有效提升现有基于骨架的自监督模型的识别性能。
    Abstract To integrate action recognition methods into autonomous robotic systems, it is crucial to consider adverse situations involving target occlusions. Such a scenario, despite its practical relevance, is rarely addressed in existing self-supervised skeleton-based action recognition methods. To empower robots with the capacity to address occlusion, we propose a simple and effective method. We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples. Next, we employ K-nearest-neighbor (KNN) to fill in missing skeleton data based on the closest sample neighbors. Imputing incomplete skeleton sequences to create relatively complete sequences as input provides significant benefits to existing skeleton-based self-supervised models. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning (OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) for better use of high-quality, intact skeletons. The effectiveness of our imputation methods is verified on the challenging occluded versions of the NTURGB+D 60 and NTURGB+D 120. The source code will be made publicly available at https://github.com/cyfml/OPSTL.
    摘要 为了将动作识别方法integrated into autonomous robotic systems,需要考虑目标 occlusion 的情况。这种情况,虽然在现有的自适应skeleton-based action recognition方法中 rarely addressed,但它在实际应用中非常重要。为了赋能机器人处理 occlusion,我们提出了一种简单而有效的方法。我们首先使用 occluded skeleton sequences 进行预训练,然后使用 k-means clustering (KMeans) 对序列嵌入进行分组。接着,我们使用 K-nearest-neighbor (KNN) 填充 incomplete skeleton 数据,基于最近的样本 neighborgood 的 nearest 邻居。填充 incomplete skeleton sequences,以创建比较完整的输入序列,对现有skeleton-based self-supervised模型带来了显著的改进。此外,我们在 Partial Spatio-Temporal Learning (PSTL) 框架之上进行了更新,并增加了 Adaptive Spatial Masking (ASM),以更好地利用高质量、完整的skeleton。我们证明了我们的填充方法的效果,在 NTURGB+D 60 和 NTURGB+D 120 的 occluded 版本上进行了测试。源代码将在 https://github.com/cyfml/OPSTL 上公开。
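The imputation step described above first groups sequence embeddings with k-means and then fills missing skeleton entries from the nearest neighbours within a cluster. A minimal scikit-learn sketch of that idea is given below; the embedding dimensionality, the flattened skeleton layout and the cluster/neighbour counts are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Hypothetical inputs: one embedding per sequence, plus flattened occluded
# skeleton sequences with NaNs where joints are missing.
embeddings = np.random.randn(500, 256)
skeletons = np.random.randn(500, 300)            # e.g. T*J*3 values per sequence
skeletons[np.random.rand(*skeletons.shape) < 0.2] = np.nan

# 1) Group semantically similar samples.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)

# 2) Within each cluster, fill missing entries from the k nearest embeddings.
imputed = skeletons.copy()
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    nn = NearestNeighbors(n_neighbors=min(5, len(idx))).fit(embeddings[idx])
    _, neigh = nn.kneighbors(embeddings[idx])
    for row, neighbours in zip(idx, neigh):
        missing = np.isnan(imputed[row])
        if missing.any():
            donor = np.nanmean(skeletons[idx[neighbours]], axis=0)  # NaNs in donors are ignored
            imputed[row, missing] = donor[missing]
```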

Precision in Building Extraction: Comparing Shallow and Deep Models using LiDAR Data

  • paper_url: http://arxiv.org/abs/2309.12027
  • repo_url: None
  • paper_authors: Muhammad Sulaiman, Mina Farmanbar, Ahmed Nabil Belbachir, Chunming Rong
  • for: 本文评估在监督式建筑物分割中引入LiDAR数据的作用,并比较浅层模型与深度学习模型的分割精度。
  • methods: 本文采用可解释的浅层模型(RF、XGBoost、LightGBM),并由原始掩码生成边界掩码以提高建筑物边界的评估精度。
  • results: 仅使用航拍影像时浅层模型的IoU比深度学习模型高8%,结合航拍影像与LiDAR数据时高2%;而深度学习模型在BIoU上表现更好。边界掩码使BIoU提升4%,LightGBM的表现优于RF和XGBoost。
    Abstract Building segmentation is essential in infrastructure development, population management, and geological observations. This article targets shallow models due to their interpretable nature to assess the presence of LiDAR data for supervised segmentation. The benchmark data used in this article are published in NORA MapAI competition for deep learning model. Shallow models are compared with deep learning models based on Intersection over Union (IoU) and Boundary Intersection over Union (BIoU). In the proposed work, boundary masks from the original mask are generated to improve the BIoU score, which relates to building shapes' borderline. The influence of LiDAR data is tested by training the model with only aerial images in task 1 and a combination of aerial and LiDAR data in task 2 and then compared. shallow models outperform deep learning models in IoU by 8% using aerial images (task 1) only and 2% in combined aerial images and LiDAR data (task 2). In contrast, deep learning models show better performance on BIoU scores. Boundary masks improve BIoU scores by 4% in both tasks. Light Gradient-Boosting Machine (LightGBM) performs better than RF and Extreme Gradient Boosting (XGBoost).
    摘要 监测建筑物分割是基础设施开发、人口管理和地质观测中的关键。这篇文章主要针对使用 shallow model,因为它们的解释能力可以评估 LiDAR 数据是否对 supervised segmentation 有影响。这篇文章使用的标准数据来自 NORA MapAI 比赛,这是深度学习模型的 benchmark。在这篇文章中, shallow model 与深度学习模型进行比较,使用 Intersection over Union (IoU) 和 Boundary Intersection over Union (BIoU) 两个指标。在提议的工作中,从原始Mask中生成了Boundary Mask,以提高 BIoU 分数,这与建筑物的边界相关。在任务1中,使用只有飞行图像的情况下,shallow model 在 IoU 上比深度学习模型高出8%,而在任务2中,使用飞行图像和 LiDAR 数据的组合时,shallow model 和深度学习模型的分数相似。然而,深度学习模型在 BIoU 分数上表现更好。Boundary Mask 在两个任务中提高 BIoU 分数4%。Light Gradient-Boosting Machine (LightGBM) 在 RF 和 Extreme Gradient Boosting (XGBoost) 之上表现更好。
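The shallow models compared here are gradient-boosted trees applied to per-pixel features built from the aerial imagery and the LiDAR channel. A minimal per-pixel LightGBM classifier with a simple IoU check is sketched below; the feature construction and hyperparameters are illustrative, not the paper's configuration.

```python
import numpy as np
import lightgbm as lgb

def pixel_features(rgb, lidar_height):
    """Per-pixel features: the three RGB channels plus a LiDAR-derived height value."""
    return np.concatenate([rgb.reshape(-1, 3), lidar_height.reshape(-1, 1)], axis=1)

# Hypothetical training tile and its building mask.
rgb = np.random.rand(256, 256, 3)
lidar = np.random.rand(256, 256)
mask = (np.random.rand(256, 256) > 0.8).astype(int)

X = pixel_features(rgb, lidar)
y = mask.reshape(-1)

clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=63, learning_rate=0.05)
clf.fit(X, y)

pred = clf.predict(X).reshape(256, 256)
iou = np.logical_and(pred, mask).sum() / max(np.logical_or(pred, mask).sum(), 1)
print(f"train IoU: {iou:.3f}")
```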

Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection

  • paper_url: http://arxiv.org/abs/2309.12010
  • repo_url: https://github.com/summitgao/camixer
  • paper_authors: Haopeng Zhang, Zijing Lin, Feng Gao, Junyu Dong, Qian Du, Heng-Chao Li
  • for: 本文旨在提出一种基于Transformer-like架构的SAR变化检测方法,以提高SAR图像变化检测的精度和效率。
  • methods: 本文提出了一种名为卷积-注意力混合器(CAMixer)的新方法,它将自注意力与移位卷积并行结合,以同时提取全局语义信息和局部特征,并在前馈网络中引入门控机制来增强非线性特征变换。
  • results: 在三个SAR数据集上的实验结果表明,CAMixer方法具有更高的检测精度,并对SAR图像变化检测任务更加稳健。
    Abstract Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attentions in the remote sensing community. However, existing SAR change detection methods are mainly based on convolutional neural networks (CNNs), with limited consideration of global attention mechanism. In this letter, we explore Transformer-like architecture for SAR change detection to incorporate global attention. To this end, we propose a convolution and attention mixer (CAMixer). First, to compensate the inductive bias for Transformer, we combine self-attention with shift convolution in a parallel way. The parallel design effectively captures the global semantic information via the self-attention and performs local feature extraction through shift convolution simultaneously. Second, we adopt a gating mechanism in the feed-forward network to enhance the non-linear feature transformation. The gating mechanism is formulated as the element-wise multiplication of two parallel linear layers. Important features can be highlighted, leading to high-quality representations against speckle noise. Extensive experiments conducted on three SAR datasets verify the superior performance of the proposed CAMixer. The source codes will be publicly available at https://github.com/summitgao/CAMixer .
    摘要 合成孔径雷达(SAR)图像变化检测是遥感领域中一项关键任务,近年来受到越来越多的关注。然而,现有的SAR变化检测方法主要基于卷积神经网络(CNN),对全局注意力机制的考虑有限。本文探索类Transformer架构在SAR变化检测中的应用,以引入全局注意力。为此,我们提出了卷积-注意力混合器(CAMixer)。首先,为弥补Transformer的归纳偏置,我们将自注意力与移位卷积以并行方式结合:自注意力捕捉全局语义信息,移位卷积同时完成局部特征提取。其次,我们在前馈网络中采用门控机制来增强非线性特征变换,该机制由两个并行线性层的逐元素相乘构成,使重要特征得以突出,从而获得对斑点噪声更稳健的高质量表示。在三个SAR数据集上的大量实验验证了所提CAMixer的优越性能。源代码将在 https://github.com/summitgao/CAMixer 公开。
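The abstract names two ingredients: self-attention run in parallel with a shift convolution, and a feed-forward network gated by the element-wise product of two parallel linear layers. A simplified PyTorch block combining both is sketched below; the channel sizes, the depthwise convolution standing in for the shift convolution, and the residual arrangement are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ConvAttentionMixerBlock(nn.Module):
    """Parallel self-attention + depthwise (shift-style) convolution, then a gated FFN."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # local branch
        self.norm2 = nn.LayerNorm(dim)
        self.fc_value = nn.Linear(dim, dim)
        self.fc_gate = nn.Linear(dim, dim)   # gate = element-wise product of two linear layers
        self.fc_out = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                     # (B, H*W, C)
        t = self.norm1(tokens)
        global_branch, _ = self.attn(t, t, t)                     # global semantic information
        local_branch = self.conv(x).flatten(2).transpose(1, 2)    # local feature extraction
        tokens = tokens + global_branch + local_branch
        t = self.norm2(tokens)
        tokens = tokens + self.fc_out(self.fc_value(t) * self.fc_gate(t))
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# y = ConvAttentionMixerBlock()(torch.randn(2, 64, 32, 32))
```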

Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

  • paper_url: http://arxiv.org/abs/2309.12009
  • repo_url: https://github.com/desehuileng0o0/ikem
  • paper_authors: Yiping Wei, Kunyu Peng, Alina Roitberg, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: 本研究旨在提高人体动作识别的自动学习性能,特别是使用多modalities setup时的表现。
  • methods: 我们首先提出了隐式知识交换模块(IKEM),以缓解低性能模态之间错误知识的传播;随后提出了三种新的模态来丰富模态间的互补信息;最后提出了一种新的师生框架(关系型跨模态知识蒸馏),在引入新模态的同时保持效率,在锚点、正样本和负样本的约束下将次要模态中的知识蒸馏到必要模态中。
  • results: 实验结果表明,我们的方法有效地提高了skeleton-based多modalities数据的表现,这标志着我们的approach可以有效地使用多modalities setup进行人体动作识别。
    Abstract Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data while using a multi-modality setup. These works overlooked the differences in performance among modalities, which led to the propagation of erroneous knowledge between modalities while only three fundamental modalities, i.e., joints, bones, and motions are used, hence no additional modalities are explored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM) which alleviates the propagation of erroneous knowledge between low-performance modalities. Then, we further propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework to distill the knowledge from the secondary modalities into the mandatory modalities considering the relationship constrained by anchors, positives, and negatives, named relational cross-modality knowledge distillation. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.
    摘要 人体动作识别的自监督表示学习近年来发展迅速。现有工作大多基于骨架数据并采用多模态设置,但它们忽视了不同模态之间的性能差异,导致错误知识在模态间传播;同时仅使用关节、骨骼和运动这三种基本模态,没有探索额外的模态。在本工作中,我们首先提出了隐式知识交换模块(IKEM),以缓解错误知识在低性能模态之间的传播;随后提出了三种新的模态,以丰富模态间的互补信息;最后,为在引入新模态时保持效率,我们提出了一种新的师生框架,在锚点、正样本和负样本约束的关系下,将次要模态中的知识蒸馏到必要模态中,称为关系型跨模态知识蒸馏。实验结果证明了该方法的有效性,释放了基于骨架的多模态数据的高效利用。源代码将在 https://github.com/desehuileng0o0/IKEM 公开。

Identification of pneumonia on chest x-ray images through machine learning

  • paper_url: http://arxiv.org/abs/2309.11995
  • repo_url: https://github.com/Nabeel-105/Covid-19-and-Pneumonia-Detection-Using-Chest-Xray-Images-Full-Desktop-Application-
  • paper_authors: Eduardo Augusto Roeder
  • for: 该研究旨在开发一款用于识别儿童胸部X光图像中是否存在肺炎的软件。
  • methods: 该软件是基于机器学习技术的计算模型,使用迁移学习技术进行训练。
  • results: 经过训练后,模型在新的图像上达到了98%的敏感度和97.3%的特异性。
    Abstract Pneumonia is the leading infectious cause of infant death in the world. When identified early, it is possible to alter the prognosis of the patient, one could use imaging exams to help in the diagnostic confirmation. Performing and interpreting the exams as soon as possible is vital for a good treatment, with the most common exam for this pathology being chest X-ray. The objective of this study was to develop a software that identify the presence or absence of pneumonia in chest radiographs. The software was developed as a computational model based on machine learning using transfer learning technique. For the training process, images were collected from a database available online with children's chest X-rays images taken at a hospital in China. After training, the model was then exposed to new images, achieving relevant results on identifying such pathology, reaching 98% sensitivity and 97.3% specificity for the sample used for testing. It can be concluded that it is possible to develop a software that identifies pneumonia in chest X-ray images.
    摘要 肺炎是全球婴儿因感染死亡的首要原因。若能及早发现,就有可能改变患者的预后,而影像检查可以辅助确诊。尽早完成并解读检查对良好的治疗至关重要,其中最常用的检查是胸部X光。本研究的目的是开发一款识别胸部X光片中是否存在肺炎的软件。该软件是基于机器学习的计算模型,采用迁移学习技术训练,训练图像来自一家中国医院公开的儿童胸部X光图像数据库。训练完成后,模型在新图像上取得了98%的敏感度和97.3%的特异度。由此可见,开发一款能在胸部X光图像中识别肺炎的软件是可行的。
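The software is a transfer-learning model trained on the public paediatric chest X-ray dataset. The sketch below shows a typical setup of that kind with a recent torchvision (a pretrained ResNet-18 backbone, frozen, with a new two-class head); the backbone choice, folder layout and hyperparameters are assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Pretrained backbone with a fresh 2-class head (normal vs. pneumonia).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained features
model.fc = nn.Linear(model.fc.in_features, 2)    # only the new head is trained

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # chest X-rays are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Assumed folder layout: chest_xray/train/NORMAL, chest_xray/train/PNEUMONIA
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("chest_xray/train", transform=tfm), batch_size=32, shuffle=True)

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```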

Neural Stochastic Screened Poisson Reconstruction

  • paper_url: http://arxiv.org/abs/2309.11993
  • repo_url: None
  • paper_authors: Silvia Sellán, Alec Jacobson
  • for: 用于从点云数据中重建三维表面
  • methods: 使用神经网络在泊松平滑先验下研究并量化重建的不确定性
  • results: 解决了现有工作的主要限制,可以完全集成到3D扫描流程中,从获取初始重建、决定下一个最佳传感器位置,到在采集更多数据后更新重建
    Abstract Reconstructing a surface from a point cloud is an underdetermined problem. We use a neural network to study and quantify this reconstruction uncertainty under a Poisson smoothness prior. Our algorithm addresses the main limitations of existing work and can be fully integrated into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction upon capturing more data.
    摘要 从点云重建表面是一个欠定问题。我们使用神经网络在泊松平滑先验下研究并量化这种重建不确定性。我们的算法解决了现有工作的主要限制,可以完全集成到3D扫描流程中:从获取初始重建,到决定下一个最佳传感器位置,再到在采集更多数据后更新重建。

Crop Row Switching for Vision-Based Navigation: A Comprehensive Approach for Efficient Crop Field Navigation

  • paper_url: http://arxiv.org/abs/2309.11989
  • repo_url: None
  • paper_authors: Rajitha de Silva, Grzegorz Cielniak, Junfeng Gao
  • for: 该论文旨在开发一种基于视觉的移动机器人Navigation系统,可以在农业用途中涵盖整个田地。
  • methods: 该论文使用了深度学习的RGB图像分割和深度数据,通过探测农作物的结束和下一排农作物的重新入口来实现视觉基于的农作物行进管理策略。
  • results: 在一个存在作物行不连续、光照变化、阴影和不规则地头表面的真实甜菜田中测试了该策略,机器人能够成功驶出一行作物并重新进入下一行,线性与旋转步骤的绝对中位误差分别为19.25cm和6.77度。
    Abstract Vision-based mobile robot navigation systems in arable fields are mostly limited to in-row navigation. The process of switching from one crop row to the next in such systems is often aided by GNSS sensors or multiple camera setups. This paper presents a novel vision-based crop row-switching algorithm that enables a mobile robot to navigate an entire field of arable crops using a single front-mounted camera. The proposed row-switching manoeuvre uses deep learning-based RGB image segmentation and depth data to detect the end of the crop row, and re-entry point to the next crop row which would be used in a multi-state row switching pipeline. Each state of this pipeline use visual feedback or wheel odometry of the robot to successfully navigate towards the next crop row. The proposed crop row navigation pipeline was tested in a real sugar beet field containing crop rows with discontinuities, varying light levels, shadows and irregular headland surfaces. The robot could successfully exit from one crop row and re-enter the next crop row using the proposed pipeline with absolute median errors averaging at 19.25 cm and 6.77{\deg} for linear and rotational steps of the proposed manoeuvre.
    摘要 视觉基于移动机器人农业场 Navigation 系统通常只能进行行间 navigation。 switching 过程中常用 GNSS 感知器或多个摄像头设计。本文提出了一种新的视觉基于的农作物行 switching 算法,可以使移动机器人在一个全场农作物中进行整个途径。提出的行 switching 举动使用深度学习基于 RGB 图像分割和深度数据检测农作物行的结束和下一行的重新入口点,并在多个状态的管道中使用视觉反馈或机器人轮胎速度进行成功导航到下一行农作物。该管道在实际的糖葱田中进行测试,包括农作物行间缺陷、变化的照明水平、阴影和不规则的机场表面。机器人可以成功从一个农作物行出现在下一个农作物行中使用提案的管道,相对 median 误差为 19.25 cm 和 6.77°。

ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.11986
  • repo_url: None
  • paper_authors: Philipp Ausserlechner, David Haberger, Stefan Thalhammer, Jean-Baptiste Weibel, Markus Vincze
  • for: zeroshot 6D对象pose estimation
  • methods: 使用pre-trained Vision Transformers(ViT)抽取视觉描述符,并使用RANSAC-based PnP算法对比query图像和模板图像进行对应。
  • results: 与两种最新的新物体6D位姿估计方法相比,相对其中一种方法在全部三个数据集上提高了平均召回率,相对另一种方法在两个数据集上有所提升。
    Abstract As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a demand to recognize diverse objects. The state-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel object pose estimation methods are solving this issue using task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures. MegaPose for example is trained on a dataset consisting of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming we introduce ZS6D, for zero-shot novel object 6D pose estimation. Visual descriptors, extracted using pre-trained Vision Transformers (ViT), are used for matching rendered templates against query images of objects and for establishing local correspondences. These local correspondences enable deriving geometric correspondences and are used for estimating the object's 6D pose with RANSAC-based PnP. This approach showcases that the image descriptors extracted by pre-trained ViTs are well-suited to achieve a notable improvement over two state-of-the-art novel object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS. In comparison to one of the two methods we improve the Average Recall on all three datasets and compared to the second method we improve on two datasets.
    摘要 为了应对机器人系统在复杂和无束缚的实际场景中识别多种物体的需求,现状的6D物体姿态估计方法依赖于物体特定的训练,因此无法泛化到未见过的物体。最新的novel object pose estimation方法通过使用任务特定的精度调整的 convolutional neural networks (CNNs) 进行深度模板匹配来解决这个问题。这种适应仍然需要昂贵的数据渲染和训练过程。例如,MegaPose 是在包含20,000个不同的物体图像中训练的,以达到这种泛化能力。为了解决这个缺点,我们介绍了 Zero-shot Novel Object 6D Pose Estimation(ZS6D)方法。我们使用预训练的 Vision Transformers (ViT) 提取的视觉描述符来匹配渲染的模板图像和查询图像之间的本地匹配。这些本地匹配使得我们可以 derivation геометрические匹配,并用 RANSAC-based PnP 方法来估计物体的6D姿态。这种方法显示了预训练的 ViT 提取的图像描述符能够达到两种现状的novel object 6D pose estimation方法的显著改进,无需进行任务特定的精度调整。我们在 LMO、YCBV 和 TLESS 上进行了实验,与两种方法进行比较。相比之下,我们在所有三个数据集上的均值回归得分都有所提高,相比第二种方法,我们在两个数据集上有所提高。
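Pose recovery in this setting amounts to matching ViT patch descriptors between the query image and rendered templates, then solving a RANSAC-based PnP on the resulting 2D-3D correspondences. The fragment below sketches only that matching-plus-PnP step with OpenCV, assuming descriptors, patch centers and the templates' per-patch 3D coordinates are already available; names, shapes and the similarity threshold are placeholders.

```python
import cv2
import numpy as np

def estimate_pose(query_desc, query_xy, tmpl_desc, tmpl_xyz, K):
    """Match L2-normalized patch descriptors (cosine similarity) and solve PnP with RANSAC.

    query_desc: (Nq, D) descriptors with 2D patch centers query_xy (Nq, 2)
    tmpl_desc:  (Nt, D) template descriptors with known 3D points tmpl_xyz (Nt, 3)
    K:          (3, 3) camera intrinsics
    """
    sim = query_desc @ tmpl_desc.T                 # cosine similarity matrix
    best = sim.argmax(axis=1)
    keep = sim.max(axis=1) > 0.5                   # hypothetical similarity threshold
    if keep.sum() < 4:                             # PnP needs at least 4 correspondences
        return None
    obj_pts = tmpl_xyz[best[keep]].astype(np.float64)
    img_pts = query_xy[keep].astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None, reprojectionError=3.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                     # rotation matrix and translation vector
    return R, tvec
```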

Spatially Guiding Unsupervised Semantic Segmentation Through Depth-Informed Feature Distillation and Sampling

  • paper_url: http://arxiv.org/abs/2309.12378
  • repo_url: None
  • paper_authors: Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski
  • for: 降低需要人工标注的劳动成本,通过不监督学习方法进行 semantic segmentation 训练。
  • methods: 利用图像随机样本的特征进行学习,并通过 depth 信息了解场景结构。
  • results: 对多个 benchmark 数据集进行了广泛的实验,并得到了显著的性能改进。
    Abstract Traditionally, training neural networks to perform semantic segmentation required expensive human-made annotations. But more recently, advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlate the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) implementing farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
    摘要 传统上,训练神经网络进行语义分割需要昂贵的人工标注。近年来,无监督学习的进展在缩小与监督算法差距方面取得了显著进步:通过在整个数据集上关联随机采样的图像特征来蒸馏语义知识。本工作在此基础上,通过深度信息将场景结构引入训练过程:1. 将特征图与深度图进行空间相关性学习,从而获得关于场景结构的知识;2. 利用深度信息上的3D采样技术(最远点采样)更有效地选择相关特征。最后,我们通过大量实验证明了这些技术贡献的有效性,并在多个基准数据集上取得了显著的性能提升。

NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.11966
  • repo_url: https://github.com/FlorisE/neural-labeling
  • paper_authors: Floris Erich, Naoya Chiba, Yusuke Yoshiyasu, Noriaki Ando, Ryo Hanai, Yukiyasu Domae
  • for: 该论文旨在提出一种基于神经辐射场(NeRF)的场景标注方法和工具集,用于生成分割掩码、可供性图、2D包围框、3D包围框、6DOF物体位姿、深度图和物体网格。
  • methods: 该方法使用NeRF作为渲染器,仅以多视角拍摄的图像作为输入,即可利用3D空间工具进行标注,并结合遮挡等几何线索。
  • results: 为展示其在机器人实际问题中的适用性,作者为洗碗机中玻璃杯的30000帧透明物体RGB图像和带噪深度图添加了深度真值,构建了Dishwasher30k数据集;使用标注深度图进行监督训练的简单深度神经网络,其重建性能高于此前采用的弱监督方法。
    Abstract We present NeuralLabeling, a labeling approach and toolset for annotating a scene using either bounding boxes or meshes and generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps and object meshes. NeuralLabeling uses Neural Radiance Fields (NeRF) as renderer, allowing labeling to be performed using 3D spatial tools while incorporating geometric clues such as occlusions, relying only on images captured from multiple viewpoints as input. To demonstrate the applicability of NeuralLabeling to a practical problem in robotics, we added ground truth depth maps to 30000 frames of transparent object RGB and noisy depth maps of glasses placed in a dishwasher captured using an RGBD sensor, yielding the Dishwasher30k dataset. We show that training a simple deep neural network with supervision using the annotated depth maps yields a higher reconstruction performance than training with the previously applied weakly supervised approach.
    摘要 我们介绍NeuralLabeling,一种使用包围框或网格对场景进行标注,并生成分割掩码、可供性图、2D包围框、3D包围框、6DOF物体位姿、深度图和物体网格的标注方法与工具集。NeuralLabeling使用神经辐射场(NeRF)作为渲染器,仅以多视角拍摄的图像作为输入,即可使用3D空间工具进行标注,并利用遮挡等几何线索。为展示NeuralLabeling在机器人实际问题中的适用性,我们为洗碗机中玻璃杯的30000帧透明物体RGB图像和带噪深度图添加了深度真值,构建了Dishwasher30k数据集。实验表明,使用标注深度图进行监督训练的简单深度神经网络,其重建性能高于此前采用的弱监督方法。

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

  • paper_url: http://arxiv.org/abs/2309.11962
  • repo_url: https://github.com/tho-kn/Ego3DPose
  • paper_authors: Taeho Kang, Kyungjin Lee, Jinrui Zhang, Youngki Lee
  • methods: Two-path network architecture with binocular heatmaps and a perspective-aware representation using trigonometry
  • results: Outperforms state-of-the-art models by 23.1% in MPJPE reduction on the UnrealEgo dataset, with superior performance in challenging occlusion cases and on visible joint positions
    Abstract We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup offers practicality and usefulness in various applications, however, it remains largely under-explored. It has been suffering from low pose estimation accuracy due to viewing distortion, severe self-occlusion, and limited field-of-view of the joints in egocentric 2D images. Here, we notice that two important 3D cues, stereo correspondences, and perspective, contained in the egocentric binocular input are neglected. Current methods heavily rely on 2D image features, implicitly learning 3D information, which introduces biases towards commonly observed motions and leads to low overall accuracy. We observe that they not only fail in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates pose per limb independently with its binocular heatmaps. Without full-body information provided, it alleviates bias toward trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a significantly large-size hand when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models by a pose estimation error (i.e., MPJPE) reduction of 23.1% in the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.
    摘要 我们提出Ego3DPose,一个高精度的双目第一人称3D姿态重建系统。双目第一人称设置在各种应用中兼具实用性与有用性,但仍未被充分探索:由于视角畸变、严重的自遮挡以及第一人称2D图像中关节视野受限,其姿态估计精度一直较低。我们注意到,双目第一人称输入中包含的两个重要3D线索,即立体对应关系与透视信息,一直被忽视。现有方法严重依赖2D图像特征来隐式学习3D信息,这会引入对常见动作的偏置,导致整体精度较低;它们不仅在困难的遮挡情形下失败,对可见关节位置的估计也不准确。为解决这些挑战,我们提出两个新方法。首先,我们设计了双路径网络架构,其中一条路径利用双目热图独立估计每个肢体的姿态;在不提供全身信息的情况下,这缓解了对训练集全身分布的偏置。其次,我们利用第一人称视角下身体肢体表现出的强烈透视变化(例如手靠近相机时显得非常大),提出了一种基于三角函数的透视感知表示,使网络能够估计肢体的3D朝向。最后,我们构建了融合上述两种技术的端到端姿态重建网络。全面的评估表明,Ego3DPose在UnrealEgo数据集上将姿态估计误差(MPJPE)降低了23.1%,优于现有最先进模型;定性结果也显示了我们的方法在各种场景与挑战下的优越性。

A Study of Forward-Forward Algorithm for Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.11955
  • repo_url: None
  • paper_authors: Jonas Brenig, Radu Timofte
  • for: 本研究是 investigate the performance of forward-forward algorithm vs. backpropagation for self-supervised representation learning, and provide insights into the learned representation spaces.
  • methods: 本研究使用了四个标准数据集(MNIST、F-MNIST、SVHN和CIFAR-10)和三种常用的自监督表示学习技术(旋转、翻转和拼图)。
  • results: 研究发现,虽然forward-forward算法在(自)监督训练中与反向传播表现相当,但在所有研究设置中其迁移性能都明显落后。这可能由多个因素造成,包括每层使用独立的损失函数以及forward-forward范式中监督训练的实现方式。与反向传播相比,forward-forward算法更关注决策边界,并丢弃了一部分对决策不必要的信息,这不利于表示学习的目标。要使forward-forward策略在自监督学习中稳定,并推广到Geoffrey Hinton所展示的数据集和配置之外,还需要进一步的研究。
    Abstract Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance is significantly lagging behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning, to work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
    摘要 自顾的表示学习在最近几年内取得了非常出色的进步,其中一些最新的方法可以无需标签学习有用的图像表示。这些方法通常通过反射传播来训练网络。在这项研究中,我们第一次比较了反射传播和反射传播两种训练方法的性能,并对学习的表示空间提供了深入的启示。我们的基准使用了四个标准数据集,即MNIST、F-MNIST、SVHN和CIFAR-10,以及三种常用的自顾表示学习技术,即旋转、翻折和缝隙。我们的主要发现是,虽然反射传播在(自)超vised训练中和反射传播相当,但在所有研究的设置中,转移性能明显落后。这可能是由多种因素引起的,包括每层有自己的损失函数以及在反射传播中实现自顾训练的方式。与反射传播相比,反射传播更关注边界,抛弃一些无关于做出决策的信息,这会妨碍表示学习的目标。进一步的调查和研究是必要的,以稳定反射传播的自顾学习策略,并在不同的数据集和配置下进行研究。

Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.11933
  • repo_url: None
  • paper_authors: Ping Li, Yu Zhang, Li Yuan, Xianghua Xu
  • for: 本研究旨在提出一个完全基于Transformer构建的指代视频目标分割(RVOS)框架,以解决这一跨模态任务中的目标分割问题。
  • methods: 本研究使用 transformers 完全建立了一个 RVOS 框架,并将任务视为一个 mask sequence learning 问题,将所有在视频中的物件视为候选物件。
  • results: 验证研究表明,提案的方法在三个 benchmark 上表现出色,例如在 A2D Sentences 和 J-HMDB Sentences 上的 mAP 分别为 45.1% 和 38.7%,在 Ref-YouTube-VOS 上的 $\mathcal{J&F}$ 分别为 56.6%。相比最佳候选方法,提案方法在前两个 benchmark 上的 P$@$0.5 分别提高了 2.1% 和 3.2%,在 Ref-YouTube-VOS 上的 $\mathcal{J}$ 分别提高了 2.9%。
    Abstract Referring Video Object Segmentation (RVOS) requires segmenting the object in video referred by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle such cross-modal task, and do not explicitly model the object-level spatial context which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework completely built upon transformers, termed \textit{Fully Transformer-Equipped Architecture} (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in video as candidate objects. Given a video clip with a text query, the visual-textual features are yielded by encoder, while the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we have developed the Stacked Transformer, which individually characterizes the visual appearance of each candidate object, whose feature map is decoded to the binary mask sequence in order directly. Finally, the model finds the best matching between mask sequence and text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model for capturing more accurate mask of the referred object. Empirical studies have shown the superiority of the proposed method on three benchmarks, e.g., FETA achieves 45.1% and 38.7% in terms of mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively; it achieves 56.6% in terms of $\mathcal{J\&F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects). Particularly, compared to the best candidate method, it has a gain of 2.1% and 3.2% in terms of P$@$0.5 on the former two, respectively, while it has a gain of 2.9% in terms of $\mathcal{J}$ on the latter one.
    摘要 referring video object segmentation (RVOS)需要将视频中的对象分割成自然语言查询中引用的对象。现有方法主要基于复杂的管道来解决这种跨模态任务,而不直接模型对象水平的空间上下文,这将对于定位引用对象具有重要作用。因此,我们提出了一个 completelystructured upon transformers的框架,称为完全转换器装置架构(FTEA),它将RVOS任务视为一个mask sequence学习问题,并将所有视频中的对象视为候选对象。给定一个视频clip和文本查询,则可以通过encoder提取视觉和文本特征,并将它们在semantic similarity上对应。为了捕捉对象水平的空间上下文,我们开发了堆叠transformer,它可以在不同的对象水平上彩色化每个候选对象的视觉特征,并将其解码成直接对应的二进制掩码序列。最后,模型将找到与文本查询最佳匹配的mask sequence。此外,为了让模型生成更加准确的掩码,我们对模型征加多样性损失,以捕捉更多的对象特征。实验表明,我们的方法在三个标准测试集上表现出色,例如,FETA在A2D Sentences(3782个视频)和J-HMDB Sentences(928个视频)上达到了45.1%和38.7%的mAP,并在Ref-YouTube-VOS(3975个视频和7451个对象)上达到了56.6%的$\mathcal{J\&F}$。特别是,与最佳候选方法相比,FETA在前两个测试集上具有2.1%和3.2%的P$@$0.5提升,而在后一个测试集上具有2.9%的提升。

Bridging the Gap: Learning Pace Synchronization for Open-World Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.11930
  • repo_url: None
  • paper_authors: Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang
  • for: 这篇论文的目的是解决开放世界半监督学习中的新类发现问题,即使用无标签数据来增强模型对已知类的性能。
  • methods: 这篇论文提出了两种方法来解决这个问题:1)基于估计类分布的自适应边际损失,鼓励已见类样本具有较大的负边际,以同步已见类与新类的学习进度;2)伪标签对比聚类,在输出空间中将可能属于同一类的样本拉近,以增强新类发现。
  • results: 对多个数据集的广泛评估发现,现有模型仍然难以学习新类,而我们的方法能够兼顾已见类与新类,在ImageNet数据集上相比此前最佳方法取得了3%的平均准确率提升。此外,我们发现对自监督预训练骨干网络进行微调可以显著提升性能。
    Abstract In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data while maintaining performance on seen categories from labeled data. The central challenge is the substantial learning gap between seen and novel categories, as the model learns the former faster due to accurate supervisory information. To address this, we introduce 1) an adaptive margin loss based on estimated class distribution, which encourages a large negative margin for samples in seen classes, to synchronize learning paces, and 2) pseudo-label contrastive clustering, which pulls together samples which are likely from the same class in the output space, to enhance novel class discovery. Our extensive evaluations on multiple datasets demonstrate that existing models still hinder novel class learning, whereas our approach strikingly balances both seen and novel classes, achieving a remarkable 3% average accuracy increase on the ImageNet dataset compared to the prior state-of-the-art. Additionally, we find that fine-tuning the self-supervised pre-trained backbone significantly boosts performance over the default in prior literature. After our paper is accepted, we will release the code.
    摘要 在开放世界半supervised学习中,一个机器学习模型被要求发现未经标注的类,并保持已经标注的类的性能。中心挑战是seen和novel类之间的学习差距,因为模型通过准确的监督信息更快地学习seen类。为此,我们提出了两点解决方案:1)适应margin损失基于估计类分布,以便同步学习速度,和2) Pseudo-label对比聚合,以便增强novel类发现。我们在多个数据集进行了广泛的评估,发现现有模型仍然受限于novel类学习,而我们的方法能够平衡seen和novel类,在ImageNet数据集上实现了3%的平均准确率提升,相比之前的状态的艺术。此外,我们发现在先前的文献中 defaults 的自然语言预训练模型进行了显著的性能提升。接下来,我们将接受论文后,将代码发布。

Video Scene Location Recognition with Neural Networks

  • paper_url: http://arxiv.org/abs/2309.11928
  • repo_url: None
  • paper_authors: Lukáš Korel, Petr Pulc, Jiří Tumpach, Martin Holeňa
  • for: 这个论文探讨了基于视频序列的场景识别问题,使用人工神经网络来实现场景识别。
  • methods: 该方法选择每个场景中的一组帧,使用预训练的单图预处理卷积网络进行转换,并使用后续层的神经网络进行场景位置的分类。
  • results: 研究人员在使用不同层的神经网络进行组合,发现只有一些方法适用于这种任务。
    Abstract This paper provides an insight into the possibility of scene recognition from a video sequence with a small set of repeated shooting locations (such as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them by a pre-trained singleimage pre-processing convolutional network, and classify the scene location with subsequent layers of the neural network. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. We have investigated different neural network layers to combine individual frames, particularly AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We have observed that only some of the approaches are suitable for the task at hand.
    摘要 本文探讨了利用人工神经网络,从拍摄地点数量较少且重复出现的视频序列(例如电视剧)中识别场景位置的可能性。所提方法的基本思路是:从每个场景中选取一组帧,用预训练的单图像预处理卷积网络对其进行变换,再由神经网络的后续层对场景位置进行分类。我们在取自电视剧《生活大爆炸》的数据集上测试并比较了这些网络,研究了用于组合各帧的不同网络层,包括AveragePooling、MaxPooling、Product、Flatten、LSTM和双向LSTM层,并观察到只有其中部分方案适用于该任务。
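Each scene is represented by a handful of frames passed through a frozen, pretrained single-image CNN, and the per-frame features are then combined by a subsequent layer (the paper compares AveragePooling, MaxPooling, Product, Flatten, LSTM and bidirectional LSTM). A minimal PyTorch sketch of the LSTM variant is given below; the backbone, frame count and number of shooting locations are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

class SceneLocationLSTM(nn.Module):
    """Frames -> frozen pretrained CNN features -> LSTM -> scene-location logits."""
    def __init__(self, n_locations=10, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()                 # keep the 512-d pooled features
        for p in backbone.parameters():
            p.requires_grad = False
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_locations)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)              # last hidden state summarizes the scene
        return self.head(h_n[-1])

# logits = SceneLocationLSTM()(torch.randn(2, 8, 3, 224, 224))
```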

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.11923
  • repo_url: None
  • paper_authors: Xiaozhou You, Jian Zhang
  • for: 文章目的是提出一种基于文本的图像生成和修改方法,无需对抗训练。
  • methods: 方法利用 StyleGAN 的强大生成能力和 CLIP 的文本图像表示能力,通过特制的映射网络实现图像生成和修改。
  • results: 在 Multi-modal CelebA-HQ 数据集上进行了广泛的实验,表明我们的提出方法在图像生成和修改任务上具有优于现有方法的性能。
    Abstract Text-guided image generation aimed to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically edit parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, while struggling to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise corresponding to these two different tasks, and under the condition of the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images of up to $1024\times1024$ resolution that can currently be generated. Extensive experiments on the Multi-modal CelebA-HQ dataset have demonstrated that our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
    摘要 文本导向图像生成和修改旨在生成基于给定文本的所需图像,而文本导向图像修改则是基于指定文本进行Semantic的修改。为这两个相似任务,关键点是保持图像准确性和Semantic一致。许多前一代方法需要复杂的多阶段生成和对抗训练,而困难提供一个简单的框架 для这两个任务。在这项工作中,我们提出了TextCLIP,一个不需要对抗训练的简单框架,可以同时进行文本导向图像生成和修改。提案的方法接受图像或随机噪声作为输入,根据特定的文本来生成高分辨率图像(最大支持1024x1024)。广泛的实验表明,我们的提案方法在Multi-modal CelebA-HQ dataset上比前一代方法更高效,同时在文本导向生成和修改任务上都有优异表现。

Spatial-Temporal Transformer based Video Compression Framework

  • paper_url: http://arxiv.org/abs/2309.11913
  • repo_url: None
  • paper_authors: Yanbo Gao, Wenjia Huang, Shuai Li, Hui Yuan, Mao Ye, Siwei Ma
  • for: 提高learned video compression(LVC)的效率和稳定性。
  • methods: 提出了基于空间-时间Transformer的视频压缩框架,包括采用Uformer偏移估计的宽松可变形Transformer(RDT)用于运动估计与补偿、基于多参考帧的多粒度预测(MGP)模块用于细化预测,以及基于空间特征分布先验的Transformer(SFD-T)用于高效的时空联合残差压缩。
  • results: 与VTM比较,实现13.5%的BD率降低。
    Abstract Learned video compression (LVC) has witnessed remarkable advancements in recent years. Similar as the traditional video coding, LVC inherits motion estimation/compensation, residual coding and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and its training mechanism using gradient backpropagation, most existing works often struggle to consistently generate stable motion information, which is in the form of geometric features, from the input color features. Moreover, the modules such as the inter-prediction and residual coding are independent from each other, making it inefficient to fully reduce the spatial-temporal redundancy. To address the above problems, in this paper, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer based offsets estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between the similarity based geometric motion feature extraction and self-attention. MGP is designed to fuse the multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T is to compress the residual information by jointly exploring the spatial feature distributions in both residual and temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result with 13.5% BD-Rate saving over VTM.
    摘要 学习型视频压缩(LVC)近年来取得了显著进展。与传统视频编码类似,LVC继承了运动估计/补偿、残差编码等模块,并全部用神经网络实现。然而,在神经网络及其基于梯度反向传播的训练机制下,现有方法往往难以从输入的颜色特征中稳定地生成以几何特征形式存在的运动信息;同时,帧间预测与残差编码等模块彼此独立,难以充分消除时空冗余。为解决上述问题,本文提出了一种新的空间-时间Transformer视频压缩(STT-VC)框架。它包含:采用Uformer偏移估计的宽松可变形Transformer(RDT),用于运动估计与补偿;基于多参考帧的多粒度预测(MGP)模块,用于细化预测;以及基于空间特征分布先验的Transformer(SFD-T),用于高效的时空联合残差压缩。具体而言,RDT通过深入研究基于相似度的几何运动特征提取与自注意力之间的关系,来稳定地估计帧间运动信息;MGP利用编码后的运动信息生成的粗粒度预测特征,有效融合多参考帧信息;SFD-T则联合利用残差与时间预测中的空间特征分布,进一步降低时空冗余。实验结果表明,我们的方法相比VTM实现了13.5%的BD-Rate节省。

Heart Rate Detection Using an Event Camera

  • paper_url: http://arxiv.org/abs/2309.11891
  • repo_url: None
  • paper_authors: Aniket Jagtap, RamaKrishna Venkatesh Saripalli, Joe Lemley, Waseem Shariff, Alan F. Smeaton
  • for: 用于非侵入式心率监测
  • methods: 使用事件相机捕捉手腕区域皮肤表面因脉搏血流引起的细微变化,并与常规方法获得的心率真值进行对比评估
  • results: 在25名不同年龄与肤色参与者的数据上验证了非接触式心率测量的可行性;同时指出了光照引起的闪烁以及人体自然微颤等挑战与局限
    Abstract Event cameras, also known as neuromorphic cameras, are an emerging technology that offer advantages over traditional shutter and frame-based cameras, including high temporal resolution, low power consumption, and selective data acquisition. In this study, we propose to harnesses the capabilities of event-based cameras to capture subtle changes in the surface of the skin caused by the pulsatile flow of blood in the wrist region. We investigate whether an event camera could be used for continuous noninvasive monitoring of heart rate (HR). Event camera video data from 25 participants, comprising varying age groups and skin colours, was collected and analysed. Ground-truth HR measurements obtained using conventional methods were used to evaluate of the accuracy of automatic detection of HR from event camera data. Our experimental results and comparison to the performance of other non-contact HR measurement methods demonstrate the feasibility of using event cameras for pulse detection. We also acknowledge the challenges and limitations of our method, such as light-induced flickering and the sub-conscious but naturally-occurring tremors of an individual during data capture.
    摘要 事件摄像机(也称神经形态摄像机)是一种新兴技术,相比传统快门式和基于帧的摄像机具有高时间分辨率、低功耗和选择性数据采集等优点。在这项研究中,我们利用事件摄像机捕捉腕部皮肤表面因脉搏血流而产生的细微变化,并研究事件摄像机能否用于连续、非侵入式的心率(HR)监测。我们采集并分析了25名不同年龄和肤色参与者的事件摄像机视频数据,并以传统方法获得的心率真值来评估从事件数据中自动检测心率的准确性。实验结果以及与其他非接触式心率测量方法的比较表明,事件摄像机用于脉搏检测是可行的。我们也指出了该方法的挑战与局限,例如光照引起的闪烁以及数据采集过程中个体自然发生的无意识微颤。
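As a rough illustration of how a pulse could be read out of event data, the sketch below bins event timestamps from a wrist region into an event-rate signal and picks the dominant frequency in the physiological band. The bin width, band limits and synthetic data are assumptions made for illustration; the paper does not publish its exact pipeline.

```python
import numpy as np

def estimate_heart_rate(event_timestamps_us, duration_s, bin_ms=20.0,
                        hr_band_hz=(0.7, 3.0)):
    """Estimate heart rate (BPM) from event-camera timestamps in a wrist ROI.

    event_timestamps_us: 1-D array of event times in microseconds.
    The event count per time bin is used as a proxy for the pulsatile signal.
    """
    n_bins = int(duration_s * 1000.0 / bin_ms)
    counts, _ = np.histogram(event_timestamps_us / 1e6,
                             bins=n_bins, range=(0.0, duration_s))
    counts = counts - counts.mean()                # remove the DC component

    fs = 1000.0 / bin_ms                           # sampling rate of the binned signal
    spectrum = np.abs(np.fft.rfft(counts))
    freqs = np.fft.rfftfreq(len(counts), d=1.0 / fs)

    # Keep only frequencies in a plausible heart-rate band (42-180 BPM here).
    band = (freqs >= hr_band_hz[0]) & (freqs <= hr_band_hz[1])
    dominant_hz = freqs[band][np.argmax(spectrum[band])]
    return dominant_hz * 60.0                      # convert Hz to beats per minute

# Example: synthetic 72 BPM pulse observed for 10 seconds.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10e6, 20000))           # random event times (us)
pulse = np.sin(2 * np.pi * 1.2 * t / 1e6) > 0.6    # extra events on the pulse peaks
events = np.concatenate([t, t[pulse]])
print(round(estimate_heart_rate(events, 10.0)))    # about 72
```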

On-the-Fly SfM: What you capture is What you get

  • paper_url: http://arxiv.org/abs/2309.11883
  • repo_url: None
  • paper_authors: Zongqian Zhan, Rui Xia, Yifei Yu, Yibo Xu, Xin Wang
  • for: 实时Structure from motion(SfM),可以在摄像头捕捉图像的同时进行注册和三角坐标估算。
  • methods: 我们的方法包括:使用基于学习的全局特征、以无监督方式训练的词汇树进行快速图像检索;使用结合最小二乘(LSM)的匹配机制提升图像配准性能;以及使用高效的分层加权局部光束法平差(BA)进行优化。
  • results: 实验结果表明,on-the-fly SfM 能够在在线采集图像的同时,鲁棒地完成图像注册与三角化。
    Abstract Over the last decades, ample achievements have been made on Structure from motion (SfM). However, the vast majority of them basically work in an offline manner, i.e., images are firstly captured and then fed together into a SfM pipeline for obtaining poses and sparse point cloud. In this work, on the contrary, we present an on-the-fly SfM: running online SfM while image capturing, the newly taken On-the-Fly image is online estimated with the corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach firstly employs a vocabulary tree that is unsupervised trained using learning-based global features for fast image retrieval of newly fly-in image. Then, a robust feature matching mechanism with least squares (LSM) is presented to improve image registration performance. Finally, via investigating the influence of newly fly-in image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can meet the goal of robustly registering the images while capturing in an online way.
    摘要 过去几十年,运动恢复结构(SfM)取得了丰硕成果。然而,绝大多数方法以离线方式工作:先采集图像,再将其整体送入SfM流程以获得位姿和稀疏点云。与此相反,本文提出一种在线的 on-the-fly SfM:在采集图像的同时运行SfM,新拍摄的图像会被在线估计出相应的位姿和点云,即"所拍即所得"。具体而言,该方法首先利用基于学习的全局特征、以无监督方式训练的词汇树,对新进入的图像进行快速检索;然后提出一种结合最小二乘(LSM)的鲁棒特征匹配机制,以提升图像配准性能;最后,通过分析新进入图像与其相邻图像之间的关联,采用高效的分层加权局部光束法平差(BA)进行优化。大量实验结果表明,on-the-fly SfM 能够在在线采集的同时鲁棒地完成图像注册。
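The retrieval step can be pictured with a toy global-descriptor index: each registered image keeps one descriptor, and the newest frame queries for its nearest neighbours before local matching and bundle adjustment take over. The cosine-similarity index below is a simplified stand-in for the paper's unsupervised vocabulary tree; the class and function names are hypothetical.

```python
import numpy as np

class GlobalDescriptorIndex:
    """Toy stand-in for vocabulary-tree retrieval: store one global descriptor
    per registered image and return the most similar ones for a newly captured
    frame. The cosine-similarity search is an illustrative assumption, not the
    authors' implementation."""

    def __init__(self):
        self.ids, self.descs = [], []

    def add(self, image_id, descriptor):
        d = np.asarray(descriptor, dtype=np.float64)
        self.descs.append(d / (np.linalg.norm(d) + 1e-12))  # L2-normalise once
        self.ids.append(image_id)

    def query(self, descriptor, top_k=5):
        if not self.descs:
            return []
        q = np.asarray(descriptor, dtype=np.float64)
        q = q / (np.linalg.norm(q) + 1e-12)
        sims = np.stack(self.descs) @ q                      # cosine similarities
        order = np.argsort(-sims)[:top_k]
        return [(self.ids[i], float(sims[i])) for i in order]

# Usage: candidate matches for the newest frame feed the feature-matching stage.
index = GlobalDescriptorIndex()
rng = np.random.default_rng(0)
for i in range(100):
    index.add(f"frame_{i:04d}", rng.normal(size=256))
print(index.query(rng.normal(size=256), top_k=3))
```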

Using Saliency and Cropping to Improve Video Memorability

  • paper_url: http://arxiv.org/abs/2309.11881
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Vaibhav Mudgal, Qingyang Wang, Lorin Sweeney, Alan F. Smeaton
  • for: 提高视频记忆性,以便提高视频的分享、播放和讨论可能性。
  • methods: 通过基于图像显著性的选择性裁剪来提高视频记忆性。实验包括固定裁剪基线和动态裁剪,其中裁剪窗口的大小和位置随视频播放并跟随显著性变化。
  • results: Results indicate that especially for videos of low initial memorability, the memorability score can be improved.
    Abstract Video memorability is a measure of how likely a particular video is to be remembered by a viewer when that viewer has no emotional connection with the video content. It is an important characteristic as videos that are more memorable are more likely to be shared, viewed, and discussed. This paper presents results of a series of experiments where we improved the memorability of a video by selectively cropping frames based on image saliency. We present results of a basic fixed cropping as well as the results from dynamic cropping where both the size of the crop and the position of the crop within the frame, move as the video is played and saliency is tracked. Our results indicate that especially for videos of low initial memorability, the memorability score can be improved.
    摘要 视频记忆度衡量的是在观看者与视频内容没有情感联系的情况下,该视频被记住的可能性。这是一项重要特性,因为记忆度更高的视频更有可能被分享、观看和讨论。本文介绍了一系列实验:我们根据图像显著性对视频帧进行选择性裁剪,以提升视频的记忆度。我们给出了固定裁剪基线的结果,以及动态裁剪的结果(裁剪窗口的大小和位置随视频播放并跟随显著性而变化)。结果表明,特别是对初始记忆度较低的视频,其记忆度得分可以得到提升。
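A minimal sketch of the dynamic-cropping idea, assuming a per-frame saliency map is already available from some saliency model: the crop window is centred on the saliency peak and its position is smoothed over time so it drifts rather than jumps between frames. Window size, the momentum value and the peak-based centring are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def saliency_crop_window(saliency, crop_h, crop_w, prev_center=None, momentum=0.8):
    """Return (top, left) of a crop_h x crop_w window centred on the saliency peak.

    saliency: 2-D array, one value per pixel (from any saliency model).
    prev_center/momentum smooth the crop trajectory between frames.
    """
    h, w = saliency.shape
    peak = np.unravel_index(np.argmax(saliency), saliency.shape)
    center = np.array(peak, dtype=np.float64)
    if prev_center is not None:                      # temporal smoothing
        center = momentum * np.asarray(prev_center) + (1 - momentum) * center
    top = int(np.clip(center[0] - crop_h / 2, 0, h - crop_h))
    left = int(np.clip(center[1] - crop_w / 2, 0, w - crop_w))
    return (top, left), center

# Per-frame usage: crop = frame[top:top+crop_h, left:left+crop_w]
frame_saliency = np.zeros((360, 640)); frame_saliency[200, 500] = 1.0
(top, left), state = saliency_crop_window(frame_saliency, 224, 224)
print(top, left)   # window roughly centred on the salient point
```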

TCOVIS: Temporally Consistent Online Video Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.11857
  • repo_url: https://github.com/jun-long-li/tcovis
  • paper_authors: Junlong Li, Bingyao Yu, Yongming Rao, Jie Zhou, Jiwen Lu
  • for: 本文提出了一种新的在线视频实例分割方法(TCOVIS),用于解决视频实例分割任务中的时间一致性问题。
  • methods: TCOVIS 包括全局实例分配策略和时空增强模块,二者共同提升视频特征的时间一致性。
  • results: 在四个广泛采用的视频实例分割基准上(YouTube-VIS 2019/2021/2022 和 OVIS),TCOVIS 在不使用额外技巧的情况下均取得最佳性能。例如,在 YouTube-VIS 2021 上,TCOVIS 使用 ResNet-50 和 Swin-L 骨干网络分别获得 49.5 AP 和 61.3 AP。
    Abstract In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at https://github.com/jun-long-li/TCOVIS.
    摘要 近年来,视频实例分割(VIS)取得了显著进展,许多离线和在线方法都达到了最先进的性能。离线方法的优势在于能产生时间上一致的预测,但并不适用于实时场景;在线方法更为实用,但保持时间一致性仍然是一个难题。本文提出了一种新的在线视频实例分割方法 TCOVIS,它充分利用视频片段中的时间信息。该方法的核心包括全局实例分配策略和时空增强模块,从两个方面提升特征的时间一致性。具体来说,我们在整个视频片段上对预测与真值进行全局最优匹配,并以全局最优目标来监督模型;同时,我们捕捉空间特征并将其与帧间的语义特征进行聚合,从而实现时空增强。我们在四个广泛采用的 VIS 基准(YouTube-VIS 2019/2021/2022 和 OVIS)上评估了该方法,在不使用额外技巧的情况下于所有基准上取得最先进性能。例如,在 YouTube-VIS 2021 上,TCOVIS 使用 ResNet-50 和 Swin-L 骨干网络分别取得 49.5 AP 和 61.3 AP。代码可在 https://github.com/jun-long-li/TCOVIS 获取。
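The clip-level assignment idea can be sketched with the Hungarian algorithm: per-frame matching costs between predicted and ground-truth instance tracks are summed over the clip before solving, so each prediction is tied to one ground-truth track for the whole video rather than re-matched frame by frame. This is a simplified sketch of the global matching step only (the cost here is assumed to be 1 - IoU), not the TCOVIS training code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_instance_assignment(per_frame_costs):
    """Match predicted instance tracks to ground-truth tracks over a whole clip.

    per_frame_costs: list of (num_preds, num_gts) cost matrices, one per frame
    (e.g. 1 - mask IoU). Costs are summed over the clip before the Hungarian
    algorithm runs, so the assignment is globally optimal for the clip.
    """
    clip_cost = np.sum(np.stack(per_frame_costs), axis=0)
    pred_idx, gt_idx = linear_sum_assignment(clip_cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# Two frames, three predicted tracks, two ground-truth tracks.
frame1 = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]])
frame2 = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]])
print(global_instance_assignment([frame1, frame2]))   # [(0, 0), (1, 1)]
```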

DEYOv3: DETR with YOLO for Real-time Object Detection

  • paper_url: http://arxiv.org/abs/2309.11851
  • repo_url: None
  • paper_authors: Haodong Ouyang
  • for: 提出了一种新的分步训练方法,以提升实时目标检测器的性能并降低其训练成本。
  • methods: 采用分步训练:第一阶段使用一对多预训练的 YOLO 检测器来初始化端到端检测器;第二阶段骨干网络和编码器与 DETR 类模型保持一致,仅需从零训练检测头。
  • results: 提出了一种全新的实时目标检测模型 DEYOv3。DEYOv3-N 在 COCO val2017 上达到 41.1%,并在 T4 GPU 上达到 270 FPS;DEYOv3-L 达到 51.3% AP 和 102 FPS。此外,DEYOv3 无需额外训练数据,N、S、M 规模的模型仅需一块 24GB RTX3090 GPU 即可在 COCO 数据集上完成训练。
    Abstract Recently, end-to-end object detectors have gained significant attention from the research community due to their outstanding performance. However, DETR typically relies on supervised pretraining of the backbone on ImageNet, which limits the practical application of DETR and the design of the backbone, affecting the model's potential generalization ability. In this paper, we propose a new training method called step-by-step training. Specifically, in the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector. In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch. Due to this training method, the object detector does not need the additional dataset (ImageNet) to train the backbone, which makes the design of the backbone more flexible and dramatically reduces the training cost of the detector, which is helpful for the practical application of the object detector. At the same time, compared with the DETR-like model, the step-by-step training method can achieve higher accuracy than the traditional training method of the DETR-like model. With the aid of this novel training method, we propose a brand-new end-to-end real-time object detection model called DEYOv3. DEYOv3-N achieves 41.1% on COCO val2017 and 270 FPS on T4 GPU, while DEYOv3-L achieves 51.3% AP and 102 FPS. Without the use of additional training data, DEYOv3 surpasses all existing real-time object detectors in terms of both speed and accuracy. It is worth noting that for models of N, S, and M scales, the training on the COCO dataset can be completed using a single 24GB RTX3090 GPU. Code will be released at https://github.com/ouyanghaodong/DEYOv3.
    摘要 近来,端到端目标检测器因其出色的性能而受到研究社区的广泛关注。然而,DETR 通常依赖在 ImageNet 上对骨干网络进行有监督预训练,这限制了 DETR 的实际应用和骨干网络的设计,也影响了模型潜在的泛化能力。本文提出了一种名为分步训练(step-by-step training)的新训练方法:第一阶段,使用一对多预训练的 YOLO 检测器来初始化端到端检测器;第二阶段,骨干网络和编码器与 DETR 类模型保持一致,但只需从零训练检测器。由于这种训练方式,目标检测器不再需要额外的数据集(ImageNet)来训练骨干网络,使骨干网络的设计更加灵活,并显著降低了检测器的训练成本,有利于目标检测器的实际应用。同时,与 DETR 类模型的传统训练方法相比,分步训练方法能够获得更高的精度。借助这一新的训练方法,我们提出了一种全新的端到端实时目标检测模型 DEYOv3。DEYOv3-N 在 COCO val2017 上达到 41.1%,在 T4 GPU 上达到 270 FPS;DEYOv3-L 达到 51.3% AP 和 102 FPS。在不使用额外训练数据的情况下,DEYOv3 在速度和精度上均超越了现有的所有实时目标检测器。值得注意的是,对于 N、S、M 规模的模型,仅需一块 24GB RTX3090 GPU 即可在 COCO 数据集上完成训练。代码将发布在 https://github.com/ouyanghaodong/DEYOv3。
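A minimal sketch of the second training stage as described: the backbone and encoder keep their stage-one weights and are frozen, while only the detector head is optimised from scratch. The module names (`backbone`, `encoder`, `head`) and the optimiser choice are assumptions for illustration, not the DEYOv3 code.

```python
import torch
from torch import nn

def configure_stage_two(model: nn.Module, lr: float = 1e-4):
    """Freeze the stage-one backbone and encoder, train only the detector head."""
    for module in (model.backbone, model.encoder):
        for p in module.parameters():
            p.requires_grad_(False)          # frozen, no ImageNet pre-training needed
        module.eval()                        # also freeze normalisation statistics

    trainable = [p for p in model.head.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)

# Minimal dummy model just to show the call pattern.
class DummyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3)
        self.encoder = nn.Conv2d(16, 16, 3)
        self.head = nn.Conv2d(16, 4, 1)

optimizer = configure_stage_two(DummyDetector())
print(sum(p.numel() for g in optimizer.param_groups for p in g["params"]))  # head params only
```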

MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion

  • paper_url: http://arxiv.org/abs/2309.11847
  • repo_url: https://github.com/hedlen/meflut
  • paper_authors: Ting Jiang, Chuan Wang, Xinpeng Li, Ru Li, Haoqiang Fan, Shuaicheng Liu
  • for: 高品质多曝光图像融合 (MEF)
  • methods: 提出了一种新方法,将每个曝光的融合权重编码为一维查找表(LUT),并在帧、通道和空间等多个维度引入注意力机制,以实现高效且高质量的多曝光图像融合。
  • results: 与当前最佳方法(SOTA)相比,新方法在两个数据集上质量更高且效率更优,4K 图像的处理耗时不到 4ms。此外,该方法已被部署到多个品牌的数百万台 Android 手机上。
    Abstract In this paper, we introduce a new approach for high-quality multi-exposure image fusion (MEF). We show that the fusion weights of an exposure can be encoded into a 1D lookup table (LUT), which takes pixel intensity value as input and produces fusion weight as output. We learn one 1D LUT for each exposure, then all the pixels from different exposures can query 1D LUT of that exposure independently for high-quality and efficient fusion. Specifically, to learn these 1D LUTs, we involve attention mechanism in various dimensions including frame, channel and spatial ones into the MEF task so as to bring us significant quality improvement over the state-of-the-art (SOTA). In addition, we collect a new MEF dataset consisting of 960 samples, 155 of which are manually tuned by professionals as ground-truth for evaluation. Our network is trained by this dataset in an unsupervised manner. Extensive experiments are conducted to demonstrate the effectiveness of all the newly proposed components, and results show that our approach outperforms the SOTA in our and another representative dataset SICE, both qualitatively and quantitatively. Moreover, our 1D LUT approach takes less than 4ms to run a 4K image on a PC GPU. Given its high quality, efficiency and robustness, our method has been shipped into millions of Android mobiles across multiple brands world-wide. Code is available at: https://github.com/Hedlen/MEFLUT.
    摘要 在这篇论文中,我们介绍了一种新的高质量多曝光图像融合(MEF)方法。我们发现,一个曝光的融合权重可以被编码成一维查找表(1D LUT):该表以像素强度值为输入,输出对应的融合权重。我们为每个曝光学习一个1D LUT,之后来自不同曝光的所有像素都可以独立地查询对应曝光的1D LUT,从而实现高质量且高效的融合。具体来说,为了学习这些1D LUT,我们在MEF任务中于帧、通道和空间等多个维度引入注意力机制,相比现有最佳方法(SOTA)带来了显著的质量提升。此外,我们收集了一个新的MEF数据集,包含960个样本,其中155个由专业人员手动调整,作为评估的参考真值。我们的网络在该数据集上以无监督方式训练。大量实验验证了所有新提出组件的有效性,结果表明我们的方法在我们的数据集和另一个代表性数据集SICE上,无论定性还是定量都优于SOTA。此外,我们的1D LUT方法在PC GPU上处理一张4K图像耗时不到4毫秒。凭借其高质量、高效率和鲁棒性,该方法已被部署到全球多个品牌的数百万台Android手机上。代码见:https://github.com/Hedlen/MEFLUT。
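The inference side of the 1D-LUT idea is easy to sketch: each exposure owns a 256-entry table mapping pixel intensity to a fusion weight, weights are normalised per pixel across exposures, and the fused image is the weighted sum. The NumPy sketch below assumes grayscale inputs and hand-made LUT values purely for illustration; in the paper the LUT entries are learned.

```python
import numpy as np

def fuse_with_1d_luts(exposures, luts, eps=1e-6):
    """Fuse multi-exposure images with one 1-D LUT per exposure.

    exposures: list of HxW uint8 grayscale images (one per exposure).
    luts:      list of length-256 arrays; luts[k][v] is the fusion weight a
               pixel of intensity v gets in exposure k. Weights are normalised
               across exposures per pixel. Sketch of the inference step only.
    """
    weights = np.stack([lut[img] for img, lut in zip(exposures, luts)])  # (K, H, W)
    weights = weights / (weights.sum(axis=0, keepdims=True) + eps)
    stack = np.stack([img.astype(np.float64) for img in exposures])
    return (weights * stack).sum(axis=0)

# Toy example: a LUT that favours mid-tone pixels, so each exposure contributes
# most where it is well exposed.
rng = np.random.default_rng(0)
under = rng.integers(0, 80, size=(4, 4), dtype=np.uint8)
over = rng.integers(180, 255, size=(4, 4), dtype=np.uint8)
midtone_lut = np.exp(-((np.arange(256) - 128.0) / 64.0) ** 2)  # peak at 128
fused = fuse_with_1d_luts([under, over], [midtone_lut, midtone_lut])
print(fused.shape, fused.dtype)
```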

MoPA: Multi-Modal Prior Aided Domain Adaptation for 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.11839
  • repo_url: None
  • paper_authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Shenghai Yuan, Lihua Xie
  • for: 这项研究旨在提升3D语义分割中稀有类别的分割性能,且无需昂贵的逐点标注。
  • methods: 本研究提出多模态先验辅助(MoPA)域自适应方法,通过基于有效地面的插入(VGI)和SAM一致性损失来缓解自训练中的类别不平衡问题,并在模态间共享从各模态先验中学到的知识。
  • results: 实验结果显示,该方法在MM-UDA基准上超越了现有方法,并在稀有类别上取得更高的分割精度。
    Abstract Multi-modal unsupervised domain adaptation (MM-UDA) for 3D semantic segmentation is a practical solution to embed semantic understanding in autonomous systems without expensive point-wise annotations. While previous MM-UDA methods can achieve overall improvement, they suffer from significant class-imbalanced performance, restricting their adoption in real applications. This imbalanced performance is mainly caused by: 1) self-training with imbalanced data and 2) the lack of pixel-wise 2D supervision signals. In this work, we propose Multi-modal Prior Aided (MoPA) domain adaptation to improve the performance of rare objects. Specifically, we develop Valid Ground-based Insertion (VGI) to rectify the imbalance supervision signals by inserting prior rare objects collected from the wild while avoiding introducing artificial artifacts that lead to trivial solutions. Meanwhile, our SAM consistency loss leverages the 2D prior semantic masks from SAM as pixel-wise supervision signals to encourage consistent predictions for each object in the semantic mask. The knowledge learned from modal-specific prior is then shared across modalities to achieve better rare object segmentation. Extensive experiments show that our method achieves state-of-the-art performance on the challenging MM-UDA benchmark. Code will be available at https://github.com/AronCao49/MoPA.
    摘要 多模态无监督领域适应(MM-UDA)为3D语义分割提供了实用的解决方案,以嵌入自主系统中的语义理解无需昂贵的点级标注。先前的MM-UDA方法可以实现总体改进,但它们受到类别不均衡性的影响,导致其在实际应用中的采用有限。这种不均衡性主要来自于:1)自我训练偏斜数据和2)缺失像素级2D超参信号。在这种工作中,我们提出了多模态依据帮助(MoPA)领域适应,以改善罕见对象的性能。特别是,我们开发了有效的地面基础插入(VGI),以修正不均衡的指导信号,并避免引入人工 artifacts,以避免导致轻微解决方案。此外,我们的SAM一致性损失利用了2D先前语义masks从SAM中的像素级超参信号,以强制每个对象在semantic mask中的一致预测。知识从多模态依据中学习的知识然后被共享到多个模式,以实现更好的罕见对象分割。广泛的实验表明,我们的方法在复杂的MM-UDAbenchmark上实现了状态的最佳性能。代码将在https://github.com/AronCao49/MoPA上公开。

Automatic Endoscopic Ultrasound Station Recognition with Limited Data

  • paper_url: http://arxiv.org/abs/2309.11820
  • repo_url: https://github.com/amrita-medical-ai/eusml-labeller
  • paper_authors: Abhijit Ramesh, Anantha Nandanan, Nikhil Boggavarapu, Priya Nair MD, Gilad Gressel
  • for: 这项研究旨在借助人工智能帮助医生更高效地完成胰腺的内镜超声(EUS)检查,通过实时识别EUS"站位"(胰腺超声检查的不同解剖位置)来辅助培训与诊断。
  • methods: 研究基于深度学习,开发了一个可在EUS操作过程中实时识别站位的计算机辅助诊断(CAD)工具;同时提供了一个开源、易用的标注网页应用,使临床医生只需极少的工作量即可在检查过程中完成站位标注,并采用Grad-CAM提供可解释的可视化。
  • results: 仅使用43例检查数据、不进行任何超参数调优,即可达到90%的平衡准确率,与当前最先进水平相当;此外,可解释的可视化有助于医生理解模型的判断依据。
    Abstract Pancreatic cancer is a lethal form of cancer that significantly contributes to cancer-related deaths worldwide. Early detection is essential to improve patient prognosis and survival rates. Despite advances in medical imaging techniques, pancreatic cancer remains a challenging disease to detect. Endoscopic ultrasound (EUS) is the most effective diagnostic tool for detecting pancreatic cancer. However, it requires expert interpretation of complex ultrasound images to complete a reliable patient scan. To obtain complete imaging of the pancreas, practitioners must learn to guide the endoscope into multiple "EUS stations" (anatomical locations), which provide different views of the pancreas. This is a difficult skill to learn, involving over 225 proctored procedures with the support of an experienced doctor. We build an AI-assisted tool that utilizes deep learning techniques to identify these stations of the stomach in real time during EUS procedures. This computer-assisted diagnostic (CAD) will help train doctors more efficiently. Historically, the challenge faced in developing such a tool has been the amount of retrospective labeling required by trained clinicians. To solve this, we developed an open-source user-friendly labeling web app that streamlines the process of annotating stations during the EUS procedure with minimal effort from the clinicians. Our research shows that employing only 43 procedures with no hyperparameter fine-tuning obtained a balanced accuracy of 90%, comparable to the current state of the art. In addition, we employ Grad-CAM, a visualization technology that provides clinicians with interpretable and explainable visualizations.
    摘要 胰腺癌是一种致命的癌症,是全球癌症相关死亡的重要原因之一。早期发现对改善患者预后和生存率至关重要。尽管医学影像技术不断进步,胰腺癌的检出仍然困难。内镜超声(EUS)是检测胰腺癌最有效的诊断工具,但需要专家对复杂的超声图像进行解读才能完成可靠的检查。为了获得胰腺的完整成像,操作者必须学会将内镜引导至多个"EUS站位"(不同的解剖位置),以获得胰腺的不同视角。这一技能学习难度很大,通常需要在有经验医生指导下完成超过225例操作。我们构建了一个基于深度学习的AI辅助工具,可在EUS检查过程中实时识别这些站位,该计算机辅助诊断(CAD)工具有助于更高效地培训医生。以往开发此类工具的主要障碍在于需要临床专家进行大量的回顾性标注;为此,我们开发了一个开源、易用的标注网页应用,使医生能够在EUS检查过程中以极少的工作量完成站位标注。研究表明,仅使用43例检查数据且不进行超参数调优,即可获得90%的平衡准确率,与当前最先进水平相当。此外,我们采用Grad-CAM可视化技术,为临床医生提供可解释的可视化结果。
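Since the paper reports using Grad-CAM for explainable visualisations, a generic Grad-CAM sketch is shown below on a torchvision ResNet-18 (recent torchvision, random weights): gradients of the target class score with respect to a late convolution's activations are pooled into channel weights and combined into a heatmap. This is the standard technique in a generic setting, not the authors' EUS model or code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def grad_cam(model, target_layer, image, class_idx=None):
    """Minimal Grad-CAM: class-specific gradients of a conv layer's activations
    give a coarse heatmap of the evidence for that class."""
    acts = {}
    handle = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    model.eval()
    logits = model(image)                                    # (1, num_classes)
    handle.remove()
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    grads = torch.autograd.grad(logits[0, class_idx], acts["a"])[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)           # pool gradients per channel
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

model = resnet18(weights=None)
heatmap = grad_cam(model, model.layer4[-1].conv2, torch.randn(1, 3, 224, 224))
print(heatmap.shape)   # torch.Size([1, 1, 224, 224])
```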

FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.11804
  • repo_url: https://github.com/xaviergrool/fgfusion
  • paper_authors: Zixuan Yin, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
  • for: 这项研究旨在提升自动驾驶中的3D检测精度;相机和激光雷达是提供互补信息的重要传感器。
  • methods: 提出细粒度激光雷达-相机融合(FGFusion)方法,充分利用图像与点云的多尺度特征并以细粒度方式进行融合。首先,设计双路径层次结构,以提取图像的高层语义特征与低层细节特征;其次,引入辅助网络,引导点云特征更好地学习细粒度空间信息;最后,提出多尺度融合(MSF),融合图像与点云的最后N个特征图。
  • results: 实验结果显示,FGFusion方法在KITTI和Waymo两个主流自动驾驶基准上均有效。
    Abstract Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While most prevalent methods progressively downscale the 3D point clouds and camera images and then fuse the high-level features, the downscaled features inevitably lose low-level detailed information. In this paper, we propose Fine-Grained Lidar-Camera Fusion (FGFusion) that make full use of multi-scale features of image and point cloud and fuse them in a fine-grained way. First, we design a dual pathway hierarchy structure to extract both high-level semantic and low-level detailed features of the image. Second, an auxiliary network is introduced to guide point cloud features to better learn the fine-grained spatial information. Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of image and point cloud. Extensive experiments on two popular autonomous driving benchmarks, i.e. KITTI and Waymo, demonstrate the effectiveness of our method.
    摘要 激光雷达与相机是自动驾驶3D检测中提供互补信息的关键传感器。现有主流方法通常逐步对3D点云和相机图像进行下采样后再融合高层特征,下采样后的特征不可避免地丢失了低层细节信息。本文提出细粒度激光雷达-相机融合(FGFusion)方法,以充分利用图像与点云的多尺度特征并进行细粒度融合。首先,我们设计双路径层次结构,以提取图像的高层语义特征与低层细节特征;其次,引入辅助网络,引导点云特征更好地学习细粒度空间信息;最后,提出多尺度融合(MSF),融合图像与点云的最后N个特征图。在KITTI和Waymo两个主流自动驾驶基准上的大量实验证明了该方法的有效性。

A Real-Time Multi-Task Learning System for Joint Detection of Face, Facial Landmark and Head Pose

  • paper_url: http://arxiv.org/abs/2309.11773
  • repo_url: None
  • paper_authors: Qingtian Wu, Liming Zhang
  • for: 本研究旨在提出一种实时多任务检测系统,能同时检测面部、面部特征点和头部姿态。
  • methods: 该系统基于广泛采用的YOLOv8检测框架,在原有目标检测头的基础上增加了额外的特征点回归头,以高效定位面部关键点;此外,我们对原始YOLOv8框架中的多个模块进行了优化与改进。
  • results: 我们在 300W-LP 和 AFLW2000-3D 数据集上进行了大量实验,验证了所提模型在大角度头部姿态下的能力与实时性。结果表明,该模型能够有效应对大角度头部姿态带来的挑战,并在这些相互关联的任务上保持实时性能。
    Abstract Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent, where accurate FLD relies on robust face detection, and HPE is intricately associated with these key points. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is the proposal of a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. This system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head by incorporating additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we conduct optimizations and enhancements on various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on 300W-LP and AFLW2000-3D datasets. The results obtained verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
    摘要 极端头部姿态是人脸检测、面部关键点检测(FLD)和头部姿态估计(HPE)等一系列人脸分析任务共同面临的挑战。这些任务相互依赖:准确的FLD依赖于鲁棒的人脸检测,而HPE又与这些关键点密切相关。本文聚焦于这些任务的联合求解,特别是应对大角度头部姿态带来的复杂情况。本研究的主要贡献是提出一个实时多任务检测系统,能够同时完成人脸、面部关键点和头部姿态的联合检测。该系统基于广泛采用的YOLOv8检测框架,在原有目标检测头的基础上增加了额外的关键点回归头,以高效定位关键面部特征点;此外,我们对原始YOLOv8框架中的多个模块进行了优化与改进。为验证所提模型的有效性和实时性,我们在300W-LP和AFLW2000-3D数据集上进行了大量实验,结果证明该模型能够应对大角度头部姿态的挑战,并在这些相互关联的任务上保持实时性能。

Fast Satellite Tensorial Radiance Field for Multi-date Satellite Imagery of Large Size

  • paper_url: http://arxiv.org/abs/2309.11767
  • repo_url: None
  • paper_authors: Tongtong Zhang, Yuanxiang Li
  • for: 这篇论文的目的是对于卫星图像进行重建和新视角synthesis,并且解决了现有NeRF模型的速度问题、必要的太阳信息输入和实现大型卫星图像的局限性。
  • methods: 这篇论文使用多尺度张量分解(Multi-scale Tensor Decomposition, MTD)来建模颜色、体积密度和辅助变量,以刻画带镜面颜色的光场,并引入全变差损失将问题视为一个去噪任务,以缓解多日期影像之间的不一致。
  • results: 结果显示,SatensoRF在新视角合成上优于以往的Sat-NeRF系列,且训练所需参数更少,训练与推理速度更快,计算开销更低。
    Abstract Existing NeRF models for satellite images suffer from slow speeds, mandatory solar information as input, and limitations in handling large satellite images. In response, we present SatensoRF, which significantly accelerates the entire process while employing fewer parameters for satellite imagery of large size. Besides, we observed that the prevalent assumption of Lambertian surfaces in neural radiance fields falls short for vegetative and aquatic elements. In contrast to the traditional hierarchical MLP-based scene representation, we have chosen a multiscale tensor decomposition approach for color, volume density, and auxiliary variables to model the lightfield with specular color. Additionally, to rectify inconsistencies in multi-date imagery, we incorporate total variation loss to restore the density tensor field and treat the problem as a denosing task.To validate our approach, we conducted assessments of SatensoRF using subsets from the spacenet multi-view dataset, which includes both multi-date and single-date multi-view RGB images. Our results clearly demonstrate that SatensoRF surpasses the state-of-the-art Sat-NeRF series in terms of novel view synthesis performance. Significantly, SatensoRF requires fewer parameters for training, resulting in faster training and inference speeds and reduced computational demands.
    摘要 现有用于卫星图像的NeRF模型存在速度慢、必须输入太阳信息以及难以处理大尺寸卫星图像等问题。为此,我们提出了SatensoRF:它在大尺寸卫星影像上使用更少的参数,并显著加速了整个流程。此外,我们发现神经辐射场中普遍采用的朗伯表面假设并不适用于植被和水体等元素。与传统基于层次MLP的场景表示不同,我们选择多尺度张量分解来建模颜色、体积密度和辅助变量,从而刻画带镜面颜色的光场。另外,为校正多日期影像之间的不一致,我们引入全变差损失来恢复密度张量场,并将该问题视为一个去噪任务。为验证我们的方法,我们在SpaceNet多视角数据集的子集上对SatensoRF进行了评估,该数据集包含多日期和单日期的多视角RGB图像。结果清楚地表明,SatensoRF在新视角合成性能上超过了最先进的Sat-NeRF系列。更重要的是,SatensoRF训练所需参数更少,训练和推理速度更快,计算需求更低。
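The total-variation term used to denoise the density field is a standard penalty on neighbouring-cell differences; a minimal PyTorch version over a 3-D grid is sketched below. The tensor layout and weighting are assumptions, only the generic loss is shown.

```python
import torch

def total_variation_3d(density: torch.Tensor) -> torch.Tensor:
    """Anisotropic total-variation penalty on a 3-D density grid (X, Y, Z).

    Penalising differences between neighbouring cells smooths the recovered
    density field, which is how a TV term can act as a denoiser for
    inconsistencies across multi-date images.
    """
    dx = (density[1:, :, :] - density[:-1, :, :]).abs().mean()
    dy = (density[:, 1:, :] - density[:, :-1, :]).abs().mean()
    dz = (density[:, :, 1:] - density[:, :, :-1]).abs().mean()
    return dx + dy + dz

# The term is simply added to the reconstruction loss during training.
grid = torch.rand(32, 32, 16, requires_grad=True)
loss = total_variation_3d(grid)
loss.backward()
print(float(loss), grid.grad.shape)
```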

Dictionary Attack on IMU-based Gait Authentication

  • paper_url: http://arxiv.org/abs/2309.11766
  • repo_url: https://github.com/rajeshjnu2006/dictionaryattackonimugait
  • paper_authors: Rajesh Kumar, Can Isik, Chilukuri K. Mohan
  • for: 研究基于智能手机内置惯性测量单元(IMU)记录的步态模式认证系统的脆弱性,并针对此类系统发起字典攻击。
  • methods: 采集9名身体与人口学特征各异的参与者在速度、步长、步宽和抬腿高度四个可控、可调步态因素不同水平下行走得到的178个独特IMUGait模式,并用这些模式攻击多种用户认证模型。
  • results: 研究表明可以构建IMUGait模式字典并用其发起攻击,或找到能够主动复现与目标IMUGait模式相匹配的模仿者;攻击前后错误率的对比对"此类系统最难被欺骗"这一观点提出了质疑。
    Abstract We present a novel adversarial model for authentication systems that use gait patterns recorded by the inertial measurement unit (IMU) built into smartphones. The attack idea is inspired by and named after the concept of a dictionary attack on knowledge (PIN or password) based authentication systems. In particular, this work investigates whether it is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce IMUGait patterns that match the target's IMUGait pattern. Nine physically and demographically diverse individuals walked at various levels of four predefined controllable and adaptable gait factors (speed, step length, step width, and thigh-lift), producing 178 unique IMUGait patterns. Each pattern attacked a wide variety of user authentication models. The deeper analysis of error rates (before and after the attack) challenges the belief that authentication systems based on IMUGait patterns are the most difficult to spoof; further research is needed on adversarial models and associated countermeasures.
    摘要 我们针对利用智能手机内置惯性测量单元(IMU)记录步态模式的认证系统,提出了一种新的对抗模型。该攻击思路借鉴并得名于针对基于知识(PIN或口令)认证系统的字典攻击。具体而言,本工作研究是否可以构建一个IMUGait模式字典,并利用它发起攻击,或找到能够主动复现与目标IMUGait模式相匹配的模仿者。9名身体与人口学特征各异的参与者在速度、步长、步宽和抬腿高度这四个预定义的可控、可调步态因素的不同水平下行走,产生了178个独特的IMUGait模式;每个模式都被用来攻击多种用户认证模型。对攻击前后错误率的深入分析,对"基于IMUGait模式的认证系统最难被欺骗"这一观点提出了挑战;对抗模型及相应防御措施仍需进一步研究。
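The matching step of such a dictionary attack can be sketched as nearest-neighbour search: given feature vectors for the collected IMUGait patterns, find the one closest to the victim's enrolled template and check whether it would fall under the verifier's acceptance threshold. The feature representation and the Euclidean metric below are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def closest_dictionary_pattern(dictionary, target_template):
    """Find the dictionary IMUGait pattern most similar to a target template.

    dictionary:      (N, D) array, one feature vector per recorded gait pattern
                     (e.g. statistics of accelerometer/gyroscope windows).
    target_template: (D,) enrolled feature vector of the victim.
    Returns the index and distance of the best-matching pattern; if the
    distance falls under the verifier's acceptance threshold, that pattern is
    a candidate for the attack.
    """
    dists = np.linalg.norm(dictionary - target_template[None, :], axis=1)
    best = int(np.argmin(dists))
    return best, float(dists[best])

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(178, 24))          # 178 patterns, 24-D toy features
victim = dictionary[42] + 0.05 * rng.normal(size=24)
idx, dist = closest_dictionary_pattern(dictionary, victim)
print(idx, round(dist, 3))                       # 42, small distance
```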

SAM-OCTA: A Fine-Tuning Strategy for Applying Foundation Model to OCTA Image Segmentation Tasks

  • paper_url: http://arxiv.org/abs/2309.11758
  • repo_url: https://github.com/shellredia/sam-octa
  • paper_authors: Chengliang Wang, Xinrun Chen, Haojian Ning, Shiying Li
  • for: 这个论文主要是为了解决Optical coherence tomography angiography(OCTA)图像分割 зада务中的特定目标segmentation问题。
  • methods: 这个论文使用了low-rank adaptation技术和基于Foundation model的微调,并提出了相应的提示点生成策略来处理不同的分割任务。
  • results: 该方法在OCTA-500 dataset上进行了实验,并达到了当前最佳性能指标,同时也能够实现当地血管分 segmentation和有效的血管-血管分 segmentation,这些问题在之前的工作中尚未得到了好的解决。
    Abstract In the analysis of optical coherence tomography angiography (OCTA) images, the operation of segmenting specific targets is necessary. Existing methods typically train on supervised datasets with limited samples (approximately a few hundred), which can lead to overfitting. To address this, the low-rank adaptation technique is adopted for foundation model fine-tuning and proposed corresponding prompt point generation strategies to process various segmentation tasks on OCTA datasets. This method is named SAM-OCTA and has been experimented on the publicly available OCTA-500 dataset. While achieving state-of-the-art performance metrics, this method accomplishes local vessel segmentation as well as effective artery-vein segmentation, which was not well-solved in previous works. The code is available at: https://github.com/ShellRedia/SAM-OCTA.
    摘要 在光学相干断层扫描血管成像(OCTA)图像分析中,需要对特定目标进行分割。现有方法通常在样本有限(约数百个)的标注数据集上训练,容易导致过拟合。为此,我们采用低秩适应(LoRA)技术对基础模型进行微调,并提出了相应的提示点生成策略,以处理OCTA数据集上的各类分割任务。该方法被命名为SAM-OCTA,并在公开的OCTA-500数据集上进行了实验。它不仅取得了当前最佳的性能指标,还能有效完成局部血管分割以及动脉-静脉分割,这在以往工作中尚未得到很好的解决。代码见:https://github.com/ShellRedia/SAM-OCTA。
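Low-rank adaptation itself is easy to sketch: the pretrained linear layer stays frozen and only a pair of small rank-r matrices is trained on top of it. The PyTorch module below is a generic LoRA adapter, not the SAM-OCTA code; the layer names and hyperparameters are placeholders.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: y = Wx + (alpha/r) * B(Ax).

    Only the two small matrices A (r x in) and B (out x r) are trained, which
    is the kind of parameter-efficient fine-tuning applied to the foundation model.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)                # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(256, 256), rank=8)
out = layer(torch.randn(2, 256))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                               # only the LoRA weights train
```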

A Vision-Centric Approach for Static Map Element Annotation

  • paper_url: http://arxiv.org/abs/2309.11754
  • repo_url: https://github.com/manymuch/cama
  • paper_authors: Jiaxin Zhang, Shiyuan Chen, Haoran Yin, Ruohong Mei, Xuan Liu, Cong Yang, Qian Zhang, Wei Sui
  • for: 提供高质量的地图元素标注数据,帮助提高静止地图建模算法的准确率和一致性。
  • methods: 提出了一种视觉中心的方法,无需LiDAR输入可以生成高质量的3D地图元素标注。
  • results: 在主流的nuScenes数据集上,CAMA能够提供高效且准确的标注;与原始nuScenes静态地图元素相比,使用CAMA标注训练的模型获得更低的重投影误差(例如 4.73 vs. 8.03 像素)。
    Abstract The recent development of online static map element (a.k.a. HD Map) construction algorithms has raised a vast demand for data with ground truth annotations. However, available public datasets currently cannot provide high-quality training data regarding consistency and accuracy. To this end, we present CAMA: a vision-centric approach for Consistent and Accurate Map Annotation. Without LiDAR inputs, our proposed framework can still generate high-quality 3D annotations of static map elements. Specifically, the annotation can achieve high reprojection accuracy across all surrounding cameras and is spatial-temporal consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with the original nuScenes static map element, models trained with annotations from CAMA achieve lower reprojection errors (e.g., 4.73 vs. 8.03 pixels).
    摘要 近年来在线静态地图元素(即高精地图)构建算法的发展带来了对带真值标注数据的大量需求。然而,现有公开数据集在一致性和准确性方面尚无法提供高质量的训练数据。为此,我们提出了CAMA:一种以视觉为中心、实现一致且准确地图标注的方法。在不依赖激光雷达输入的情况下,该框架仍能生成高质量的静态地图元素3D标注。具体而言,标注在所有环视相机上都能达到较高的重投影精度,并且在整个序列上保持时空一致。我们将该框架应用于主流的nuScenes数据集,以提供高效且高度准确的标注。与原始nuScenes静态地图元素相比,使用CAMA标注训练的模型取得了更低的重投影误差(例如 4.73 vs. 8.03 像素)。
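The reprojection-accuracy numbers can be understood from a plain pinhole projection check: project the annotated 3-D point into each camera and measure the pixel distance to the 2-D observation, then average over all surrounding cameras. The sketch below assumes known intrinsics and extrinsics and no lens distortion.

```python
import numpy as np

def reprojection_error(point_3d, K, R, t, observed_px):
    """Pixel distance between a projected 3-D annotation and its 2-D observation.

    K: 3x3 intrinsics, R/t: world-to-camera rotation and translation.
    Averaging this error over all surrounding cameras gives the kind of
    reprojection-accuracy number quoted for the annotations (e.g. 4.73 px).
    """
    cam = R @ np.asarray(point_3d) + t           # world -> camera coordinates
    uvw = K @ cam
    projected = uvw[:2] / uvw[2]                 # perspective divide
    return float(np.linalg.norm(projected - np.asarray(observed_px)))

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])
err = reprojection_error([1.0, 0.5, 10.0], K, R, t, observed_px=[742.0, 408.0])
print(round(err, 2))   # distance in pixels from the annotated point
```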

PIE: Simulating Disease Progression via Progressive Image Editing

  • paper_url: http://arxiv.org/abs/2309.11745
  • repo_url: https://github.com/irohxu/pie
  • paper_authors: Kaizhao Liang, Xu Cao, Kuei-Da Liao, Tianren Gao, Wenqian Ye, Zhengyu Chen, Jianguo Cao, Tejas Nama, Jimeng Sun
  • for: 预测疾病进程和诊断支持
  • methods: 基于文本到图像生成模型的疾病进程模拟,并针对每位患者进行个性化
  • results: 在CLIP分数(真实性)和疾病分类置信度(一致性)上优于现有方法,76.2%的资深医生反馈认可所生成进程的保真度
    Abstract Disease progression simulation is a crucial area of research that has significant implications for clinical diagnosis, prognosis, and treatment. One major challenge in this field is the lack of continuous medical imaging monitoring of individual patients over time. To address this issue, we develop a novel framework termed Progressive Image Editing (PIE) that enables controlled manipulation of disease-related image features, facilitating precise and realistic disease progression simulation. Specifically, we leverage recent advancements in text-to-image generative models to simulate disease progression accurately and personalize it for each patient. We theoretically analyze the iterative refining process in our framework as a gradient descent with an exponentially decayed learning rate. To validate our framework, we conduct experiments in three medical imaging domains. Our results demonstrate the superiority of PIE over existing methods such as Stable Diffusion Walk and Style-Based Manifold Extrapolation based on CLIP score (Realism) and Disease Classification Confidence (Alignment). Our user study collected feedback from 35 veteran physicians to assess the generated progressions. Remarkably, 76.2% of the feedback agrees with the fidelity of the generated progressions. To our best knowledge, PIE is the first of its kind to generate disease progression images meeting real-world standards. It is a promising tool for medical research and clinical practice, potentially allowing healthcare providers to model disease trajectories over time, predict future treatment responses, and improve patient outcomes.
    摘要 疾病进程模拟是医学研究中一个关键领域,对临床诊断、预后和治疗具有重要意义。该领域的一大挑战是缺乏对个体患者随时间的连续医学影像监测。为了解决这个问题,我们开发了一个名为渐进式图像编辑(PIE)的新框架。PIE能够可控地操纵疾病相关的图像特征,从而实现精确且逼真的疾病进程模拟。具体来说,我们利用最新的文本到图像生成模型来准确模拟疾病进程,并针对每位患者进行个性化。我们从理论上将PIE的迭代细化过程分析为一个学习率呈指数衰减的梯度下降过程。为了验证PIE的有效性,我们在三个医学影像领域进行了实验。结果表明,PIE在CLIP分数(真实性)和疾病分类置信度(一致性)上优于Stable Diffusion Walk和Style-Based Manifold Extrapolation等现有方法。我们的用户研究收集了35名资深医生的反馈,其中76.2%的反馈认可所生成疾病进程的保真度。据我们所知,PIE是第一个能生成符合真实世界标准的疾病进程图像的方法。它是医学研究和临床实践中一个有前景的工具,有望帮助医疗人员对疾病随时间的发展轨迹建模、预测未来的治疗反应,并改善患者结局。
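The analysis of the editing loop as gradient descent with an exponentially decayed learning rate can be mirrored in a few lines: each step applies a progressively smaller update along an edit direction. The NumPy sketch below uses a toy linear "edit direction" as a placeholder for the text-conditioned diffusion update; the step size eta_t = eta0 * gamma^t is the only part taken from the stated analysis.

```python
import numpy as np

def progressive_edit(x0, edit_direction, steps=10, eta0=0.5, gamma=0.7):
    """Iterative refinement with an exponentially decayed step size.

    x0 and edit_direction stand in for the image latent and the per-step
    text-conditioned update; both are placeholders for illustration.
    """
    x = np.asarray(x0, dtype=np.float64).copy()
    trajectory = [x.copy()]
    for t in range(steps):
        eta_t = eta0 * (gamma ** t)              # exponentially decayed step size
        x = x + eta_t * edit_direction(x)
        trajectory.append(x.copy())
    return trajectory

# Toy target: move the state toward a fixed "diseased" prototype.
target = np.array([1.0, -2.0, 0.5])
traj = progressive_edit(np.zeros(3), lambda x: target - x, steps=8)
print(np.round(traj[-1], 3))                     # moves toward the prototype
```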

CPR-Coach: Recognizing Composite Error Actions based on Single-class Training

  • paper_url: http://arxiv.org/abs/2309.11718
  • repo_url: None
  • paper_authors: Shunli Wang, Qing Yu, Shuaibing Wang, Dingkang Yang, Liuzhen Su, Xiao Zhao, Haopeng Kuang, Peixuan Zhang, Peng Zhai, Lihua Zhang
  • for: 这篇论文旨在改进急救中的心肺复苏(CPR)技能评估,提出了一个基于视觉数据的系统来识别错误动作并评估CPR技能。
  • methods: 论文基于视觉数据定义了体外心脏按压过程中的13种单一错误动作和74种复合错误动作,构建了名为CPR-Coach的视频数据集,并在此基础上比较和探讨了现有动作识别模型在单类训练、多类测试设定下的表现。
  • results: 实验结果显示,受人类认知启发的ImagineNet框架可以在受限监督下增强模型的复合错误识别能力。
    Abstract The fine-grained medical action analysis task has received considerable attention from pattern recognition communities recently, but it faces the problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently, the assessment of CPR skills mainly depends on dummies and trainers, leading to high training costs and low efficiency. For the first time, this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically, we define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression and then develop a video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this paper thoroughly investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable Single-class Training & Multi-class Testing problem, we propose a humancognition-inspired framework named ImagineNet to improve the model's multierror recognition performance under restricted supervision. Extensive experiments verify the effectiveness of the framework. We hope this work could advance research toward fine-grained medical action analysis and skill assessment. The CPR-Coach dataset and the code of ImagineNet are publicly available on Github.
    摘要 细粒度医疗动作分析任务近来受到模式识别领域的广泛关注,但面临数据和算法不足的问题。心肺复苏(CPR)是急救中的一项基本技能,目前CPR技能评估主要依赖假人和培训师,导致培训成本高、效率低。本文首次构建了一个基于视觉的系统,用于完成CPR中的错误动作识别与技能评估。具体而言,我们定义了体外心脏按压过程中的13种单一错误动作和74种复合错误动作,并构建了名为CPR-Coach的视频数据集。以CPR-Coach为基准,本文系统地研究并比较了基于不同数据模态的现有动作识别模型的性能。为了解决不可避免的"单类训练、多类测试"问题,我们提出了一个受人类认知启发的框架ImagineNet,以在受限监督下提升模型的复合错误识别性能。大量实验验证了该框架的有效性。我们希望这项工作能够推动细粒度医疗动作分析与技能评估的研究。CPR-Coach数据集和ImagineNet代码已在Github公开。

Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

  • paper_url: http://arxiv.org/abs/2309.11715
  • repo_url: None
  • paper_authors: Xiao Feng Zhang, Tian Yi Song, Jia Wei Yao
  • for: Image shadow removal and image restoration.
  • methods: Deshadow-Anything, a diffusion model fine-tuned on large-scale datasets that diffuses along the edges and textures of an image to remove shadows while preserving image details, together with Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training of the diffusion model.
  • results: Effective improvement in image restoration performance on shadow removal tasks.
    Abstract Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, considering the generalization of large-scale datasets, and we performed Fine-tuning on large-scale datasets to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.
    摘要 Segment Anything(SAM)是一种在大规模视觉数据上训练的先进通用图像分割模型,为图像分割和计算机视觉树立了新的基准。然而,它在区分阴影与其背景时表现不佳。为此,我们在考虑大规模数据集泛化能力的基础上开发了Deshadow-Anything,并在大规模数据集上进行微调以实现图像阴影去除。其中的扩散模型可以沿图像的边缘和纹理进行扩散,在去除阴影的同时保留图像细节。此外,我们设计了多重自注意力引导(MSAG)和自适应输入扰动(DDPM-AIP)来加快扩散的迭代训练速度。阴影去除任务上的实验表明,这些方法能够有效提升图像修复性能。

MoDA: Leveraging Motion Priors from Videos for Advancing Unsupervised Domain Adaptation in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.11711
  • repo_url: None
  • paper_authors: Fei Pan, Xu Yin, Seokju Lee, Sungeui Yoon, In So Kweon
  • for: 这篇论文针对语义分割任务,提出了一种更贴近实际的无监督域自适应设定:目标域由易于采集的无标注视频序列帧组成。
  • methods: 论文利用自监督学习从无标注视频中学习物体运动,并以运动先验为引导,分别对前景和背景类别进行域对齐,从而在无标注目标域上完成语义分割的自适应。
  • results: 实验结果显示,该方法在多个基准上优于现有方法,并且可以与现有最先进方法结合使用,进一步提升性能。
    Abstract Unsupervised domain adaptation (UDA) is an effective approach to handle the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting where the target domain contains sequential frames of the unlabeled videos which are easy to collect in practice. A recent study suggests self-supervised learning of the object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA), that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment on the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking the instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA against existing approaches in the domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.
    摘要 无监督域自适应(UDA)是在目标域缺乏标注的情况下处理语义分割任务的一种有效方法。在本工作中,我们考虑一种更贴近实际的UDA设定:目标域由易于采集的无标注视频序列帧组成。近期研究表明,可以在几何约束下从无标注视频中自监督地学习物体运动。我们据此设计了一个运动引导的域自适应语义分割框架(MoDA),利用自监督的物体运动在目标域中学习有效表示。与以往对目标域帧使用时间一致性正则化的方法不同,MoDA采用不同的策略分别处理前景和背景类别的域对齐。具体而言,MoDA包含前景物体发现与前景语义挖掘,借助物体运动提供的实例级引导来缩小前景域差距;同时,MoDA还包含背景对抗训练,通过背景类别专用的判别器来处理背景域差距。在多个基准上的实验结果表明,MoDA在域自适应图像分割和域自适应视频分割上均优于现有方法。此外,MoDA具有通用性,可与现有最先进方法结合使用以进一步提升性能。

Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.11707
  • repo_url: None
  • paper_authors: Ping Li, Yu Zhang, Li Yuan, Huaxin Xiao, Binbin Lin, Xianghua Xu
  • for: 本研究探讨无监督视频目标分割(VOS)问题,旨在快速、高效地分割视频中的主要前景目标。
  • methods: 我们提出了一种高效的 Long-Short Temporal Attention 网络(简称 LSTA),它包括两个主要模块:长时记忆与短时注意力。前者捕捉过去帧与当前帧之间的长期全局像素关系,通过编码外观模式来建模持续存在的目标;后者揭示相邻帧与当前帧之间的短期局部像素关系,通过编码运动模式来建模运动目标。为了加速推理,两个轻量模块分别采用高效投影和基于局部性的滑动窗口,以实现近似线性的时间复杂度。
  • results: 我们在多个 benchmark 上进行了广泛的实验,并证明了提出的方法在高效性和性能方面具有惊人的表现。
    Abstract Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge. However, previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time. This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view. Specifically, LSTA consists of two dominant modules, i.e., Long Temporal Memory and Short Temporal Attention. The former captures the long-term global pixel relations of the past frames and the current frame, which models constantly present objects by encoding appearance pattern. Meanwhile, the latter reveals the short-term local pixel relations of one nearby frame and the current frame, which models moving objects by encoding motion pattern. To speedup the inference, the efficient projection and the locality-based sliding window are adopted to achieve nearly linear time complexity for the two light modules, respectively. Extensive empirical studies on several benchmarks have demonstrated promising performances of the proposed method with high efficiency.
    摘要 无监督视频目标分割(VOS)旨在不依赖任何先验知识的情况下识别视频中主要前景目标的轮廓。然而,以往的方法没有充分利用时空上下文,也难以实时完成这一具有挑战性的任务。为此,我们从整体视角出发,为无监督VOS任务设计了一个高效的长短时注意力网络(LSTA)。具体而言,LSTA由两个主要模块组成:长时记忆和短时注意力。前者捕捉过去帧与当前帧之间的长期全局像素关系,通过编码外观模式来建模持续存在的目标;后者揭示相邻帧与当前帧之间的短期局部像素关系,通过编码运动模式来建模运动目标。为了加速推理,两个轻量模块分别采用高效投影和基于局部性的滑动窗口,以实现近似线性的时间复杂度。在多个基准上的大量实验表明,所提方法在保持高效率的同时取得了可观的性能。

Meta OOD Learning for Continuously Adaptive OOD Detection

  • paper_url: http://arxiv.org/abs/2309.11705
  • repo_url: None
  • paper_authors: Xinheng Wu, Jie Lu, Zhen Fang, Guangquan Zhang
  • for: 这项研究旨在提出一种能够可靠检测深度学习模型分布外(out-of-distribution,OOD)样本的方法,并使其能够适应真实世界中不断变化和漂移的分布。
  • methods: 研究提出了"持续自适应分布外检测"(continuously adaptive out-of-distribution,CAOOD)这一设定,并提出元分布外学习(meta out-of-distribution learning,MOL)来求解CAOOD。MOL通过"学习如何适应"的训练流程,在训练阶段得到一个良好初始化的OOD检测模型,使其在测试阶段只需少量适应步骤即可快速适应新分布。
  • results: 实验结果显示,MOL能够在分布不断变化的情况下同时保持ID分类精度和OOD检测性能,为实际应用提供了更高的可靠性。
    Abstract Out-of-distribution (OOD) detection is crucial to modern deep learning applications by identifying and alerting about the OOD samples that should not be tested or used for making predictions. Current OOD detection methods have made significant progress when in-distribution (ID) and OOD samples are drawn from static distributions. However, this can be unrealistic when applied to real-world systems which often undergo continuous variations and shifts in ID and OOD distributions over time. Therefore, for an effective application in real-world systems, the development of OOD detection methods that can adapt to these dynamic and evolving distributions is essential. In this paper, we propose a novel and more realistic setting called continuously adaptive out-of-distribution (CAOOD) detection which targets on developing an OOD detection model that enables dynamic and quick adaptation to a new arriving distribution, with insufficient ID samples during deployment time. To address CAOOD, we develop meta OOD learning (MOL) by designing a learning-to-adapt diagram such that a good initialized OOD detection model is learned during the training process. In the testing process, MOL ensures OOD detection performance over shifting distributions by quickly adapting to new distributions with a few adaptations. Extensive experiments on several OOD benchmarks endorse the effectiveness of our method in preserving both ID classification accuracy and OOD detection performance on continuously shifting distributions.
    摘要 分布外(OOD)检测对现代深度学习应用至关重要:它能够识别并警示那些不应被用于预测的OOD样本。当分布内(ID)样本与OOD样本均来自静态分布时,现有OOD检测方法已取得显著进展;然而,现实系统中ID与OOD分布往往随时间持续变化和漂移,静态假设并不现实。因此,要在现实系统中有效应用,必须发展能够适应这种动态演化分布的OOD检测方法。在本文中,我们提出了一种更贴近实际的新设定,称为持续自适应分布外(CAOOD)检测,其目标是在部署阶段ID样本不足的情况下,使OOD检测模型能够动态、快速地适应新到来的分布。针对CAOOD,我们通过设计"学习如何适应"的流程,提出了元OOD学习(MOL):在训练过程中学到一个良好初始化的OOD检测模型;在测试过程中,MOL仅需少量适应步骤即可快速适应新分布,从而在分布漂移下保证OOD检测性能。在多个OOD基准上的大量实验证明,我们的方法能够在分布持续变化的情况下同时保持ID分类精度与OOD检测性能。