cs.CV - 2023-09-15

EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding

  • paper_url: http://arxiv.org/abs/2309.08816
  • repo_url: https://github.com/facebookresearch/egoobjects
  • paper_authors: Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang Culatana, Roshan Sumbaly, Zhicheng Yan
  • for: This work studies object understanding from egocentric visual data, a fundamental problem in egocentric vision research.
  • methods: The work introduces EgoObjects, a large-scale egocentric dataset containing 9K videos recorded by 250 participants from 50+ countries with 4 wearable devices, together with 650K object annotations spanning 368 object categories.
  • results: EgoObjects annotates each object with an instance-level identifier and contains over 14K unique object instances. It also captures the same object under varying background complexity, surrounding objects, distance, lighting, and camera motion. To bootstrap research on EgoObjects, the paper presents 4 benchmark tasks, including a novel instance-level object detection task and the classical category-level object detection task, plus 2 new continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects.
    Abstract Object understanding in egocentric visual data is arguably a fundamental research topic in egocentric vision. However, existing object datasets are either non-egocentric or have limitations in object categories, visual content, and annotation granularities. In this work, we introduce EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. Its Pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. Unlike prior datasets containing only object category labels, EgoObjects also annotates each object with an instance-level identifier, and includes over 14K unique object instances. EgoObjects was designed to capture the same object under diverse background complexities, surrounding objects, distance, lighting and camera motion. In parallel to the data collection, we conducted data annotation by developing a multi-stage federated annotation process to accommodate the growing nature of the dataset. To bootstrap the research on EgoObjects, we present a suite of 4 benchmark tasks around the egocentric object understanding, including a novel instance level- and the classical category level object detection. Moreover, we also introduce 2 novel continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects.

The Use of Multi-Scale Fiducial Markers To Aid Takeoff and Landing Navigation by Rotorcraft

  • paper_url: http://arxiv.org/abs/2309.08769
  • repo_url: None
  • paper_authors: Jongwon Lee, Su Yeon Choi, Timothy Bretl
  • for: This paper is written to quantify the impact of adverse environmental conditions on the detection of fiducial markers by color cameras mounted on rotorcraft.
  • methods: The paper uses image sequences collected outdoors with cameras mounted on a quadrotor during semi-autonomous takeoff and landing operations under adverse environmental conditions.
  • results: The paper evaluates the performance of the marker detection system using various performance measures such as precision, recall, continuity, availability, robustness, resiliency, and coverage volume.
    Abstract This paper quantifies the impact of adverse environmental conditions on the detection of fiducial markers (i.e., artificial landmarks) by color cameras mounted on rotorcraft. We restrict our attention to square markers with a black-and-white pattern of grid cells that can be nested to allow detection at multiple scales. These markers have the potential to enhance the reliability of precision takeoff and landing at vertiports by flying vehicles in urban settings. Prior work has shown, in particular, that these markers can be detected with high precision (i.e., few false positives) and high recall (i.e., few false negatives). However, most of this prior work has been based on image sequences collected indoors with hand-held cameras. Our work is based on image sequences collected outdoors with cameras mounted on a quadrotor during semi-autonomous takeoff and landing operations under adverse environmental conditions that include variations in temperature, illumination, wind speed, humidity, visibility, and precipitation. In addition to precision and recall, performance measures include continuity, availability, robustness, resiliency, and coverage volume. We release both our dataset and the code we used for analysis to the public as open source.

Biased Attention: Do Vision Transformers Amplify Gender Bias More than Convolutional Neural Networks?

  • paper_url: http://arxiv.org/abs/2309.08760
  • repo_url: https://github.com/aibhishek/Biased-Attention
  • paper_authors: Abhishek Mandal, Susan Leavy, Suzanne Little
  • for: This paper aims to evaluate the potential for bias amplification in vision transformers (ViTs) and compare them to convolutional neural networks (CNNs) in the context of large multimodal models.
  • methods: The authors introduce a novel metric called Accuracy Difference to measure bias in architectures and use it to evaluate the performance of CNNs and ViTs in image classification tasks.
  • results: The results show that ViTs amplify gender bias to a greater extent than CNNs, highlighting the importance of considering the potential for bias amplification in the design and deployment of multimodal models.
    Abstract Deep neural networks used in computer vision have been shown to exhibit many social biases such as gender bias. Vision Transformers (ViTs) have become increasingly popular in computer vision applications, outperforming Convolutional Neural Networks (CNNs) in many tasks such as image classification. However, given that research on mitigating bias in computer vision has primarily focused on CNNs, it is important to evaluate the effect of a different network architecture on the potential for bias amplification. In this paper we therefore introduce a novel metric to measure bias in architectures, Accuracy Difference. We examine bias amplification when models belonging to these two architectures are used as a part of large multimodal models, evaluating the different image encoders of Contrastive Language Image Pretraining which is an important model used in many generative models such as DALL-E and Stable Diffusion. Our experiments demonstrate that architecture can play a role in amplifying social biases due to the different techniques employed by the models for feature extraction and embedding as well as their different learning properties. This research found that ViTs amplified gender bias to a greater extent than CNNs
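The abstract does not spell out the exact formula of the proposed Accuracy Difference metric; the sketch below assumes it is the gap in downstream classification accuracy between two gender subgroups when the same frozen image encoder is used, which is one plausible reading. Function and variable names are illustrative.

```python
import numpy as np

def accuracy_difference(y_true, y_pred, group):
    """Accuracy gap between two demographic subgroups (assumed binary groups 0/1).

    A value far from 0 indicates the encoder serves one subgroup better than
    the other, i.e., a possible amplification of social bias.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    accs = []
    for g in (0, 1):
        mask = group == g
        accs.append((y_pred[mask] == y_true[mask]).mean())
    return accs[0] - accs[1]

# Toy usage with predictions from a classifier built on a frozen ViT or CNN encoder.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(accuracy_difference(y_true, y_pred, group))
```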

Unified Brain MR-Ultrasound Synthesis using Multi-Modal Hierarchical Representations

  • paper_url: http://arxiv.org/abs/2309.08747
  • repo_url: None
  • paper_authors: Reuben Dorent, Nazim Haouchine, Fryderyk Kögl, Samuel Joutard, Parikshit Juvekar, Erickson Torio, Alexandra Golby, Sebastien Ourselin, Sarah Frisken, Tom Vercauteren, Tina Kapur, William M. Wells
  • for: This paper aims to synthesize missing images from various modalities, using a deep hierarchical variational autoencoder (VAE) with a probabilistic formulation for fusing multi-modal images in a common latent representation.
  • methods: The method extends multi-modal VAEs with a hierarchical latent structure and combines adversarial learning with a principled probabilistic fusion operation.
  • results: Experiments show that the model synthesizes missing images better than multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT), demonstrating the advantage of a hierarchical latent representation and a principled probabilistic fusion operation.
    Abstract We introduce MHVAE, a deep hierarchical variational auto-encoder (VAE) that synthesizes missing images from various modalities. Extending multi-modal VAEs with a hierarchical latent structure, we introduce a probabilistic formulation for fusing multi-modal images in a common latent representation while having the flexibility to handle incomplete image sets as input. Moreover, adversarial learning is employed to generate sharper images. Extensive experiments are performed on the challenging problem of joint intra-operative ultrasound (iUS) and Magnetic Resonance (MR) synthesis. Our model outperformed multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT) for synthesizing missing images, demonstrating the advantage of using a hierarchical latent representation and a principled probabilistic fusion operation. Our code is publicly available \url{https://github.com/ReubenDo/MHVAE}.
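The abstract mentions a principled probabilistic fusion of modality posteriors in a common latent space but does not give the operator. A common choice in multi-modal VAEs is a product of Gaussian experts, sketched below as an assumption rather than the paper's exact formulation; missing modalities are handled simply by omitting their experts.

```python
import torch

def product_of_gaussian_experts(mus, logvars, prior_var=1.0):
    """Fuse per-modality Gaussian posteriors q(z|x_m) into a single Gaussian.

    mus, logvars: lists of (batch, latent_dim) tensors, one per available
    modality. A unit-variance prior expert keeps the product well defined
    even when a modality (e.g., iUS or MR) is missing.
    """
    precisions = [1.0 / prior_var * torch.ones_like(mus[0])]  # prior expert
    weighted_mus = [torch.zeros_like(mus[0])]                 # prior mean = 0
    for mu, logvar in zip(mus, logvars):
        prec = torch.exp(-logvar)            # 1 / sigma^2
        precisions.append(prec)
        weighted_mus.append(mu * prec)
    prec_sum = torch.stack(precisions).sum(0)
    fused_var = 1.0 / prec_sum
    fused_mu = torch.stack(weighted_mus).sum(0) * fused_var
    return fused_mu, torch.log(fused_var)

# Example: fuse MR and iUS posteriors; drop a list entry to simulate a missing modality.
mu_mr, lv_mr = torch.randn(2, 16), torch.zeros(2, 16)
mu_us, lv_us = torch.randn(2, 16), torch.zeros(2, 16)
mu, logvar = product_of_gaussian_experts([mu_mr, mu_us], [lv_mr, lv_us])
```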

Improved Breast Cancer Diagnosis through Transfer Learning on Hematoxylin and Eosin Stained Histology Images

  • paper_url: http://arxiv.org/abs/2309.08745
  • repo_url: None
  • paper_authors: Fahad Ahmed, Reem Abdel-Salam, Leon Hamnett, Mary Adewunmi, Temitope Ayano
  • for: The paper aims to classify breast cancer tumors into seven subtypes using deep learning models on histological images.
  • methods: The authors use pre-trained deep learning models such as Xception, EfficientNet, ResNet50, and InceptionResNet, and pre-process the BRACS ROI images with image augmentation, upsampling, and dataset split strategies.
  • results: The best results were obtained by ResNet50, achieving a 66% f1-score for the default dataset split and a 96.2% f1-score for the custom dataset split, with a significant reduction in false positive and false negative classifications.
    Abstract Breast cancer is one of the leading causes of death for women worldwide. Early screening is essential for early identification, but the chance of survival declines as the cancer progresses into advanced stages. For this study, the most recent BRACS dataset of histological (H\&E) stained images was used to classify breast cancer tumours, which contains both the whole-slide images (WSI) and region-of-interest (ROI) images, however, for our study we have considered ROI images. We have experimented using different pre-trained deep learning models, such as Xception, EfficientNet, ResNet50, and InceptionResNet, pre-trained on the ImageNet weights. We pre-processed the BRACS ROI along with image augmentation, upsampling, and dataset split strategies. For the default dataset split, the best results were obtained by ResNet50 achieving 66\% f1-score. For the custom dataset split, the best results were obtained by performing upsampling and image augmentation which results in 96.2\% f1-score. Our second approach also reduced the number of false positive and false negative classifications to less than 3\% for each class. We believe that our study significantly impacts the early diagnosis and identification of breast cancer tumors and their subtypes, especially atypical and malignant tumors, thus improving patient outcomes and reducing patient mortality rates. Overall, this study has primarily focused on identifying seven (7) breast cancer tumor subtypes, and we believe that the experimental models can be fine-tuned further to generalize over previous breast cancer histology datasets as well.
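A minimal sketch of the transfer-learning setup described (an ImageNet-pretrained ResNet50 fine-tuned on the 7 BRACS subtypes with augmentation), using torchvision. The specific augmentations, learning rate, and normalization are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 7  # BRACS tumour subtypes

# Augmentations roughly in the spirit of the paper; exact settings are assumptions.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# ImageNet-pretrained backbone with a new 7-way classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```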

Personalized Food Image Classification: Benchmark Datasets and New Baseline

  • paper_url: http://arxiv.org/abs/2309.08744
  • repo_url: None
  • paper_authors: Xinyue Pan, Jiangpeng He, Fengqing Zhu
  • for: This work aims to propose a personalized food classification method to enable automated nutrient analysis from food images.
  • methods: The framework combines deep neural networks with self-supervised learning, exploiting temporal image feature information to improve the accuracy of personalized food classification.
  • results: Evaluated on two personalized food classification benchmark datasets, the method shows improved performance over existing approaches. The datasets are available at: https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html
    Abstract Food image classification is a fundamental step of image-based dietary assessment, enabling automated nutrient analysis from food images. Many current methods employ deep neural networks to train on generic food image datasets that do not reflect the dynamism of real-life food consumption patterns, in which food images appear sequentially over time, reflecting the progression of what an individual consumes. Personalized food classification aims to address this problem by training a deep neural network using food images that reflect the consumption pattern of each individual. However, this problem is under-explored and there is a lack of benchmark datasets with individualized food consumption patterns due to the difficulty in data collection. In this work, we first introduce two benchmark personalized datasets including the Food101-Personal, which is created based on surveys of daily dietary patterns from participants in the real world, and the VFNPersonal, which is developed based on a dietary study. In addition, we propose a new framework for personalized food image classification by leveraging self-supervised learning and temporal image feature information. Our method is evaluated on both benchmark datasets and shows improved performance compared to existing works. The dataset has been made available at: https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html

Active Learning for Fine-Grained Sketch-Based Image Retrieval

  • paper_url: http://arxiv.org/abs/2309.08743
  • repo_url: None
  • paper_authors: Himanshu Thakur, Soumitri Chattopadhyay
  • for: To improve the practical adoption and scalability of fine-grained sketch-based image retrieval (FG-SBIR) without requiring large numbers of faithful sketches.
  • methods: A novel active learning sample selection technique that exploits the relationship between existing photo-sketch pairs and their intermediate representations to reduce the sketching effort, balancing uncertainty and diversity.
  • results: Experiments on the two publicly available fine-grained SBIR datasets ChairV2 and ShoeV2 demonstrate the superiority of the proposed method over adapted baselines.
    Abstract The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of Fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is Active Learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it for reducing sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need for drawing photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between the existing photo-sketch pair to a photo that does not have its sketch and augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic of the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. With experimentation over two publicly available fine-grained SBIR datasets ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

Concept explainability for plant diseases classification

  • paper_url: http://arxiv.org/abs/2309.08739
  • repo_url: None
  • paper_authors: Jihen Amara, Birgitta König-Ries, Sheeba Samuel
  • for: This paper aims to improve the accuracy and interpretability of plant disease identification, in support of agricultural productivity and food security.
  • methods: The study classifies plant disease types with deep convolutional neural networks (CNNs) and applies Testing with Concept Activation Vectors (TCAV), a method that explains the decision process of deep learning models in terms of user-defined concepts.
  • results: Concepts such as color, texture, and disease-related concepts were analyzed, and the results suggest that concept-based explanation methods can significantly benefit automated plant disease identification by making model decisions more interpretable.
    Abstract Plant diseases remain a considerable threat to food security and agricultural sustainability. Rapid and early identification of these diseases has become a significant concern motivating several studies to rely on the increasing global digitalization and the recent advances in computer vision based on deep learning. In fact, plant disease classification based on deep convolutional neural networks has shown impressive performance. However, these methods have yet to be adopted globally due to concerns regarding their robustness, transparency, and the lack of explainability compared with their human experts counterparts. Methods such as saliency-based approaches associating the network output to perturbations of the input pixels have been proposed to give insights into these algorithms. Still, they are not easily comprehensible and not intuitive for human users and are threatened by bias. In this work, we deploy a method called Testing with Concept Activation Vectors (TCAV) that shifts the focus from pixels to user-defined concepts. To the best of our knowledge, our paper is the first to employ this method in the field of plant disease classification. Important concepts such as color, texture and disease related concepts were analyzed. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification.
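A short sketch of the core TCAV computation the paper relies on: fit a linear classifier that separates layer activations of concept examples (e.g., a colour or texture concept) from random examples, take its normal as the concept activation vector (CAV), and score how often the directional derivative of the target-class logit along the CAV is positive. Helper names and dimensions are illustrative; in practice the activations and gradients come from a chosen layer of the plant-disease CNN.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """CAV = normal of a linear boundary between concept and random activations."""
    X = np.vstack([concept_acts, random_acts])
    y = np.hstack([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(grads, cav):
    """Fraction of class examples whose logit increases along the concept direction.

    grads: gradients of the target-class logit w.r.t. the layer activations,
    one row per image of the class (e.g., a given plant disease).
    """
    directional_derivatives = grads @ cav
    return float((directional_derivatives > 0).mean())

# Usage with stand-in activations/gradients of dimension 512.
concept_acts = np.random.randn(50, 512)   # e.g., patches of a "yellowing" concept
random_acts = np.random.randn(50, 512)
class_grads = np.random.randn(200, 512)
cav = compute_cav(concept_acts, random_acts)
print("TCAV score:", tcav_score(class_grads, cav))
```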

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

  • paper_url: http://arxiv.org/abs/2309.08738
  • repo_url: None
  • paper_authors: Xingjian Diao, Ming Cheng, Shitong Cheng
  • for: Learning high-quality video representations has broad applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of learning representations through a reconstruction strategy in the visual modality, but these models have inherent limitations when features must be extracted solely from the visual modality, e.g., for low-resolution and blurry original videos.
  • methods: The authors propose AV-MaskEnhancer, which combines visual and audio information to learn high-quality video representations, exploiting the complementary nature of audio and video features in cross-modality content.
  • results: On the video classification task on the UCF101 dataset, the method outperforms existing work and reaches state-of-the-art performance, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.
    Abstract Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on mask autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset outperforms the existing work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.
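A minimal sketch of the random patch-token masking used by the masked video autoencoders this work builds on (ImageMAE/VideoMAE-style); the decoder is then asked to reconstruct the hidden tokens. The masking ratio and strategy here are assumptions, and the audio branch of AV-MaskEnhancer is omitted.

```python
import torch

def random_token_mask(tokens, mask_ratio=0.9):
    """Randomly hide video patch tokens before the encoder.

    tokens: (batch, num_tokens, dim) patch embeddings of a video clip.
    Returns the visible tokens and a boolean mask marking the hidden ones.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # random score per token
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # lowest scores are kept
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool)
    batch_idx = torch.arange(B).unsqueeze(1)
    mask[batch_idx, keep_idx] = False              # False = visible, True = masked
    return visible, mask

tokens = torch.randn(2, 1568, 768)  # e.g., a 16-frame clip tokenised into patches
visible, mask = random_token_mask(tokens)
```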

Segmentation of Tubular Structures Using Iterative Training with Tailored Samples

  • paper_url: http://arxiv.org/abs/2309.08727
  • repo_url: None
  • paper_authors: Wei Liao
  • for: A minimal path method that simultaneously computes segmentation masks and extracts centerlines of tubular structures.
  • methods: The method uses CNN-extracted features and introduces a novel iterative training scheme that generates training samples better tailored to minimal path methods, addressing the discrepancy between samples encountered during training and inference.
  • results: Compared with seven previous approaches on three public datasets (including satellite and medical images), the method achieves state-of-the-art results both for segmentation masks and centerlines.
    Abstract We propose a minimal path method to simultaneously compute segmentation masks and extract centerlines of tubular structures with line-topology. Minimal path methods are commonly used for the segmentation of tubular structures in a wide variety of applications. Recent methods use features extracted by CNNs, and often outperform methods using hand-tuned features. However, for CNN-based methods, the samples used for training may be generated inappropriately, so that they can be very different from samples encountered during inference. We approach this discrepancy by introducing a novel iterative training scheme, which enables generating better training samples specifically tailored for the minimal path methods without changing existing annotations. In our method, segmentation masks and centerlines are not determined after one another by post-processing, but obtained using the same steps. Our method requires only very few annotated training images. Comparison with seven previous approaches on three public datasets, including satellite images and medical images, shows that our method achieves state-of-the-art results both for segmentation masks and centerlines.
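As a rough illustration of the classical minimal path idea the paper builds on: turn a vesselness or feature map (here a stand-in for CNN output) into a cost image and extract the cheapest path between two seed points. This uses scikit-image's route_through_array and is the generic technique, not the paper's full pipeline.

```python
import numpy as np
from skimage.graph import route_through_array

def minimal_path(vesselness, start, end, eps=1e-6):
    """Extract a centerline as the minimal-cost path through a vesselness map.

    vesselness : 2D array in [0, 1], high inside tubular structures.
    start, end : (row, col) seed points on the structure.
    """
    cost = 1.0 / (vesselness + eps)   # travelling is cheap where vesselness is high
    path, total_cost = route_through_array(cost, start, end,
                                           fully_connected=True, geometric=True)
    return np.array(path), total_cost

# Toy example: a bright horizontal "vessel" on a dark background.
img = np.full((64, 64), 0.05)
img[32, :] = 1.0
centerline, cost = minimal_path(img, (32, 2), (32, 60))
```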

Performance Metrics for Probabilistic Ordinal Classifiers

  • paper_url: http://arxiv.org/abs/2309.08701
  • repo_url: None
  • paper_authors: Adrian Galdran
  • for: This paper concerns how to evaluate the performance of probabilistic predictions made by ordinal classifiers.
  • methods: The paper advocates the Ranked Probability Score (RPS), a proper scoring rule popular in the forecasting field but largely overlooked in the image analysis community, and proposes a simple fix for a counter-intuitive behavior of this score.
  • results: Experiments on four large-scale biomedical image grading tasks over three different datasets show that the RPS is a more suitable performance metric for probabilistic ordinal predictions.
    Abstract Ordinal classification models assign higher penalties to predictions further away from the true class. As a result, they are appropriate for relevant diagnostic tasks like disease progression prediction or medical image grading. The consensus for assessing their categorical predictions dictates the use of distance-sensitive metrics like the Quadratic-Weighted Kappa score or the Expected Cost. However, there has been little discussion regarding how to measure performance of probabilistic predictions for ordinal classifiers. In conventional classification, common measures for probabilistic predictions are Proper Scoring Rules (PSR) like the Brier score, or Calibration Errors like the ECE, yet these are not optimal choices for ordinal classification. A PSR named Ranked Probability Score (RPS), widely popular in the forecasting field, is more suitable for this task, but it has received no attention in the image analysis community. This paper advocates the use of the RPS for image grading tasks. In addition, we demonstrate a counter-intuitive and questionable behavior of this score, and propose a simple fix for it. Comprehensive experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a more suitable performance metric for probabilistic ordinal predictions. Code to reproduce our experiments can be found at https://github.com/agaldran/prob_ord_metrics .
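The RPS compares cumulative predicted probabilities with the cumulative one-hot target, so probability mass placed on classes far from the true grade is penalised more, which is exactly the distance sensitivity ordinal tasks need. A minimal numpy sketch of one common (unnormalised) form of the score:

```python
import numpy as np

def ranked_probability_score(probs, labels):
    """Mean RPS for ordinal predictions (lower is better).

    probs  : (n_samples, n_classes) predicted probabilities, classes in order.
    labels : (n_samples,) integer ground-truth grades.
    RPS = sum_k (CDF_pred(k) - CDF_true(k))^2, averaged over samples;
    some definitions additionally divide by (n_classes - 1).
    """
    probs = np.asarray(probs, dtype=float)
    n, K = probs.shape
    onehot = np.eye(K)[np.asarray(labels)]
    cum_pred = np.cumsum(probs, axis=1)
    cum_true = np.cumsum(onehot, axis=1)
    return float(np.mean(np.sum((cum_pred - cum_true) ** 2, axis=1)))

# A prediction one grade away from the truth scores better (lower) than one two grades away.
p_near = [[0.1, 0.8, 0.1, 0.0]]
p_far  = [[0.1, 0.0, 0.1, 0.8]]
print(ranked_probability_score(p_near, [0]), ranked_probability_score(p_far, [0]))
```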

BANSAC: A dynamic BAyesian Network for adaptive SAmple Consensus

  • paper_url: http://arxiv.org/abs/2309.08690
  • repo_url: None
  • paper_authors: Valter Piedade, Pedro Miraldo
  • for: To improve the efficiency of robust estimation algorithms so that they can be applied more effectively in computer vision.
  • methods: Builds on the RANSAC loop of random sampling, hypothesis computation, and inlier counting, adding a dynamic Bayesian network that updates per-point inlier scores during the iterations and uses them for weighted sampling and a new stopping criterion.
  • results: On multiple real-world datasets, the method reduces computation time without degrading accuracy, achieving a favorable trade-off between accuracy and computation time.
    Abstract RANSAC-based algorithms are the standard techniques for robust estimation in computer vision. These algorithms are iterative and computationally expensive; they alternate between random sampling of data, computing hypotheses, and running inlier counting. Many authors tried different approaches to improve efficiency. One of the major improvements is having a guided sampling, letting the RANSAC cycle stop sooner. This paper presents a new adaptive sampling process for RANSAC. Previous methods either assume no prior information about the inlier/outlier classification of data points or use some previously computed scores in the sampling. In this paper, we derive a dynamic Bayesian network that updates individual data points' inlier scores while iterating RANSAC. At each iteration, we apply weighted sampling using the updated scores. Our method works with or without prior data point scorings. In addition, we use the updated inlier/outlier scoring for deriving a new stopping criterion for the RANSAC loop. We test our method in multiple real-world datasets for several applications and obtain state-of-the-art results. Our method outperforms the baselines in accuracy while needing less computational time.
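A heavily simplified sketch of score-guided RANSAC for 2D line fitting: each point keeps an inlier score that is updated from residuals and drives weighted sampling of the next minimal set. The toy score update below only mimics the idea; the paper derives the actual update from a dynamic Bayesian network and also uses the scores for early stopping.

```python
import numpy as np

def weighted_ransac_line(points, iters=200, thresh=0.02, seed=0):
    """Fit a 2D line a*x + b*y + c = 0 with score-guided sampling."""
    rng = np.random.default_rng(seed)
    scores = np.full(len(points), 0.5)          # uninformative prior per point
    best_model, best_inliers = None, -1
    for _ in range(iters):
        p = scores / scores.sum()
        i, j = rng.choice(len(points), size=2, replace=False, p=p)
        (x1, y1), (x2, y2) = points[i], points[j]
        a, b = y2 - y1, x1 - x2
        c = -(a * x1 + b * y1)
        norm = np.hypot(a, b)
        if norm < 1e-12:
            continue
        resid = np.abs(points @ np.array([a, b]) + c) / norm
        inlier = resid < thresh
        if inlier.sum() > best_inliers:
            best_inliers, best_model = int(inlier.sum()), (a / norm, b / norm, c / norm)
            # crude score update standing in for the paper's Bayesian network
            scores = 0.7 * scores + 0.3 * np.where(inlier, 0.9, 0.1)
    return best_model, best_inliers

pts = np.column_stack([np.linspace(0, 1, 100), 0.5 * np.linspace(0, 1, 100)])
pts = np.vstack([pts + 0.005 * np.random.randn(*pts.shape),
                 np.random.rand(40, 2)])        # noisy inliers + uniform outliers
model, n_inliers = weighted_ransac_line(pts)
```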

Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion

  • paper_url: http://arxiv.org/abs/2309.08596
  • repo_url: https://github.com/wengflow/robust-e-nerf
  • paper_authors: Weng Fei Low, Gim Hee Lee
  • for: This paper studies how to directly and robustly reconstruct NeRFs from moving event cameras.
  • methods: The method uses a realistic event generation model that accounts for intrinsic parameters (e.g., time-independent, asymmetric thresholds and the refractory period) and non-idealities (e.g., pixel-to-pixel threshold variation), together with a complementary pair of normalized reconstruction losses that generalize to arbitrary speed profiles and intrinsic parameter values.
  • results: Experiments on real and novel realistically simulated sequences verify the effectiveness and robustness of the method, even with sparse and noisy events generated under non-uniform motion.
    Abstract Event cameras offer many advantages over standard cameras due to their distinctive principle of operation: low power, low latency, high temporal resolution and high dynamic range. Nonetheless, the success of many downstream visual applications also hinges on an efficient and effective scene representation, where Neural Radiance Field (NeRF) is seen as the leading candidate. Such promise and potential of event cameras and NeRF inspired recent works to investigate on the reconstruction of NeRF from moving event cameras. However, these works are mainly limited in terms of the dependence on dense and low-noise event streams, as well as generalization to arbitrary contrast threshold values and camera speed profiles. In this work, we propose Robust e-NeRF, a novel method to directly and robustly reconstruct NeRFs from moving event cameras under various real-world conditions, especially from sparse and noisy events generated under non-uniform motion. It consists of two key components: a realistic event generation model that accounts for various intrinsic parameters (e.g. time-independent, asymmetric threshold and refractory period) and non-idealities (e.g. pixel-to-pixel threshold variation), as well as a complementary pair of normalized reconstruction losses that can effectively generalize to arbitrary speed profiles and intrinsic parameter values without such prior knowledge. Experiments on real and novel realistically simulated sequences verify our effectiveness. Our code, synthetic dataset and improved event simulator are public.
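To make the event generation model concrete, below is a simplified simulator of the kind of effects the paper models: an event fires when the log-intensity at a pixel drifts by more than a contrast threshold since its last event, with asymmetric thresholds, per-pixel threshold noise, and a refractory period. The numerical values are illustrative, and at most one event per pixel is emitted per frame, so this is only a sketch of the idea.

```python
import numpy as np

def simulate_events(log_frames, timestamps, c_pos=0.25, c_neg=0.25,
                    sigma_c=0.03, refractory=1e-3, seed=0):
    """Generate (t, x, y, polarity) events from a log-intensity video.

    log_frames : (T, H, W) log-intensity frames, timestamps : (T,) seconds.
    """
    rng = np.random.default_rng(seed)
    T, H, W = log_frames.shape
    thr_pos = c_pos + sigma_c * rng.standard_normal((H, W))  # pixel-to-pixel variation
    thr_neg = c_neg + sigma_c * rng.standard_normal((H, W))
    ref = log_frames[0].copy()                # log-intensity at the last event
    last_t = np.full((H, W), -np.inf)         # time of the last event per pixel
    events = []
    for k in range(1, T):
        t, frame = timestamps[k], log_frames[k]
        diff = frame - ref
        active = t - last_t >= refractory     # pixels outside their refractory period
        pos = active & (diff >= thr_pos)
        neg = active & (diff <= -thr_neg)
        for mask, pol, thr in ((pos, +1, thr_pos), (neg, -1, thr_neg)):
            ys, xs = np.nonzero(mask)
            events.extend((t, x, y, pol) for y, x in zip(ys, xs))
            ref[mask] += pol * thr[mask]      # move the reference by one threshold
            last_t[mask] = t
    return events
```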

Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes

  • paper_url: http://arxiv.org/abs/2309.08588
  • repo_url: https://github.com/Neabfi/robust-rotation-estimation
  • paper_authors: Fabien Delattre, David Dirnfeld, Phat Nguyen, Stephen Scarano, Michael J. Jones, Pedro Miraldo, Erik Learned-Miller
  • for: Estimating camera rotation in crowded, real-world scenes, particularly from handheld monocular video, is an under-studied problem.
  • methods: The authors propose an efficient and robust camera rotation estimation method based on a generalization of the Hough transform on SO(3), finding the rotation most compatible with optical flow.
  • results: Compared with other similarly fast methods, the approach reduces error by almost 50% over the next best and is more accurate than any method irrespective of speed, establishing a strong new performance point for crowded scenes. A new dataset and benchmark of 17 video sequences with rigorously verified ground truth is also released.
    Abstract We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50\% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at https://fabiendelattre.com/robust-rotation-estimation.
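A brute-force stand-in for the voting idea: under a small-rotation, distant-scene approximation, each candidate rotation predicts an optical flow field, and candidates are scored by how many observed flow vectors they explain. The paper instead accumulates votes efficiently in a Hough accumulator on SO(3); the sign convention of the rotational flow model below is one common choice and the grid parameters are illustrative.

```python
import numpy as np

def best_rotation_by_voting(x, y, u, v, f, max_w=0.05, steps=9, tol=1.0):
    """Pick the small camera rotation (wx, wy, wz) most compatible with the flow.

    x, y : pixel coordinates relative to the principal point, u, v : optical flow,
    f : focal length in pixels.  For a pure rotation and distant scene,
        u_pred = (x*y/f)*wx - (f + x**2/f)*wy + y*wz
        v_pred = (f + y**2/f)*wx - (x*y/f)*wy - x*wz
    Each candidate receives one vote per flow vector it explains within `tol` pixels.
    """
    grid = np.linspace(-max_w, max_w, steps)
    best, best_votes = None, -1
    for wx in grid:
        for wy in grid:
            for wz in grid:
                u_pred = (x * y / f) * wx - (f + x**2 / f) * wy + y * wz
                v_pred = (f + y**2 / f) * wx - (x * y / f) * wy - x * wz
                votes = np.count_nonzero(np.hypot(u - u_pred, v - v_pred) < tol)
                if votes > best_votes:
                    best_votes, best = votes, (wx, wy, wz)
    return best, best_votes
```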

Replacing softmax with ReLU in Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.08586
  • repo_url: None
  • paper_authors: Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
  • for: This work studies the accuracy impact of replacing the attention softmax with a point-wise activation such as ReLU, and finds that in vision transformers the degradation is mitigated when dividing by the sequence length.
  • methods: The study trains small to large vision transformers on ImageNet-21k and compares the performance of ReLU-attention and softmax-attention.
  • results: The results indicate that ReLU-attention can approach or match softmax-attention in terms of scaling behavior as a function of compute.
    Abstract Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
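A minimal PyTorch sketch of the attention variant described: replace the softmax over keys with a point-wise ReLU and divide by the sequence length. This follows the abstract's description; the exact scaling used in the paper's experiments may differ slightly.

```python
import torch
import torch.nn.functional as F

def relu_attention(q, k, v):
    """Attention with ReLU in place of softmax, scaled by sequence length.

    q, k, v : (batch, heads, seq_len, head_dim).  A plain point-wise ReLU has no
    normalisation across keys, so dividing the activations by the sequence
    length L (roughly matching softmax's average weight of 1/L) is what
    recovers softmax-like scaling behaviour.
    """
    L, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5      # (B, H, L, L)
    weights = F.relu(scores) / L
    return weights @ v

q = k = v = torch.randn(1, 8, 196, 64)  # e.g., 14x14 patch tokens, 8 heads
out = relu_attention(q, k, v)
```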

Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

  • paper_url: http://arxiv.org/abs/2309.08585
  • repo_url: None
  • paper_authors: Xiaonan Lu, Jianlong Yuan, Ruigang Niu, Yuan Hu, Fan Wang
  • for: This work addresses a shortcoming of existing vision language foundation models (VLFMs) on the image change understanding (ICU) task: they handle single images well but cannot capture the changes between multiple images.
  • methods: A viewpoint integration and registration method that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters to effectively capture nuances between image pairs, together with a viewpoint registration flow and a semantic emphasizing module that reduce the performance degradation caused by viewpoint variations in the visual and semantic spaces.
  • results: Experiments on CLEVR-Change and Spot-the-Diff show that the method achieves state-of-the-art performance in all metrics.
    Abstract Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.
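For intuition on the adapter idea, below is the generic residual bottleneck adapter that such fine-tuning methods insert into a frozen pre-trained encoder block: only the small down/up projections are trained, so the foundation model's weights stay fixed. The paper's specific adapter and fused-adapter designs differ; this is a sketch of the building block they extend.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter added on top of a frozen transformer block's output."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, tokens, dim)
        return x + self.up(self.act(self.down(x)))

# Usage: wrap the output tokens of a frozen encoder block.
tokens = torch.randn(2, 197, 768)
adapter = Adapter(768)
out = adapter(tokens)
```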

The Impact of Different Backbone Architecture on Autonomous Vehicle Dataset

  • paper_url: http://arxiv.org/abs/2309.08564
  • repo_url: None
  • paper_authors: Ning Ding, Azim Eskandarian
  • for: This study evaluates the object detection performance of different backbone architectures in an autonomous driving environment.
  • methods: Three well-known autonomous driving datasets, KITTI, NuScenes, and BDD, are used to compare different backbone architectures on object detection tasks.
  • results: The study finds that performance varies considerably across backbone architectures and datasets, with some backbones performing better on particular datasets.
    Abstract Object detection is a crucial component of autonomous driving, and many detection applications have been developed to address this task. These applications often rely on backbone architectures, which extract representation features from inputs to perform the object detection task. The quality of the features extracted by the backbone architecture can have a significant impact on the overall detection performance. Many researchers have focused on developing new and improved backbone architectures to enhance the efficiency and accuracy of object detection applications. While these backbone architectures have shown state-of-the-art performance on generic object detection datasets like MS-COCO and PASCAL-VOC, evaluating their performance under an autonomous driving environment has not been previously explored. To address this, our study evaluates three well-known autonomous vehicle datasets, namely KITTI, NuScenes, and BDD, to compare the performance of different backbone architectures on object detection tasks.

Automated dermatoscopic pattern discovery by clustering neural network output for human-computer interaction

  • paper_url: http://arxiv.org/abs/2309.08533
  • repo_url: None
  • paper_authors: Lidia Talavera-Martinez, Philipp Tschandl
  • for: The objective of this study was to automatically discover human-interpretable visual patterns in a large image dataset for knowledge extraction.
  • methods: The study clusters images automatically via the k-means algorithm using neural-network-extracted image features.
  • results: Selecting the number of clusters with either an elbow method or a compactness metric balancing intra-lesion variance against the number of clusters yields highly interpretable visual patterns, and the majority of clusters could be manually mapped to previously described dermatoscopic diagnostic patterns.
    Abstract Background: As available medical image datasets increase in size, it becomes infeasible for clinicians to review content manually for knowledge extraction. The objective of this study was to create an automated clustering resulting in human-interpretable pattern discovery. Methods: Images from the public HAM10000 dataset, including 7 common pigmented skin lesion diagnoses, were tiled into 29420 tiles and clustered via k-means using neural network-extracted image features. The final number of clusters per diagnosis was chosen by either the elbow method or a compactness metric balancing intra-lesion variance and cluster numbers. The amount of resulting non-informative clusters, defined as those containing less than six image tiles, was compared between the two methods. Results: Applying k-means, the optimal elbow cutoff resulted in a mean of 24.7 (95%-CI: 16.4-33) clusters for every included diagnosis, including 14.9% (95% CI: 0.8-29.0) non-informative clusters. The optimal cutoff, as estimated by the compactness metric, resulted in significantly fewer clusters (13.4; 95%-CI 11.8-15.1; p=0.03) and less non-informative ones (7.5%; 95% CI: 0-19.5; p=0.017). The majority of clusters (93.6%) from the compactness metric could be manually mapped to previously described dermatoscopic diagnostic patterns. Conclusions: Automatically constraining unsupervised clustering can produce an automated extraction of diagnostically relevant and human-interpretable clusters of visual patterns from a large image dataset.
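A hedged sketch of the clustering pipeline: k-means over neural-network features of image tiles, with k chosen either at the inertia "elbow" or by a compactness-style criterion penalising both intra-cluster variance and the number of clusters. The concrete compactness formula and its weight below are assumptions, not the paper's exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tiles(features, k_range=range(2, 40)):
    """Cluster tile features with k-means and pick k by two criteria."""
    inertias, models = [], {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        models[k] = km
        inertias.append(km.inertia_)                  # intra-cluster variance

    # Elbow: k where the inertia curve bends most (largest second difference).
    second_diff = np.diff(inertias, 2)
    k_elbow = list(k_range)[int(np.argmax(second_diff)) + 1]

    # Compactness-style criterion (illustrative weights): penalise variance and k.
    ks = np.array(list(k_range))
    score = np.array(inertias) / inertias[0] + 0.02 * ks
    k_compact = int(ks[int(np.argmin(score))])
    return models[k_elbow], models[k_compact]

features = np.random.randn(500, 128)   # stand-in for neural-network tile features
model_elbow, model_compact = cluster_tiles(features)
```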

Breathing New Life into 3D Assets with Generative Repainting

  • paper_url: http://arxiv.org/abs/2309.08523
  • repo_url: https://github.com/toshas/remesh_isotropic_planar
  • paper_authors: Tianfu Wang, Menelaos Kanakis, Konrad Schindler, Luc Van Gool, Anton Obukhov
  • for: The paper proposes a method for repainting 3D assets with pretrained diffusion models, to improve the quality of textured 3D asset generation.
  • methods: The method uses pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools that cooperate in a non-learned fashion, yielding a modular pipeline whose parts can be upgraded separately.
  • results: A large-scale study on a wide range of objects and categories from the ShapeNetSem dataset demonstrates the advantages of the approach, both qualitatively and quantitatively.
    Abstract Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators. Broad adoption of these models is due to significant improvement in the quality of generations and efficient conditioning on various modalities, not just text. However, lifting the rich generative priors of these 2D models into 3D is challenging. Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields. We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools and demonstrate their ability to work together in a non-learned fashion. Such modularity has the intrinsic advantage of eased partial upgrades, which became an important property in such a fast-paced domain. Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools, and outputs a painted input geometry in several formats. We conduct a large-scale study on a wide range of objects and categories from the ShapeNetSem dataset and demonstrate the advantages of our approach, both qualitatively and quantitatively. Project page: https://www.obukhov.ai/repainting_3d_assets

Generalised Probabilistic Diffusion Scale-Spaces

  • paper_url: http://arxiv.org/abs/2309.08511
  • repo_url: None
  • paper_authors: Pascal Peter
  • for: This paper studies probabilistic diffusion models, which are used to sample new images from learned distributions.
  • methods: Motivated by drift-diffusion concepts from physics, these models apply image perturbations such as noise and blur in a forward process that results in a tractable probability distribution; the paper proposes a generalised scale-space theory for such models.
  • results: A learned reverse process generates images and can be conditioned on side information, enabling a wide variety of practical applications; the paper further shows conceptual and empirical connections to classical diffusion and osmosis filters, a theoretical background that has so far remained largely unexplored.
    Abstract Probabilistic diffusion models excel at sampling new images from learned distributions. Originally motivated by drift-diffusion concepts from physics, they apply image perturbations such as noise and blur in a forward process that results in a tractable probability distribution. A corresponding learned reverse process generates images and can be conditioned on side information, which leads to a wide variety of practical applications. Most of the research focus currently lies on practice-oriented extensions. In contrast, the theoretical background remains largely unexplored, in particular the relations to drift-diffusion. In order to shed light on these connections to classical image filtering, we propose a generalised scale-space theory for probabilistic diffusion models. Moreover, we show conceptual and empirical connections to diffusion and osmosis filters.
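For reference, the forward (noising) process that defines the scale-space-like family of perturbed images in DDPM-style models can be written as x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps with alpha_bar_t the cumulative product of (1 - beta_s). A minimal numpy sketch with a standard linear schedule (the paper's generalised formulation also covers other perturbations such as blur):

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style noise scale-space.

    x0 : clean image array, t : integer time step, betas : noise schedule.
    Increasing t traces out a one-parameter family of progressively
    perturbed images, analogous to a classical scale-space.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)          # standard linear schedule
x0 = np.random.rand(64, 64)                    # stand-in for an image
x_early = forward_diffusion(x0, 10, betas)     # mildly perturbed
x_late = forward_diffusion(x0, 900, betas)     # close to pure noise
```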

OccupancyDETR: Making Semantic Scene Completion as Straightforward as Object Detection

  • paper_url: http://arxiv.org/abs/2309.08504
  • repo_url: https://github.com/jypjypjypjyp/occupancydetr
  • paper_authors: Yupeng Jia, Jie He, Runze Chen, Fang Zhao, Haiyong Luo
  • for: The paper proposes a novel 3D semantic occupancy perception method for robotic applications like autonomous driving, aiming to improve robots' understanding of their surroundings while reducing computational demand.
  • methods: The proposed method, OccupancyDETR, consists of a DETR-like object detection module and a 3D occupancy decoder module, which integrate object detection and 3D occupancy grid prediction to simplify the method and improve performance on small objects.
  • results: The proposed method is demonstrated on the SemanticKITTI dataset and achieves an mIoU of 23 and a processing speed of 6 frames per second, showcasing its effectiveness and potential for real-time 3D semantic scene completion.
    Abstract Visual-based 3D semantic occupancy perception (also known as 3D semantic scene completion) is a new perception paradigm for robotic applications like autonomous driving. Compared with Bird's Eye View (BEV) perception, it extends the vertical dimension, significantly enhancing the ability of robots to understand their surroundings. However, due to this very reason, the computational demand for current 3D semantic occupancy perception methods generally surpasses that of BEV perception methods and 2D perception methods. We propose a novel 3D semantic occupancy perception method, OccupancyDETR, which consists of a DETR-like object detection module and a 3D occupancy decoder module. The integration of object detection simplifies our method structurally - instead of predicting the semantics of each voxels, it identifies objects in the scene and their respective 3D occupancy grids. This speeds up our method, reduces required resources, and leverages object detection algorithm, giving our approach notable performance on small objects. We demonstrate the effectiveness of our proposed method on the SemanticKITTI dataset, showcasing an mIoU of 23 and a processing speed of 6 frames per second, thereby presenting a promising solution for real-time 3D semantic scene completion.

YCB-Ev: Event-vision dataset for 6DoF object pose estimation

  • paper_url: http://arxiv.org/abs/2309.08482
  • repo_url: https://github.com/paroj/ycbev
  • paper_authors: Pavel Rojtberg, Thomas Pöllabauer
  • for: Introduces the YCB-Ev dataset, a synchronized RGB-D and event data dataset for evaluating 6DoF object pose estimation algorithms.
  • methods: Provides ground truth 6DoF object poses for the same 21 YCB objects as the YCB-Video dataset, enabling evaluation of algorithm performance when transferred across datasets.
  • results: Evaluates the generalization capabilities of two state-of-the-art algorithms, pre-trained for the BOP challenge, using the novel YCB-V sequences in the dataset.
    Abstract Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data that enables evaluating 6DoF object pose estimation algorithms using these modalities. This dataset provides ground truth 6DoF object poses for the same 21 YCB objects \cite{calli2017yale} that were used in the YCB-Video (YCB-V) dataset, enabling the evaluation of algorithm performance when transferred across datasets. The dataset consists of 21 synchronized event and RGB-D sequences, amounting to a total of 7:43 minutes of video. Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Our dataset is the first to provide ground truth 6DoF pose data for event streams. Furthermore, we evaluate the generalization capabilities of two state-of-the-art algorithms, which were pre-trained for the BOP challenge, using our novel YCB-V sequences. The proposed dataset is available at https://github.com/paroj/ycbev.

3D Arterial Segmentation via Single 2D Projections and Depth Supervision in Contrast-Enhanced CT Images

  • paper_url: http://arxiv.org/abs/2309.08481
  • repo_url: https://github.com/alinafdima/3dseg-mip-depth
  • paper_authors: Alina F. Dima, Veronika A. Zimmer, Martin J. Menten, Hongwei Bran Li, Markus Graf, Tristan Lemke, Philipp Raffler, Robert Graf, Jan S. Kirschke, Rickmer Braren, Daniel Rueckert
  • for: This paper proposes a new 3D vessel segmentation method to support the quantitative diagnosis and treatment of many vascular diseases.
  • methods: The method is based on deep learning and requires only a single annotated 2D projection per training image, combined with depth supervision.
  • results: The method accurately segments the 3D peripancreatic arteries while substantially reducing the annotation effort, nearly closing the performance gap to full 3D supervision.
    Abstract Automated segmentation of the blood vessels in 3D volumes is an essential step for the quantitative diagnosis and treatment of many vascular diseases. 3D vessel segmentation is being actively investigated in existing works, mostly in deep learning approaches. However, training 3D deep networks requires large amounts of manual 3D annotations from experts, which are laborious to obtain. This is especially the case for 3D vessel segmentation, as vessels are sparse yet spread out over many slices and disconnected when visualized in 2D slices. In this work, we propose a novel method to segment the 3D peripancreatic arteries solely from one annotated 2D projection per training image with depth supervision. We perform extensive experiments on the segmentation of peripancreatic arteries on 3D contrast-enhanced CT images and demonstrate how well we capture the rich depth information from 2D projections. We demonstrate that by annotating a single, randomly chosen projection for each training sample, we obtain comparable performance to annotating multiple 2D projections, thereby reducing the annotation effort. Furthermore, by mapping the 2D labels to the 3D space using depth information and incorporating this into training, we almost close the performance gap between 3D supervision and 2D supervision. Our code is available at: https://github.com/alinafdima/3Dseg-mip-depth.
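To illustrate the projection-with-depth idea: from a 3D CT volume one can compute a maximum intensity projection (the 2D image an annotator would label) together with the depth (argmax slice index) of each projected voxel, which is the information that allows a 2D label to be mapped back into 3D. Function names are illustrative and this is only a sketch of the concept, not the paper's training pipeline.

```python
import numpy as np

def mip_with_depth(volume):
    """Maximum intensity projection along axis 0, plus per-pixel depth.

    volume : (D, H, W) contrast-enhanced CT intensities.
    Returns the 2D projection that would be annotated and, for every
    projected pixel, the slice index of the voxel that produced it.
    """
    mip = volume.max(axis=0)
    depth = volume.argmax(axis=0)
    return mip, depth

def lift_2d_label_to_3d(label_2d, depth, shape):
    """Place annotated 2D pixels at their recorded depth in an empty 3D mask."""
    mask_3d = np.zeros(shape, dtype=bool)
    ys, xs = np.nonzero(label_2d)
    mask_3d[depth[ys, xs], ys, xs] = True      # sparse 3D labels from the 2D annotation
    return mask_3d

volume = np.random.rand(128, 256, 256)
mip, depth = mip_with_depth(volume)
label_2d = mip > 0.999                         # stand-in for a manual 2D vessel annotation
sparse_3d = lift_2d_label_to_3d(label_2d, depth, volume.shape)
```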
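The core training signal described above, supervising a 3D prediction through a single annotated 2D projection, can be illustrated with a maximum-intensity-projection loss. The sketch below assumes the annotated projection is taken along the depth axis of the volume; it is only an illustration of the idea, not the authors' implementation (which additionally maps 2D labels into 3D using depth information).

```python
import torch
import torch.nn.functional as F

def projection_loss(pred_vol, label_2d, dim=2):
    """Supervise a 3D probability volume with one annotated 2D projection.

    pred_vol: (B, 1, D, H, W) voxel-wise vessel probabilities in [0, 1].
    label_2d: (B, 1, H, W) binary annotation of the 2D projection.
    The prediction is collapsed along the projection axis with a max,
    mimicking a maximum-intensity projection (MIP), and compared to the label.
    """
    pred_mip, _ = pred_vol.max(dim=dim)          # (B, 1, H, W)
    return F.binary_cross_entropy(pred_mip, label_2d)

# toy example
pred = torch.rand(2, 1, 32, 64, 64, requires_grad=True)
label = (torch.rand(2, 1, 64, 64) > 0.9).float()
loss = projection_loss(pred, label)
loss.backward()
print(loss.item())
```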

PoseFix: Correcting 3D Human Poses with Natural Language

  • paper_url: http://arxiv.org/abs/2309.08480
  • repo_url: None
  • paper_authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez
  • for: Correcting 3D human poses with natural language feedback.
  • methods: Introduces the PoseFix dataset of paired 3D poses and textual corrections; tasks include text-based pose editing and correctional text generation.
  • results: Demonstrates potential for assisted 3D character animation and robot teaching.
    Abstract Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback, that describe how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, that aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses.

TreeLearn: A Comprehensive Deep Learning Method for Segmenting Individual Trees from Forest Point Clouds

  • paper_url: http://arxiv.org/abs/2309.08471
  • repo_url: https://github.com/ecker-lab/treelearn
  • paper_authors: Jonathan Henrich, Jan van Delden, Dominik Seidel, Thomas Kneib, Alexander Ecker
  • for: Proposes a deep-learning-based semantic and instance segmentation method for forest point clouds to improve information extraction for forest management.
  • methods: Trained in a data-driven manner on already-segmented point clouds, learning semantic and instance features without hand-crafted features or algorithms.
  • results: Performs on par with or better than the Lidar360-based pipeline used to generate its training labels on a new manually segmented benchmark, and improves substantially after fine-tuning on the cleanly labeled benchmark data.
    Abstract Laser-scanned point clouds of forests make it possible to extract valuable information for forest management. To consider single trees, a forest point cloud needs to be segmented into individual tree point clouds. Existing segmentation methods are usually based on hand-crafted algorithms, such as identifying trunks and growing trees from them, and face difficulties in dense forests with overlapping tree crowns. In this study, we propose TreeLearn, a deep learning-based approach for semantic and instance segmentation of forest point clouds. Unlike previous methods, TreeLearn is trained on already segmented point clouds in a data-driven manner, making it less reliant on predefined features and algorithms. Additionally, we introduce a new manually segmented benchmark forest dataset containing 156 full trees, and 79 partial trees, that have been cleanly segmented by hand. This enables the evaluation of instance segmentation performance going beyond just evaluating the detection of individual trees. We trained TreeLearn on forest point clouds of 6665 trees, labeled using the Lidar360 software. An evaluation on the benchmark dataset shows that TreeLearn performs equally well or better than the algorithm used to generate its training data. Furthermore, the method's performance can be vastly improved by fine-tuning on the cleanly labeled benchmark dataset. The TreeLearn code is available from https://github.com/ecker-lab/TreeLearn. The data as well as trained models can be found at https://doi.org/10.25625/VPMPID.
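Instance segmentation of individual trees from per-point network outputs is commonly finished by clustering points in a learned offset or embedding space. The sketch below shows such a generic grouping stage using DBSCAN on offset-shifted coordinates; the offset formulation and the eps/min_samples values are assumptions for illustration, not TreeLearn's exact pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_tree_instances(xyz, pred_offsets, semantic_mask, eps=0.5, min_samples=30):
    """Group tree points into instances by clustering offset-shifted coordinates.

    xyz: (N, 3) point coordinates in metres.
    pred_offsets: (N, 3) predicted offsets pointing towards each tree's centre.
    semantic_mask: (N,) boolean, True for points predicted as 'tree'.
    Returns an (N,) array of instance ids (-1 for non-tree or noise points).
    """
    instance_ids = np.full(xyz.shape[0], -1, dtype=int)
    shifted = xyz[semantic_mask] + pred_offsets[semantic_mask]
    if shifted.shape[0] == 0:
        return instance_ids
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(shifted)
    instance_ids[semantic_mask] = labels
    return instance_ids
```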

Segment Anything Model for Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2309.08434
  • repo_url: None
  • paper_authors: Peng Zhang, Yaping Wang
  • for: Accurate segmentation of glioma brain tumors is essential for clinical diagnosis and treatment.
  • methods: Applies Meta AI's Segment Anything Model (SAM) to brain tumor segmentation without any model fine-tuning.
  • results: Without fine-tuning, SAM still lags behind current state-of-the-art (SOTA) models on brain tumor segmentation.
    Abstract Glioma is a prevalent brain tumor that poses a significant health risk to individuals. Accurate segmentation of brain tumor is essential for clinical diagnosis and treatment. The Segment Anything Model(SAM), released by Meta AI, is a fundamental model in image segmentation and has excellent zero-sample generalization capabilities. Thus, it is interesting to apply SAM to the task of brain tumor segmentation. In this study, we evaluated the performance of SAM on brain tumor segmentation and found that without any model fine-tuning, there is still a gap between SAM and the current state-of-the-art(SOTA) model.
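The gap to the state of the art reported here is typically quantified with the Dice similarity coefficient between the predicted mask and the expert annotation. The snippet below is a generic evaluation sketch; the SAM prompting step is only indicated in a comment, and the dummy masks are placeholders rather than the paper's protocol.

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    """Dice similarity coefficient between two binary masks (H, W) or (D, H, W)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

# pred_mask would come from SAM given a point/box prompt on each MRI slice
# (e.g. via the segment-anything SamPredictor); here we use dummy masks.
pred = np.zeros((128, 128), dtype=bool); pred[40:90, 40:90] = True
gt = np.zeros((128, 128), dtype=bool);   gt[45:95, 45:95] = True
print(f"Dice: {dice_score(pred, gt):.3f}")
```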

X-PDNet: Accurate Joint Plane Instance Segmentation and Monocular Depth Estimation with Cross-Task Distillation and Boundary Correction

  • paper_url: http://arxiv.org/abs/2309.08424
  • repo_url: https://github.com/caodinhduc/x-pdnet-official
  • paper_authors: Duc Cao Dinh, J Lim
  • for: Proposes a multitask learning framework for joint plane instance segmentation and monocular depth estimation.
  • methods: A cross-task feature distillation design promotes early information sharing between the two tasks, and a depth-guided boundary loss exploits depth information for precise boundary-region segmentation.
  • results: Outperforms the baseline by large margins on ScanNet and the Stanford 2D-3D-S dataset, and contributes more than 3,000 manually annotated images for evaluating plane instance segmentation.
    Abstract Segmentation of planar regions from a single RGB image is a particularly important task in the perception of complex scenes. To utilize both visual and geometric properties in images, recent approaches often formulate the problem as a joint estimation of planar instances and dense depth through feature fusion mechanisms and geometric constraint losses. Despite promising results, these methods do not consider cross-task feature distillation and perform poorly in boundary regions. To overcome these limitations, we propose X-PDNet, a framework for the multitask learning of plane instance segmentation and depth estimation with improvements in the following two aspects. Firstly, we construct the cross-task distillation design which promotes early information sharing between dual-tasks for specific task improvements. Secondly, we highlight the current limitations of using the ground truth boundary to develop boundary regression loss, and propose a novel method that exploits depth information to support precise boundary region segmentation. Finally, we manually annotate more than 3000 images from Stanford 2D-3D-Semantics dataset and make available for evaluation of plane instance segmentation. Through the experiments, our proposed methods prove the advantages, outperforming the baseline with large improvement margins in the quantitative results on the ScanNet and the Stanford 2D-3D-S dataset, demonstrating the effectiveness of our proposals.
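Cross-task feature distillation between a plane-segmentation branch and a depth branch can be written as a feature-alignment loss after projecting both branches into a shared embedding space. The module below is a minimal sketch under that assumption (1x1 projections plus a cosine-agreement term); it is not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossTaskDistillation(nn.Module):
    """Align intermediate features of two task branches via 1x1 projections."""

    def __init__(self, seg_channels, depth_channels, shared_channels=128):
        super().__init__()
        self.proj_seg = nn.Conv2d(seg_channels, shared_channels, kernel_size=1)
        self.proj_depth = nn.Conv2d(depth_channels, shared_channels, kernel_size=1)

    def forward(self, feat_seg, feat_depth):
        # project both branches into a shared embedding space
        z_seg = F.normalize(self.proj_seg(feat_seg), dim=1)
        z_depth = F.normalize(self.proj_depth(feat_depth), dim=1)
        # encourage the two task representations to agree (cosine distance)
        return (1.0 - (z_seg * z_depth).sum(dim=1)).mean()

# toy usage
distill = CrossTaskDistillation(seg_channels=64, depth_channels=96)
loss = distill(torch.randn(2, 64, 60, 80), torch.randn(2, 96, 60, 80))
loss.backward()
```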

MIML: Multiplex Image Machine Learning for High Precision Cell Classification via Mechanical Traits within Microfluidic Systems

  • paper_url: http://arxiv.org/abs/2309.08421
  • repo_url: None
  • paper_authors: Khayrul Islam, Ratul Paul, Shen Wang, Yaling Liu
  • for: This paper aims to develop a novel machine learning framework for label-free cell classification, addressing the limitations of existing techniques in terms of specificity and speed.
  • methods: The proposed framework, called Multiplex Image Machine Learning (MIML), combines label-free cell images with biomechanical property data to offer a more holistic understanding of cellular properties.
  • results: The MIML approach achieves a remarkable 98.3% accuracy in cell classification, outperforming models that only consider a single data type. It has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its flexibility and transfer learning capability.
    Abstract Label-free cell classification is advantageous for supplying pristine cells for further use or examination, yet existing techniques frequently fall short in terms of specificity and speed. In this study, we address these limitations through the development of a novel machine learning framework, Multiplex Image Machine Learning (MIML). This architecture uniquely combines label-free cell images with biomechanical property data, harnessing the vast, often underutilized morphological information intrinsic to each cell. By integrating both types of data, our model offers a more holistic understanding of the cellular properties, utilizing morphological information typically discarded in traditional machine learning models. This approach has led to a remarkable 98.3\% accuracy in cell classification, a substantial improvement over models that only consider a single data type. MIML has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its inherent flexibility and transfer learning capability. It's particularly effective for cells with similar morphology but distinct biomechanical properties. This innovative approach has significant implications across various fields, from advancing disease diagnostics to understanding cellular behavior.
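Fusing a label-free cell image with a small vector of biomechanical measurements is naturally expressed as a two-branch network whose embeddings are concatenated before a classification head. The model below is a schematic sketch of that idea; the layer sizes, number of mechanical features, and input resolution are assumptions, not the MIML architecture.

```python
import torch
import torch.nn as nn

class ImageMechanicsClassifier(nn.Module):
    """Fuse a cell-image embedding with biomechanical features for classification."""

    def __init__(self, n_mech_features=4, n_classes=2):
        super().__init__()
        self.image_branch = nn.Sequential(          # tiny CNN for 64x64 grayscale crops
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                            # -> (B, 32)
        )
        self.mech_branch = nn.Sequential(            # MLP for e.g. deformability features
            nn.Linear(n_mech_features, 16), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 16, n_classes)

    def forward(self, image, mech):
        fused = torch.cat([self.image_branch(image), self.mech_branch(mech)], dim=1)
        return self.head(fused)

model = ImageMechanicsClassifier()
logits = model(torch.randn(8, 1, 64, 64), torch.randn(8, 4))
print(logits.shape)  # torch.Size([8, 2])
```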

Deformable Neural Radiance Fields using RGB and Event Cameras

  • paper_url: http://arxiv.org/abs/2309.08416
  • repo_url: None
  • paper_authors: Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
  • for: Modeling fast-moving deformable objects from visual data using RGB and event cameras.
  • methods: Models deformable neural radiance fields from an asynchronous event stream and calibrated sparse RGB frames, jointly optimizing the unknown camera poses at individual events together with the radiance field, with active event sampling during learning.
  • results: Outperforms the state of the art and the compared baseline on both realistically rendered graphics and real-world datasets.
    Abstract Modeling Neural Radiance Fields for fast-moving deformable objects from visual data alone is a challenging problem. A major issue arises due to the high deformation and low acquisition rates. To address this problem, we propose to use event cameras that offer very fast acquisition of visual change in an asynchronous manner. In this work, we develop a novel method to model the deformable neural radiance fields using RGB and event cameras. The proposed method uses the asynchronous stream of events and calibrated sparse RGB frames. In our setup, the camera pose at the individual events required to integrate them into the radiance fields remains unknown. Our method jointly optimizes these poses and the radiance field. This happens efficiently by leveraging the collection of events at once and actively sampling the events during learning. Experiments conducted on both realistically rendered graphics and real-world datasets demonstrate a significant benefit of the proposed method over the state-of-the-art and the compared baseline. This shows a promising direction for modeling deformable neural radiance fields in real-world dynamic scenes.

3D SA-UNet: 3D Spatial Attention UNet with 3D ASPP for White Matter Hyperintensities Segmentation

  • paper_url: http://arxiv.org/abs/2309.08402
  • repo_url: https://github.com/hjkuijf/wmhchallenge
  • paper_authors: Changlu Guo
  • for: Improve the accuracy of automatic segmentation of white matter hyperintensities (WMH) in FLAIR images to support early diagnosis of several diseases.
  • methods: Proposes 3D Spatial Attention U-Net (3D SA-UNet), which uses only FLAIR scans; a 3D spatial attention module highlights lesion features such as WMH while suppressing irrelevant regions, and a 3D extension of Atrous Spatial Pyramid Pooling (ASPP) captures features at multiple scales.
  • results: On a public dataset, the 3D spatial attention module and 3D ASPP are shown to be effective, and 3D SA-UNet achieves higher accuracy than other state-of-the-art 3D convolutional neural networks.
    Abstract White Matter Hyperintensity (WMH) is an imaging feature related to various diseases such as dementia and stroke. Accurately segmenting WMH using computer technology is crucial for early disease diagnosis. However, this task remains challenging due to the small lesions with low contrast and high discontinuity in the images, which contain limited contextual and spatial information. To address this challenge, we propose a deep learning model called 3D Spatial Attention U-Net (3D SA-UNet) for automatic WMH segmentation using only Fluid Attenuation Inversion Recovery (FLAIR) scans. The 3D SA-UNet introduces a 3D Spatial Attention Module that highlights important lesion features, such as WMH, while suppressing unimportant regions. Additionally, to capture features at different scales, we extend the Atrous Spatial Pyramid Pooling (ASPP) module to a 3D version, enhancing the segmentation performance of the network. We evaluate our method on publicly available dataset and demonstrate the effectiveness of 3D spatial attention module and 3D ASPP in WMH segmentation. Through experimental results, it has been demonstrated that our proposed 3D SA-UNet model achieves higher accuracy compared to other state-of-the-art 3D convolutional neural networks.
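A 3D spatial attention module is commonly built by pooling across channels and predicting a per-voxel gate (the 3D analogue of CBAM-style spatial attention). The block below follows that standard pattern as a sketch; the kernel size is an assumption and this is not necessarily the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """Per-voxel attention gate computed from channel-wise avg and max pooling."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (B, C, D, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)   # (B, 1, D, H, W)
        max_pool, _ = x.max(dim=1, keepdim=True)
        attn = self.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                          # reweight voxels, keep channels

sa = SpatialAttention3D()
out = sa(torch.randn(1, 32, 16, 64, 64))
print(out.shape)  # torch.Size([1, 32, 16, 64, 64])
```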

An inspection technology of inner surface of the fine hole based on machine vision

  • paper_url: http://arxiv.org/abs/2309.08649
  • repo_url: None
  • paper_authors: Rongfang He, Weibin Zhang, Guofang Gao
  • for: detect the quality of the inner surface of fine holes in industrial components
  • methods: uses a special optical measurement system with a sight pipe and flexible light array to guide external illumination light into the fine hole and output relevant images
  • results: can measure the inner surface quality of fine holes with a diameter range of 4mm to 47mm and a depth of up to 47mm, with a maximum measurement error standard deviation of about 10um
    Abstract Fine holes are an important structural component of industrial components, and their inner surface quality is closely related to their function. In order to detect the quality of the inner surface of the fine hole, a special optical measurement system was investigated in this paper. A sight pipe is employed to guide the external illumination light into the fine hole and output the relevant images simultaneously. A flexible light array is introduced to suit the narrow space, and the effective field of view is analyzed. Besides, the arc surface projection error and manufacturing assembly error of the device are analyzed, then compensated or ignored if small enough. In the test of prefabricated circular defects with the diameter φ0.1 mm, φ0.2 mm, 0.4 mm distance distribution and the fissure defects with the width 0.3 mm, the maximum measurement error standard deviation are all about 10 μm. The minimum diameter of the measured fine hole is 4 mm and the depth can reach 47 mm.

Double Domain Guided Real-Time Low-Light Image Enhancement for Ultra-High-Definition Transportation Surveillance

  • paper_url: http://arxiv.org/abs/2309.08382
  • repo_url: https://github.com/qujx/ddnet
  • paper_authors: Jingxiang Qu, Ryan Wen Liu, Yuan Gao, Yu Guo, Fenghua Zhu, Fei-yue Wang
  • for: Proposes an efficient and reliable low-light image enhancement network (DDNet) for real-time ultra-high-definition (UHD) transportation surveillance in intelligent transportation systems (ITS).
  • methods: An encoder-decoder architecture that splits enhancement into two subtasks (color enhancement and gradient enhancement) via the proposed Coarse Enhancement Module (CEM) and LoG-based Gradient Enhancement Module (GEM), enhancing color and edge features simultaneously.
  • results: Provides superior enhancement quality and efficiency compared with state-of-the-art methods on standard and transportation-related datasets, and object detection and scene segmentation experiments confirm its practical value.
    Abstract Real-time transportation surveillance is an essential part of the intelligent transportation system (ITS). However, images captured under low-light conditions often suffer the poor visibility with types of degradation, such as noise interference and vague edge features, etc. With the development of imaging devices, the quality of the visual surveillance data is continually increasing, like 2K and 4K, which has more strict requirements on the efficiency of image processing. To satisfy the requirements on both enhancement quality and computational speed, this paper proposes a double domain guided real-time low-light image enhancement network (DDNet) for ultra-high-definition (UHD) transportation surveillance. Specifically, we design an encoder-decoder structure as the main architecture of the learning network. In particular, the enhancement processing is divided into two subtasks (i.e., color enhancement and gradient enhancement) via the proposed coarse enhancement module (CEM) and LoG-based gradient enhancement module (GEM), which are embedded in the encoder-decoder structure. It enables the network to enhance the color and edge features simultaneously. Through the decomposition and reconstruction on both color and gradient domains, our DDNet can restore the detailed feature information concealed by the darkness with better visual quality and efficiency. The evaluation experiments on standard and transportation-related datasets demonstrate that our DDNet provides superior enhancement quality and efficiency compared with the state-of-the-art methods. Besides, the object detection and scene segmentation experiments indicate the practical benefits for higher-level image analysis under low-light environments in ITS.
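The gradient branch is described as LoG-based; a Laplacian-of-Gaussian response is straightforward to compute explicitly, as sketched below. The kernel size and sigma are arbitrary choices and the snippet illustrates the operator itself, not the paper's GEM implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def log_kernel(size=9, sigma=1.5):
    """Discrete Laplacian-of-Gaussian kernel, zero-mean, shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    k = (r2 - 2 * sigma**2) / (sigma**4) * np.exp(-r2 / (2 * sigma**2))
    return torch.tensor(k - k.mean(), dtype=torch.float32)

def log_response(gray):
    """Apply the LoG filter to a (B, 1, H, W) grayscale tensor."""
    k = log_kernel().view(1, 1, 9, 9)
    return F.conv2d(gray, k, padding=4)

edges = log_response(torch.rand(1, 1, 256, 256))
print(edges.shape)  # torch.Size([1, 1, 256, 256])
```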

Reconsidering evaluation practices in modular systems: On the propagation of errors in MRI prostate cancer detection

  • paper_url: http://arxiv.org/abs/2309.08381
  • repo_url: None
  • paper_authors: Erlend Sortland Rolfsnes, Philip Thangngat, Trygve Eftestøl, Tobias Nordström, Fredrik Jäderling, Martin Eklund, Alvaro Fernandez-Quilez
  • for: Studies how errors propagate between modules in AI systems for prostate cancer (PCa) detection on MRI.
  • methods: AI support for radiological assessment via automatic prostate segmentation followed by lesion detection; two segmentation networks (s1 and s2) with heterogeneous performance are compared against an idealistic setting.
  • results: Detection results depend on the upstream segmentation network and differ from the idealistic setting (s1: 89.90+-2.23 vs 88.97+-3.06 ncsPCa, P<.001; 89.30+-4.07 vs 88.12+-2.71 csPCa, P<.001), underscoring the need to evaluate the whole system rather than a single module.
    Abstract Magnetic resonance imaging has evolved as a key component for prostate cancer (PCa) detection, substantially increasing the radiologist workload. Artificial intelligence (AI) systems can support radiological assessment by segmenting and classifying lesions in clinically significant (csPCa) and non-clinically significant (ncsPCa). Commonly, AI systems for PCa detection involve an automatic prostate segmentation followed by the lesion detection using the extracted prostate. However, evaluation reports are typically presented in terms of detection under the assumption of the availability of a highly accurate segmentation and an idealistic scenario, omitting the propagation of errors between modules. For that purpose, we evaluate the effect of two different segmentation networks (s1 and s2) with heterogeneous performances in the detection stage and compare it with an idealistic setting (s1:89.90+-2.23 vs 88.97+-3.06 ncsPCa, P<.001, 89.30+-4.07 and 88.12+-2.71 csPCa, P<.001). Our results depict the relevance of a holistic evaluation, accounting for all the sub-modules involved in the system.

Beyond Domain Gap: Exploiting Subjectivity in Sketch-Based Person Retrieval

  • paper_url: http://arxiv.org/abs/2309.08372
  • repo_url: https://github.com/lin-kayla/subjectivity-sketch-reid
  • paper_authors: Kejun Lin, Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Shin’ichi Satoh
  • for: Targets person re-identification (re-ID) when witness sketches are the only available information.
  • methods: Two novel designs to handle subjectivity: 1) a non-local (NL) fusion module that aggregates sketches from different witnesses of the same identity; 2) an AttrAlign module that uses attributes as an implicit mask to align cross-domain features.
  • results: Leading performance on three benchmarks (large-scale, multi-style, cross-style).
    Abstract Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by cameras and, therefore, needs to be retrieved using subjective information (e.g., sketches from witnesses). Previous research defines this case using the sketch as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. Actually, subjectivity is another significant challenge. We model and investigate it by posing a new dataset with multi-witness descriptions. It features two aspects. 1) Large-scale. It contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style. Our dataset offers multiple sketches for each identity. Witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch styles. We further have two novel designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity. We propose a non-local (NL) fusion module that gathers sketches from different witnesses for the same identity. 2) Introducing objectivity. An AttrAlign module utilizes attributes as an implicit mask to align cross-domain features. To push forward the advance of Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate our leading performance in these benchmarks. Dataset and Codes are publicly available at: https://github.com/Lin-Kayla/subjectivity-sketch-reid

An Efficient Wide-Range Pseudo-3D Vehicle Detection Using A Single Camera

  • paper_url: http://arxiv.org/abs/2309.08369
  • repo_url: None
  • paper_authors: Zhupeng Ye, Yinqi Li, Zejian Yuan
  • for: Presents a new wide-range pseudo-3D vehicle detection method to enable active safety features in intelligent driving systems.
  • methods: Uses a single camera; a spliced image built from two sub-window crops of the high-resolution frame maximizes the use of limited input resolution, and specially designed detection heads output an extended bounding box together with a Side Projection Line (SPL) that captures vehicle shape and pose, trained with a joint box-SPL constraint loss.
  • results: Achieves favorable performance on a self-built dataset across multiple evaluation metrics in terms of accuracy, stability, and prediction quality.
    Abstract Wide-range and fine-grained vehicle detection plays a critical role in enabling active safety features in intelligent driving systems. However, existing vehicle detection methods based on rectangular bounding boxes (BBox) often struggle with perceiving wide-range objects, especially small objects at long distances. And BBox expression cannot provide detailed geometric shape and pose information of vehicles. This paper proposes a novel wide-range Pseudo-3D Vehicle Detection method based on images from a single camera and incorporates efficient learning methods. This model takes a spliced image as input, which is obtained by combining two sub-window images from a high-resolution image. This image format maximizes the utilization of limited image resolution to retain essential information about wide-range vehicle objects. To detect pseudo-3D objects, our model adopts specifically designed detection heads. These heads simultaneously output extended BBox and Side Projection Line (SPL) representations, which capture vehicle shapes and poses, enabling high-precision detection. To further enhance the performance of detection, a joint constraint loss combining both the object box and SPL is designed during model training, improving the efficiency, stability, and prediction accuracy of the model. Experimental results on our self-built dataset demonstrate that our model achieves favorable performance in wide-range pseudo-3D vehicle detection across multiple evaluation metrics. Our demo video has been placed at https://www.youtube.com/watch?v=1gk1PmsQ5Q8.

Robust Burned Area Delineation through Multitask Learning

  • paper_url: http://arxiv.org/abs/2309.08368
  • repo_url: https://github.com/links-ads/burned-area-seg
  • paper_authors: Edoardo Arnaudo, Luca Barco, Matteo Merlo, Claudio Rossi
  • for: Accurate delineation of wildfire burned areas for environmental monitoring and post-fire assessment.
  • methods: A multitask learning framework that adds land cover segmentation as an auxiliary task, trained on an ad-hoc dataset combining Sentinel-2 feeds with Copernicus activations and other sources, annotated for both burned area delineation and land cover.
  • results: Demonstrates the effectiveness of the multitask approach compared with standard binary segmentation across models including UPerNet and SegFormer.
    Abstract In recent years, wildfires have posed a significant challenge due to their increasing frequency and severity. For this reason, accurate delineation of burned areas is crucial for environmental monitoring and post-fire assessment. However, traditional approaches relying on binary segmentation models often struggle to achieve robust and accurate results, especially when trained from scratch, due to limited resources and the inherent imbalance of this segmentation task. We propose to address these limitations in two ways: first, we construct an ad-hoc dataset to cope with the limited resources, combining information from Sentinel-2 feeds with Copernicus activations and other data sources. In this dataset, we provide annotations for multiple tasks, including burned area delineation and land cover segmentation. Second, we propose a multitask learning framework that incorporates land cover classification as an auxiliary task to enhance the robustness and performance of the burned area segmentation models. We compare the performance of different models, including UPerNet and SegFormer, demonstrating the effectiveness of our approach in comparison to standard binary segmentation.
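The multitask setup described above can be expressed as a weighted sum of a binary burned-area loss and an auxiliary multi-class land-cover loss. The sketch below assumes a BCE plus cross-entropy combination and an auxiliary weight of 0.4; both are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultitaskSegLoss(nn.Module):
    """Burned-area (binary) loss plus an auxiliary land-cover (multi-class) loss."""

    def __init__(self, aux_weight=0.4):
        super().__init__()
        self.burned = nn.BCEWithLogitsLoss()
        self.landcover = nn.CrossEntropyLoss()
        self.aux_weight = aux_weight

    def forward(self, burned_logits, burned_gt, lc_logits, lc_gt):
        # burned_logits: (B, 1, H, W), burned_gt: (B, 1, H, W) in {0, 1}
        # lc_logits: (B, K, H, W), lc_gt: (B, H, W) with class indices
        main = self.burned(burned_logits, burned_gt.float())
        aux = self.landcover(lc_logits, lc_gt)
        return main + self.aux_weight * aux

criterion = MultitaskSegLoss()
loss = criterion(torch.randn(2, 1, 64, 64), torch.randint(0, 2, (2, 1, 64, 64)),
                 torch.randn(2, 8, 64, 64), torch.randint(0, 8, (2, 64, 64)))
print(loss.item())
```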

Continual Learning with Deep Streaming Regularized Discriminant Analysis

  • paper_url: http://arxiv.org/abs/2309.08353
  • repo_url: https://github.com/sonycslparis/deep_srda
  • paper_authors: Joe Khawand, Peter Hanappe, David Colliaux
  • for: Bring continual learning closer to real-world machine learning applications by enabling a more human-like, streaming form of learning.
  • methods: Proposes a streaming version of regularized discriminant analysis, combined with a convolutional neural network.
  • results: Outperforms both batch learning and existing streaming learning algorithms on the ImageNet ILSVRC-2012 dataset.
    Abstract Continual learning is increasingly sought after in real world machine learning applications, as it enables learning in a more human-like manner. Conventional machine learning approaches fail to achieve this, as incrementally updating the model with non-identically distributed data leads to catastrophic forgetting, where existing representations are overwritten. Although traditional continual learning methods have mostly focused on batch learning, which involves learning from large collections of labeled data sequentially, this approach is not well-suited for real-world applications where we would like new data to be integrated directly. This necessitates a paradigm shift towards streaming learning. In this paper, we propose a streaming version of regularized discriminant analysis as a solution to this challenge. We combine our algorithm with a convolutional neural network and demonstrate that it outperforms both batch learning and existing streaming learning algorithms on the ImageNet ILSVRC-2012 dataset.
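Streaming discriminant analysis maintains running class means and a shared covariance that are updated one example at a time on top of fixed deep features, with shrinkage regularization at prediction time. The class below is a minimal sketch of that idea; the update rules and the shrinkage value are assumptions, not the paper's exact algorithm.

```python
import numpy as np

class StreamingRDA:
    """Streaming regularized discriminant analysis over fixed feature vectors."""

    def __init__(self, dim, n_classes, shrinkage=1e-2):
        self.means = np.zeros((n_classes, dim))
        self.counts = np.zeros(n_classes)
        self.cov = np.eye(dim)              # running shared covariance
        self.seen = 0
        self.shrinkage = shrinkage

    def fit_one(self, x, y):
        # update the class mean with a running average
        self.counts[y] += 1
        self.means[y] += (x - self.means[y]) / self.counts[y]
        # update the shared covariance with the centred sample
        self.seen += 1
        d = (x - self.means[y])[:, None]
        self.cov += (d @ d.T - self.cov) / self.seen

    def predict(self, x):
        # regularize the covariance by shrinking towards the identity
        cov = (1 - self.shrinkage) * self.cov + self.shrinkage * np.eye(len(x))
        prec = np.linalg.inv(cov)
        diffs = self.means - x                                # (C, dim)
        scores = -np.einsum('cd,de,ce->c', diffs, prec, diffs)  # -Mahalanobis^2
        return int(np.argmax(scores))

# toy usage on random 2-class features
rng = np.random.default_rng(0)
clf = StreamingRDA(dim=16, n_classes=2)
for _ in range(200):
    y = int(rng.integers(0, 2))
    clf.fit_one(rng.normal(loc=y * 1.5, size=16), y)
print(clf.predict(rng.normal(loc=1.5, size=16)))  # likely predicts class 1
```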

T-UDA: Temporal Unsupervised Domain Adaptation in Sequential Point Clouds

  • paper_url: http://arxiv.org/abs/2309.08302
  • repo_url: https://github.com/ctu-vras/t-uda
  • paper_authors: Awet Haileslassie Gebrehiwot, David Hurych, Karel Zimmermann, Patrick Pérez, Tomáš Svoboda
  • for: Improve the reliability of 3D semantic segmentation models for driving scenes under domain shifts across geographic regions, sensors, and mounting positions in an open-world setting.
  • methods: Temporal UDA (T-UDA): a new domain adaptation method that combines temporal and cross-sensor geometric consistency of the input data with the mean teacher approach.
  • results: Large performance gains for 3D semantic segmentation on Waymo Open Dataset, nuScenes, and SemanticKITTI, for two popular point cloud architectures (Cylinder3D and MinkowskiNet).
    Abstract Deep perception models have to reliably cope with an open-world setting of domain shifts induced by different geographic regions, sensor properties, mounting positions, and several other reasons. Since covering all domains with annotated data is technically intractable due to the endless possible variations, researchers focus on unsupervised domain adaptation (UDA) methods that adapt models trained on one (source) domain with annotations available to another (target) domain for which only unannotated data are available. Current predominant methods either leverage semi-supervised approaches, e.g., teacher-student setup, or exploit privileged data, such as other sensor modalities or temporal data consistency. We introduce a novel domain adaptation method that leverages the best of both trends. Our approach combines input data's temporal and cross-sensor geometric consistency with the mean teacher method. Dubbed T-UDA for "temporal UDA", such a combination yields massive performance gains for the task of 3D semantic segmentation of driving scenes. Experiments are conducted on Waymo Open Dataset, nuScenes and SemanticKITTI, for two popular 3D point cloud architectures, Cylinder3D and MinkowskiNet. Our codes are publicly available at https://github.com/ctu-vras/T-UDA.
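The mean-teacher ingredient keeps an exponential moving average (EMA) of the student's weights and uses the teacher to generate targets on the unlabeled target domain. The update below is the standard mean-teacher EMA rule, shown as a generic sketch rather than the T-UDA code; the momentum value is an assumption.

```python
import copy
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.999):
    """EMA update: teacher = m * teacher + (1 - m) * student, parameter-wise."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)

# toy usage with a small stand-in for a segmentation head
student = torch.nn.Linear(32, 10)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# ... after each optimizer.step() on the student:
update_teacher(student, teacher)
```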

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

  • paper_url: http://arxiv.org/abs/2309.08295
  • repo_url: None
  • paper_authors: Ilya Gurvich, Ido Leichter, Dharmendar Reddy Palle, Yossi Asher, Alon Vinnikov, Igor Abramovski, Vishak Gopal, Ross Cutler, Eyal Krupka
  • for: A real-time, causal, neural-network-based active speaker detection system optimized for low-power edge computing.
  • methods: Uses data from a microphone array and a 360-degree camera; the network needs only 127 MFLOPs per participant for a meeting with 14 participants, and learns to query the available acoustic data given the detected head locations instead of relying on conventional DOA estimation.
  • results: Trained and evaluated on a realistic meetings dataset with up to 14 participants, overlapped speech, and other challenging scenarios; when the computational budget is exhausted, the network degrades gracefully.
    Abstract We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. This system drives a virtual cinematography module and is deployed on a commercial device. The system uses data originating from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant, for a meeting with 14 participants. Unlike previous work, we examine the error rate of our network when the computational budget is exhausted, and find that it exhibits graceful degradation, allowing the system to operate reasonably well even in this case. Departing from conventional DOA estimation approaches, our network learns to query the available acoustic data, considering the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.

Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.08273
  • repo_url: None
  • paper_authors: Ruian He, Zhen Xing, Weimin Tan, Bo Yan
  • for: Proposes an unsupervised facial representation learning method that improves face understanding without relying on large-scale annotated data.
  • methods: An unsupervised disentangling framework using a 3D-aware latent diffusion model: a 3D-aware autoencoder encodes face images into 3D latent embeddings, and a representation diffusion model (RDM) disentangles the latent into facial identity and expression.
  • results: State-of-the-art performance among unsupervised facial representation learning models on facial expression recognition and face verification.
    Abstract Unsupervised learning of facial representations has gained increasing attention for face understanding ability without heavily relying on large-scale annotated datasets. However, it remains unsolved due to the coupling of facial identities, expressions, and external factors like pose and light. Prior methods primarily focus on 2D factors and pixel-level consistency, leading to incomplete disentangling and suboptimal performance in downstream tasks. In this paper, we propose LatentFace, a novel unsupervised disentangling framework for facial expression and identity representation. We suggest the disentangling problem should be performed in latent space and propose the solution using a 3D-ware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model (RDM) to disentangle 3D latent into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition and face verification among unsupervised facial representation learning models.

Edge Based Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2309.08265
  • repo_url: https://github.com/pratishtha-agarwal/Automation-of-Attendance-montoring-system
  • paper_authors: Jianghu Shen, Xiaojun Wu
  • for: Improve the detection accuracy of oriented objects in remote sensing imagery.
  • methods: Uses oriented bounding boxes (OBB); designs a new loss function based on edge gradients, inspired by the similarity measurement function used in template matching, and adds an edge-based self-attention module so the network focuses more on object edges.
  • results: The proposed loss yields a 0.6% mAP improvement over the Smooth L1 loss baseline, and together with the edge self-attention module achieves a 1.3% mAP increase on the DOTA dataset.
    Abstract In the field of remote sensing, we often utilize oriented bounding boxes (OBB) to bound the objects. This approach significantly reduces the overlap among dense detection boxes and minimizes the inclusion of background content within the bounding boxes. To enhance the detection accuracy of oriented objects, we propose a unique loss function based on edge gradients, inspired by the similarity measurement function used in template matching task. During this process, we address the issues of non-differentiability of the function and the semantic alignment between gradient vectors in ground truth (GT) boxes and predicted boxes (PB). Experimental results show that our proposed loss function achieves $0.6\%$ mAP improvement compared to the commonly used Smooth L1 loss in the baseline algorithm. Additionally, we design an edge-based self-attention module to encourage the detection network to focus more on the object edges. Leveraging these two innovations, we achieve a mAP increase of 1.3% on the DOTA dataset.

Leveraging the Power of Data Augmentation for Transformer-based Tracking

  • paper_url: http://arxiv.org/abs/2309.08264
  • repo_url: None
  • paper_authors: Jie Zhao, Johan Edstedt, Michael Felsberg, Dong Wang, Huchuan Lu
  • for: Investigates the impact of data augmentation on transformer-based visual object tracking.
  • methods: Systematic experiments on general augmentations, plus two tracking-specific augmentations: an optimized random cropping with a dynamic search-radius mechanism and boundary-sample simulation, and a token-level feature mixing strategy that hardens the model against challenges such as background interference.
  • results: Experiments on two transformer-based trackers and six benchmarks show improved effectiveness and data efficiency, especially in challenging settings such as one-shot tracking and small image resolutions.
    Abstract Due to long-distance correlation and powerful pretrained models, transformer-based methods have initiated a breakthrough in visual object tracking performance. Previous works focus on designing effective architectures suited for tracking, but ignore that data augmentation is equally crucial for training a well-performing model. In this paper, we first explore the impact of general data augmentations on transformer-based trackers via systematic experiments, and reveal the limited effectiveness of these common strategies. Motivated by experimental observations, we then propose two data augmentation methods customized for tracking. First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples. Second, we propose a token-level feature mixing augmentation strategy, which enables the model against challenges like background interference. Extensive experiments on two transformer-based trackers and six benchmarks demonstrate the effectiveness and data efficiency of our methods, especially under challenging settings, like one-shot tracking and small image resolutions.

BROW: Better featuRes fOr Whole slide image based on self-distillation

  • paper_url: http://arxiv.org/abs/2309.08259
  • repo_url: None
  • paper_authors: Yuanfeng Wu, Shaojie Li, Zhiqiang Du, Wentao Zhu
  • for: This paper aims to propose a foundation model for extracting better feature representations for whole slide images (WSIs) in clinical diagnosis.
  • methods: The proposed model, called BROW, uses a transformer architecture and is pretrained using a self-distillation framework. It also employs techniques such as patch shuffling to improve the model’s robustness.
  • results: The proposed model achieves high performance on various downstream tasks, including slide-level subtyping, patch-level classification, and nuclei instance segmentation. The results confirm the efficacy, robustness, and good generalization ability of the model, making it a promising foundation model for WSI feature extraction.
    Abstract Whole slide image (WSI) processing is becoming part of the key components of standard clinical diagnosis for various diseases. However, the direct application of conventional image processing algorithms to WSI faces certain obstacles because of WSIs' distinct property: the super-high resolution. The performance of most WSI-related tasks relies on the efficacy of the backbone which extracts WSI patch feature representations. Hence, we proposed BROW, a foundation model for extracting better feature representations for WSIs, which can be conveniently adapted to downstream tasks without or with slight fine-tuning. The model takes transformer architecture, pretrained using self-distillation framework. To improve model's robustness, techniques such as patch shuffling have been employed. Additionally, the model leverages the unique properties of WSIs, utilizing WSI's multi-scale pyramid to incorporate an additional global view, thereby further enhancing its performance. We used both private and public data to make up a large pretraining dataset, containing more than 11000 slides, over 180M extracted patches, encompassing WSIs related to various organs and tissues. To assess the effectiveness of BROW, we run a wide range of downstream tasks, including slide-level subtyping, patch-level classification and nuclei instance segmentation. The results confirmed the efficacy, robustness and good generalization ability of the proposed model. This substantiates its potential as foundation model for WSI feature extraction and highlights promising prospects for its application in WSI processing.
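Patch shuffling is mentioned as one of the robustness techniques; a simple way to realize it is to split each image into a grid of patches and permute them, as sketched below. The grid size is an assumption and the exact augmentation used for pretraining may differ.

```python
import torch

def patch_shuffle(images, grid=4):
    """Randomly permute a grid x grid arrangement of patches in each image.

    images: (B, C, H, W) with H and W divisible by `grid`.
    """
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    # (B, C, grid, ph, grid, pw) -> (B, grid*grid, C, ph, pw)
    patches = (images.reshape(b, c, grid, ph, grid, pw)
                     .permute(0, 2, 4, 1, 3, 5)
                     .reshape(b, grid * grid, c, ph, pw))
    perm = torch.randperm(grid * grid)
    patches = patches[:, perm]
    # invert the reshape to get shuffled images back as (B, C, H, W)
    return (patches.reshape(b, grid, grid, c, ph, pw)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(b, c, h, w))

x = torch.arange(2 * 3 * 8 * 8, dtype=torch.float32).reshape(2, 3, 8, 8)
print(patch_shuffle(x, grid=2).shape)  # torch.Size([2, 3, 8, 8])
```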

Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

  • paper_url: http://arxiv.org/abs/2309.08251
  • repo_url: None
  • paper_authors: Feihong He, Gang Li, Lingyu Si, Leilei Yan, Shimeng Hou, Hongwei Dong, Fanzhang Li
  • for: Proposes a training-free image cartoonization method based on diffusion transformer models.
  • methods: Decomposes the reverse diffusion process into a semantic generation phase and a detail generation phase, and performs cartoonization by normalizing the high-frequency signal of the noisy image at specific denoising steps.
  • results: Extensive experiments show strong cartoonization ability without additional reference images, complex model designs, or tedious tuning of multiple parameters.
    Abstract Image cartoonization has attracted significant interest in the field of image generation. However, most of the existing image cartoonization techniques require re-training models using images of cartoon style. In this paper, we present CartoonDiff, a novel training-free sampling approach which generates image cartoonization using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into the semantic generation phase and the detail generation phase. Furthermore, we implement the image cartoonization process by normalizing high-frequency signal of the noisy image in specific denoising steps. CartoonDiff doesn't require any additional reference images, complex model designs, or the tedious adjustment of multiple parameters. Extensive experimental results show the powerful ability of our CartoonDiff. The project page is available at: https://cartoondiff.github.io/

Optimization of Rank Losses for Image Retrieval

  • paper_url: http://arxiv.org/abs/2309.08250
  • repo_url: https://github.com/cvdfoundation/google-landmark
  • paper_authors: Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome
  • for: This paper focuses on improving the training of deep neural networks for image retrieval tasks using a new framework for robust and decomposable rank losses optimization.
  • methods: The proposed framework includes a general surrogate for ranking operators called SupRank, which provides an upperbound for rank losses and ensures robust training, as well as a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set.
  • results: The authors apply their framework to two standard metrics for image retrieval (AP and R@k) and introduce an extension of AP called hierarchical average precision $\mathcal{H}$-AP, which is optimized as well. Additionally, they create the first hierarchical landmarks retrieval dataset using a semi-automatic pipeline to create hierarchical labels, and release the code at https://github.com/elias-ramzi/SupRank.
    Abstract In image retrieval, standard evaluation metrics rely on score ranking, \eg average precision (AP), recall at k (R@k), normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable rank losses optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. Firstly we propose a general surrogate for ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upperbound for rank losses and ensures robust training. Secondly, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at https://github.com/cvdfoundation/google-landmark. Code will be released at https://github.com/elias-ramzi/SupRank.
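Rank-based metrics such as AP are non-differentiable because the rank of an item is a sum of Heaviside steps; replacing the step with a temperature-scaled sigmoid yields approximate ranks that can be backpropagated, which is the general idea behind surrogates of this kind. The function below is a generic smooth-AP sketch for illustration, not the SupRank surrogate itself; the temperature is an assumption.

```python
import torch

def smooth_ap(scores, labels, temperature=0.01):
    """Differentiable approximation of Average Precision for one query.

    scores: (N,) similarity scores of the candidates.
    labels: (N,) 1 for relevant candidates, 0 otherwise.
    The indicator 1[s_j > s_i] is replaced by sigmoid((s_j - s_i) / T).
    """
    diff = (scores.unsqueeze(0) - scores.unsqueeze(1)) / temperature  # diff[i, j] = (s_j - s_i) / T
    smooth_greater = torch.sigmoid(diff)
    smooth_greater = smooth_greater - torch.diag(torch.diag(smooth_greater))  # drop self-comparisons
    rank_all = 1.0 + smooth_greater.sum(dim=1)                   # approximate rank of each item
    pos = labels.bool()
    rank_pos = 1.0 + smooth_greater[:, pos][pos].sum(dim=1)      # rank among positives only
    ap = (rank_pos / rank_all[pos]).mean()
    return 1.0 - ap                                              # loss to minimize

scores = torch.tensor([0.9, 0.2, 0.75, 0.1], requires_grad=True)
labels = torch.tensor([1, 0, 1, 0])
loss = smooth_ap(scores, labels)
loss.backward()
print(loss.item(), scores.grad)
```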

A Real-time Faint Space Debris Detector With Learning-based LCM

  • paper_url: http://arxiv.org/abs/2309.08244
  • repo_url: None
  • paper_authors: Zherui Lu, Gangyi Wang, Xinguo Wei, Jian Li
  • for: Improve the sensitivity and efficiency of space situational awareness (SSA) systems in the face of the growing space debris population.
  • methods: A low-SNR streak extraction method based on local contrast, which returns connected components as a crude classification, followed by maximum likelihood estimation (MLE) that reconstructs target components via oriented growth.
  • results: Verified on both simulated streaks and real star tracker images; detects space objects with SNR 2.0 efficiently, with an average centroid error close to state-of-the-art methods such as ODCC while offering significant efficiency advantages.
    Abstract With the development of aerospace technology, the increasing population of space debris has posed a great threat to the safety of spacecraft. However, the low intensity of reflected light and high angular velocity of space debris impede the extraction. Besides, due to the limitations of the ground observation methods, small space debris can hardly be detected, making it necessary to enhance the spacecraft's capacity for space situational awareness (SSA). Considering that traditional methods have some defects in low-SNR target detection, such as low effectiveness and large time consumption, this paper proposes a method for low-SNR streak extraction based on local contrast and maximum likelihood estimation (MLE), which can detect space objects with SNR 2.0 efficiently. In the proposed algorithm, local contrast will be applied for crude classifications, which will return connected components as preliminary results, and then MLE will be performed to reconstruct the connected components of targets via orientated growth, further improving the precision. The algorithm has been verified with both simulated streaks and real star tracker images, and the average centroid error of the proposed algorithm is close to the state-of-the-art method like ODCC. At the same time, the algorithm in this paper has significant advantages in efficiency compared with ODCC. In conclusion, the algorithm in this paper is of high speed and precision, which guarantees its promising applications in the extraction of high dynamic targets.

Human-Inspired Topological Representations for Visual Object Recognition in Unseen Environments

  • paper_url: http://arxiv.org/abs/2309.08239
  • repo_url: None
  • paper_authors: Ekta U. Samani, Ashis G. Banerjee
  • for: Improve visual object recognition accuracy for mobile robots in unseen and cluttered indoor environments.
  • methods: The TOPS2 descriptor and the accompanying THOR2 recognition framework, inspired by the human reasoning mechanism of object unity; color embeddings obtained with the Mapper algorithm for topological soft clustering are interleaved with the shape-based TOPS descriptor.
  • results: THOR2, trained on synthetic data, achieves substantially higher recognition accuracy than the shape-based THOR framework and outperforms RGB-D ViT on two real-world datasets (the benchmark OCID dataset and the UW-IS Occluded dataset), a promising step toward robust recognition on low-cost robots.
    Abstract Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. Toward this goal, we extend our previous work to propose the TOPS2 descriptor, and an accompanying recognition framework, THOR2, inspired by a human reasoning mechanism known as object unity. We interleave color embeddings obtained using the Mapper algorithm for topological soft clustering with the shape-based TOPS descriptor to obtain the TOPS2 descriptor. THOR2, trained using synthetic data, achieves substantially higher recognition accuracy than the shape-based THOR framework and outperforms RGB-D ViT on two real-world datasets: the benchmark OCID dataset and the UW-IS Occluded dataset. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots.
    摘要 在未见过且杂乱的室内环境中进行视觉物体识别,对移动机器人而言是一个具有挑战性的问题。为此,我们在先前工作的基础上提出了 TOPS2 描述符及配套的识别框架 THOR2,其灵感来自人类的"物体统一性(object unity)"推理机制。我们将利用 Mapper 算法进行拓扑软聚类得到的颜色嵌入与基于形状的 TOPS 描述符交织融合,得到 TOPS2 描述符。THOR2 仅使用合成数据训练,在两个真实世界数据集(基准 OCID 数据集和 UW-IS Occluded 数据集)上取得了明显高于基于形状的 THOR 框架的识别精度,并优于 RGB-D ViT。因此,THOR2 是在低成本机器人上实现鲁棒识别的有希望的一步。

Efficient Polyp Segmentation Via Integrity Learning

  • paper_url: http://arxiv.org/abs/2309.08234
  • repo_url: None
  • paper_authors: Ziqiang Chen, Kang Wang, Yun Liu
  • for: 该研究旨在提高结肠镜检查中息肉分割的完整性与准确性,以辅助诊断、指导介入与治疗。
  • methods: 该研究提出了名为 Integrity Capturing Polyp Segmentation (IC-PolypSeg) 的网络,采用轻量级骨干和三个关键组件来缓解完整性缺失问题:1)像素级特征重分配模块(PFR);2)跨阶段像素级特征重分配模块(CPFR);3)粗到细校准模块(PFR 的示意代码见摘要后)。
  • results: 在 5 个公开数据集上与 8 种现有方法的比较表明,IC-PolypSeg 在精度和计算效率两方面均有显著优势:IC-PolypSeg-EF0 的参数量约为 PraNet 的 1/300,实时处理速度达 235 FPS,并降低了假阴性率。
    Abstract Accurate polyp delineation in colonoscopy is crucial for assisting in diagnosis, guiding interventions, and treatments. However, current deep-learning approaches fall short due to integrity deficiency, which often manifests as missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network utilizes lightweight backbones and 3 key components for integrity ameliorating: 1) Pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features. 2) Cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information. 3) Coarse-to-fine calibration module combines PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on 5 public datasets demonstrate that the proposed IC-PolypSeg outperforms 8 state-of-the-art methods in terms of higher precision and significantly improved computational efficiency with lower computational consumption. IC-PolypSeg-EF0 employs 300 times fewer parameters than PraNet while achieving a real-time processing speed of 235 FPS. Importantly, IC-PolypSeg reduces the false negative ratio on five datasets, meeting clinical requirements.
    摘要 在结肠镜检查中对息肉进行精确分割,对辅助诊断、指导介入和治疗至关重要。然而,现有深度学习方法普遍存在完整性缺失问题,常表现为病灶部分缺失。本文在宏观与微观两个层面引入完整性概念:模型在宏观层面应区分完整的息肉,在微观层面应识别息肉内部的全部组成部分。我们提出的 IC-PolypSeg 网络采用轻量级骨干和三个关键组件来改善完整性:1)像素级特征重分配模块(PFR)在语义丰富的编码器末端特征上捕获跨通道的全局空间相关性;2)跨阶段像素级特征重分配模块(CPFR)动态融合高层语义与低层空间特征以捕获上下文信息;3)粗到细校准模块结合 PFR 与 CPFR 实现精确的边界检测。在 5 个公开数据集上的大量实验表明,IC-PolypSeg 在精度上优于 8 种现有方法,且计算开销显著更低:IC-PolypSeg-EF0 的参数量约为 PraNet 的 1/300,实时处理速度达 235 FPS。更重要的是,IC-PolypSeg 在五个数据集上降低了假阴性率,满足临床要求。
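下面给出一个 PFR 思路的示意性 PyTorch 片段:利用通道间的全局亲和度对编码器特征做一次重分配。这只是按摘要描述所作的草图,具体结构(归一化方式、投影层等)均为假设,并非 IC-PolypSeg 的官方实现。

```python
import torch
import torch.nn as nn

class PixelFeatureRedistribution(nn.Module):
    """Sketch of a PFR-style block: redistributes encoder features using
    channel-wise affinities computed over all spatial positions, so global
    spatial correlations across channels can influence every pixel."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)              # (B, C, HW)
        affinity = torch.softmax(
            flat @ flat.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        redistributed = (affinity @ flat).view(b, c, h, w)
        return x + self.proj(redistributed)     # residual redistribution
```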

UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

  • paper_url: http://arxiv.org/abs/2309.08220
  • repo_url: None
  • paper_authors: Junwen Xiong, Peng Zhang, Chuanyue Li, Wei Huang, Yufei Zha, Tao You
  • for: 本研究旨在构建一个统一的视频显著性建模框架,将视频显著性预测与视频显著目标检测两类任务融合在一起。
  • methods: 本研究提出了一个显著性感知的 transformer 结构,在逐步提升的分辨率下学习时空表示,并融合跨尺度显著性信息以获得更鲁棒的表示;此外还为每个任务设计了任务特定的解码器以进行最终预测。
  • results: 实验结果显示,所提出的 UniST 模型在两类任务的七个具有挑战性的 benchmark 上均表现出色,明显优于其他现有方法。
    Abstract Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention akin to how humans perceiving dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection tasks, few attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges both these distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn the spatio-temporal representations at progressively increased resolutions, while incorporating effective cross-scale saliency information to produce a robust representation. Furthermore, a task-specific decoder is proposed to perform the final prediction for each task. To the best of our knowledge, this is the first work that explores designing a transformer structure for both saliency modeling tasks. Convincible experiments demonstrate that the proposed UniST achieves superior performance across seven challenging benchmarks for two tasks, and significantly outperforms the other state-of-the-art methods.
    摘要 视频显著性预测与检测是蓬勃发展的研究方向,它们使计算机能够像人类观察动态场景那样模拟视觉注意力的分布。虽然已有许多方法分别为视频显著性预测或视频显著目标检测设计了任务特定的训练范式,但鲜有工作致力于构建能够无缝衔接这两类任务的通用显著性建模框架。本文提出统一显著性 transformer(UniST)框架,充分利用这两类任务的共性特征:在提取帧序列表示的基础上,设计了显著性感知的 transformer,在逐步提升的分辨率下学习时空表示,并引入有效的跨尺度显著性信息以得到鲁棒的表示;随后由任务特定的解码器完成各任务的最终预测。据我们所知,这是首个为两类显著性建模任务设计 transformer 结构的工作。充分的实验表明,UniST 在两类任务的七个具有挑战性的 benchmark 上均取得了优越的性能,显著超越其他现有方法。

Salient Object Detection in Optical Remote Sensing Images Driven by Transformer

  • paper_url: http://arxiv.org/abs/2309.08206
  • repo_url: https://github.com/mathlee/gelenet
  • paper_authors: Gongyang Li, Zhen Bai, Zhi Liu, Xinpeng Zhang, Haibin Ling
  • for: 本研究提出了一种全新的全局提取局部探索网络(GeleNet),用于光学遥感图像中的显著目标检测(ORSI-SOD)。
  • methods: GeleNet 采用 transformer 骨干生成具有全局长程依赖的四级特征嵌入,并使用方向感知混洗加权空间注意力模块(D-SWSAM)及其简化版 SWSAM 增强局部交互,再通过知识迁移模块(KTM)进一步增强跨层级的上下文交互(D-SWSAM 的示意代码见摘要后)。
  • results: 在三个公开数据集上的大量实验表明,所提出的 GeleNet 优于相关的最新方法,能够更好地检测光学遥感图像中的显著目标。
    Abstract Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD following the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM comprehensively perceives the orientation information in the lowest-level features through directional convolutions to adapt to various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of two middle-level features of different scales based on the self-attention mechanism, and transfers the knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor is used to generate the saliency map based on the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/GeleNet.
    摘要 现有的光学遥感图像显著目标检测(ORSI-SOD)方法大多采用 VGG、ResNet 等卷积神经网络(CNN)作为骨干。由于 CNN 只能在一定感受野内提取特征,多数 ORSI-SOD 方法通常遵循"由局部到上下文"的范式。本文提出一种新颖的全局提取局部探索网络(GeleNet),按照"由全局到局部"的范式进行 ORSI-SOD。具体而言,GeleNet 首先采用 transformer 骨干生成具有全局长程依赖的四级特征嵌入;随后利用方向感知混洗加权空间注意力模块(D-SWSAM)及其简化版 SWSAM 增强局部交互,并利用知识迁移模块(KTM)进一步增强跨层级的上下文交互。D-SWSAM 通过方向卷积全面感知最低层特征中的方向信息,以适应 ORSI 中显著目标的各种朝向,并借助改进的注意力机制有效增强显著目标的细节;SWSAM 去掉了 D-SWSAM 中的方向感知部分,专注于在最高层特征中定位显著目标;KTM 基于自注意力机制对两个不同尺度的中层特征之间的上下文相关知识进行建模,并将该知识迁移到原始特征,生成更具判别力的特征。最后,显著性预测器基于上述三个模块的输出生成显著图。在三个公开数据集上的大量实验表明,GeleNet 优于相关的最新方法。代码与结果见 https://github.com/MathLee/GeleNet。
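下面是一个示意性的 PyTorch 片段,演示"方向卷积 + 空间注意力"这一思路:用非对称卷积核感知水平/垂直方向信息,再汇聚成空间注意力图。卷积核大小、混洗与加权细节均为假设,并非 GeleNet 中 D-SWSAM 的官方实现。

```python
import torch
import torch.nn as nn

class DirectionalSpatialAttention(nn.Module):
    """Illustrative D-SWSAM-like block: directional (asymmetric) depthwise
    convolutions perceive orientation cues, and their fused response is
    turned into a spatial attention map applied back onto the features."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.h_conv = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.v_conv = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
        self.d_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.to_attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        directional = self.h_conv(x) + self.v_conv(x) + self.d_conv(x)
        attn = torch.sigmoid(self.to_attn(directional))   # (B, 1, H, W) spatial attention
        return x * attn + x
```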

One-stage Modality Distillation for Incomplete Multimodal Learning

  • paper_url: http://arxiv.org/abs/2309.08204
  • repo_url: None
  • paper_authors: Shicai Wei, Yang Luo, Chunbo Luo
  • for: Addresses the challenge of inferring with incomplete modality in multimodal learning.
  • methods: Proposes a one-stage modality distillation framework that combines privileged knowledge transfer and modality information fusion via multi-task learning.
  • results: Achieves state-of-the-art performance on RGB-D classification and segmentation tasks despite incomplete modality input in various scenes.
    Abstract Learning based on multimodal data has attracted increasing interest recently. While a variety of sensory modalities can be collected for training, not all of them are always available in development scenarios, which raises the challenge to infer with incomplete modality. To address this issue, this paper presents a one-stage modality distillation framework that unifies the privileged knowledge transfer and modality information fusion into a single optimization procedure via multi-task learning. Compared with the conventional modality distillation that performs them independently, this helps to capture the valuable representation that can assist the final model inference directly. Specifically, we propose the joint adaptation network for the modality transfer task to preserve the privileged information. This addresses the representation heterogeneity caused by input discrepancy via the joint distribution adaptation. Then, we introduce the cross translation network for the modality fusion task to aggregate the restored and available modality features. It leverages the parameters-sharing strategy to capture the cross-modal cues explicitly. Extensive experiments on RGB-D classification and segmentation tasks demonstrate the proposed multimodal inheritance framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
    摘要 基于多模态数据的学习近来受到越来越多的关注。虽然训练时可以采集多种感知模态,但在部署场景中并非所有模态都始终可用,这带来了在模态缺失条件下进行推理的挑战。为此,本文提出一种单阶段模态蒸馏框架,通过多任务学习将特权知识迁移与模态信息融合统一到同一个优化过程中。与将二者分开执行的传统模态蒸馏相比,这有助于捕获能直接辅助最终模型推理的有价值表示。具体而言,我们为模态迁移任务提出联合适配网络,通过联合分布适配来缓解输入差异导致的表示异质性,从而保留特权信息;并为模态融合任务引入交叉翻译网络,利用参数共享策略显式捕获跨模态线索,聚合恢复出的模态特征与可用模态特征。在 RGB-D 分类与分割任务上的大量实验表明,所提出的多模态继承框架能够在多种场景下克服模态输入不完整的问题,并取得最先进的性能。
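下面用一个简短的 PyTorch 片段示意"单阶段"思路:任务损失与特权知识迁移损失在同一次优化中联合最小化。其中 restored_feat、privileged_feat 以及权重 alpha、beta 均为示意性假设,并非论文的具体损失形式。

```python
import torch
import torch.nn.functional as F

def one_stage_distillation_loss(logits, labels, restored_feat, privileged_feat,
                                alpha=1.0, beta=0.5):
    """Sketch of the one-stage idea: the downstream task loss and the
    privileged-knowledge transfer loss are optimized jointly (multi-task)
    rather than in two separate stages. `restored_feat` is hallucinated from
    the available modality, `privileged_feat` comes from the full-modality
    teacher; names and weights are illustrative assumptions."""
    task_loss = F.cross_entropy(logits, labels)
    transfer_loss = F.mse_loss(restored_feat, privileged_feat.detach())
    return alpha * task_loss + beta * transfer_loss
```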

Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2309.08197
  • repo_url: https://github.com/orhan-t/sm-cnn
  • paper_authors: Orhan Torun, Seniha Esen Yuksel, Erkut Erdem, Nevrez Imamoglu, Aykut Erdem
  • for: 本研究旨在提出一种利用相关光谱-空间信息的自调制高光谱图像去噪方法,以提升现有方法在真实复杂噪声下的性能。
  • methods: 该方法的核心是一种新的光谱自调制残差块(SSMRB),它能够根据相邻光谱数据自适应地变换输入高光谱图像的特征,使去噪网络在处理每幅输入图像时都能依据其空间-光谱特性动态调整预测特征。
  • results: 实验表明,所提出的 SM-CNN 在公开基准数据集上无论定量还是定性都优于其他最新的高光谱图像去噪方法。
    Abstract Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Hence, hyperspectral image denoising has attracted lots of attention by the community lately. While recent deep HSI denoising methods have provided effective solutions, their performance under real-life complex noise remains suboptimal, as they lack adaptability to new data. To overcome these limitations, in our work, we introduce a self-modulating convolutional neural network which we refer to as SM-CNN, which utilizes correlated spectral and spatial information. At the core of the model lies a novel block, which we call spectral self-modulating residual block (SSMRB), that allows the network to transform the features in an adaptive manner based on the adjacent spectral data, enhancing the network's ability to handle complex noise. In particular, the introduction of SSMRB transforms our denoising network into a dynamic network that adapts its predicted features while denoising every input HSI with respect to its spatio-spectral characteristics. Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods both quantitatively and qualitatively on public benchmark datasets.
    摘要 与自然图像相比,高光谱图像(HSI)包含大量波段,每个波段捕获特定波长(甚至超出可见光范围)的光谱信息,这一特性使其在遥感应用中非常有效。然而,现有的高光谱成像设备会给 HSI 带来严重的退化,因此高光谱图像去噪近来受到广泛关注。尽管近期的深度 HSI 去噪方法给出了有效的解决方案,但由于缺乏对新数据的适应能力,它们在真实复杂噪声下的表现仍不理想。为克服这些限制,我们提出了一种利用相关光谱与空间信息的自调制卷积神经网络 SM-CNN。模型的核心是一种新的光谱自调制残差块(SSMRB),它使网络能够根据相邻光谱数据自适应地变换特征,增强网络处理复杂噪声的能力。SSMRB 的引入使我们的去噪网络成为一个动态网络,在去噪每幅输入 HSI 时依据其空间-光谱特性调整预测特征。在合成数据与真实数据上的实验分析表明,SM-CNN 在公开基准数据集上无论定量还是定性都优于其他最新的 HSI 去噪方法。

ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection

  • paper_url: http://arxiv.org/abs/2309.08196
  • repo_url: None
  • paper_authors: Zhimeng Xin, Tianxu Wu, Shiming Chen, Yixiong Zou, Ling Shao, Xinge You
  • for: 针对现有小样本目标检测方法普遍采用两阶段学习策略、却忽略对象从局部到全局推断这一问题,提升小样本目标检测精度。
  • methods: 提出一种可扩展共存注意力(ECEA)模块:模型先在样本充足的基类阶段持续学习"可扩展"能力,再将其迁移到新类阶段,使小样本模型能够快速地把局部区域扩展到共存区域,从而依据局部部件推断完整目标。
  • results: 在 PASCAL VOC 和 COCO 数据集上的大量实验表明,即使部分区域未出现在训练样本中,ECEA 模块也能帮助小样本检测器完整地预测目标;与现有小样本目标检测方法相比,达到了新的最优性能。
    Abstract Few-shot object detection (FSOD) identifies objects from extremely few annotated samples. Most existing FSOD methods, recently, apply the two-stage learning paradigm, which transfers the knowledge learned from abundant base classes to assist the few-shot detectors by learning the global features. However, such existing FSOD approaches seldom consider the localization of objects from local to global. Limited by the scarce training data in FSOD, the training samples of novel classes typically capture part of objects, resulting in such FSOD methods cannot detect the completely unseen object during testing. To tackle this problem, we propose an Extensible Co-Existing Attention (ECEA) module to enable the model to infer the global object according to the local parts. Essentially, the proposed module continuously learns the extensible ability on the base stage with abundant samples and transfers it to the novel stage, which can assist the few-shot model to quickly adapt in extending local regions to co-existing regions. Specifically, we first devise an extensible attention mechanism that starts with a local region and extends attention to co-existing regions that are similar and adjacent to the given local region. We then implement the extensible attention mechanism in different feature scales to progressively discover the full object in various receptive fields. Extensive experiments on the PASCAL VOC and COCO datasets show that our ECEA module can assist the few-shot detector to completely predict the object despite some regions failing to appear in the training samples and achieve the new state of the art compared with existing FSOD methods.
    摘要 小样本目标检测(FSOD)旨在从极少量的标注样本中识别目标。现有的 FSOD 方法大多采用两阶段学习范式,通过学习全局特征,将从丰富基类中学到的知识迁移给小样本检测器。然而,这些方法很少考虑目标从局部到全局的定位问题。受限于 FSOD 中稀缺的训练数据,新类的训练样本通常只覆盖目标的一部分,导致此类方法在测试时无法检测出完整的、未见过的目标。为解决该问题,我们提出可扩展共存注意力(ECEA)模块,使模型能够依据局部部件推断完整目标。该模块在样本充足的基类阶段持续学习可扩展能力,并将其迁移到新类阶段,帮助小样本模型快速地将局部区域扩展到共存区域。具体而言,我们首先设计了一种可扩展注意力机制:从某个局部区域出发,将注意力扩展到与其相似且相邻的共存区域;随后在不同特征尺度上实施该机制,在不同感受野中逐步发现完整目标。在 PASCAL VOC 和 COCO 数据集上的大量实验表明,即使部分区域未出现在训练样本中,ECEA 模块也能帮助小样本检测器完整地预测目标,并相比现有 FSOD 方法达到了新的最优性能。

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

  • paper_url: http://arxiv.org/abs/2309.08644
  • repo_url: None
  • paper_authors: Sungchan Park, Eunyi You, Inhoe Lee, Joonseok Lee
  • for: 3DMPPE (3D pose estimation for multi-person from a monocular video)
  • methods: sequence-to-sequence 2D-to-3D lifting model with novel geometry-aware data augmentation strategy
  • results: robust generalization to diverse unseen views, robust recovery against heavy occlusions, and more natural and smoother outputs
  • for: 3DMPPE 是计算机视觉中一项非常有价值的任务,尤其是在多人3Dpose estimation方面,现有的方法仍然处于不稳定的阶段,尚未应用于实际场景。我们提出了三个未解决的问题:在训练过程中不能处理未见视图,容易受到遮挡,并且输出存在剧烈晃动。
  • methods: 我们提出了一种基于序列到序列的2D-to-3D升级模型,利用了一种新的geometry-aware数据增强策略,可以生成无限多的视图数据,同时注意到地面和遮挡。
  • results: 我们的模型与数据增强方法能够稳健地泛化到多种未见视角,在严重遮挡下鲁棒地恢复姿态,并生成更自然、更平滑的输出。我们的方法不仅在公开 benchmark 上取得了最先进的性能,在更具挑战性的真实场景(in-the-wild)视频上也获得了出色的定性结果。
    Abstract 3D pose estimation is an invaluable task in computer vision with various practical applications. Especially, 3D pose estimation for multi-person from a monocular video (3DMPPE) is particularly challenging and is still largely uncharted, far from applying to in-the-wild scenarios yet. We pose three unresolved issues with the existing methods: lack of robustness on unseen views during training, vulnerability to occlusion, and severe jittering in the output. As a remedy, we propose POTR-3D, the first realization of a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy, capable of generating unbounded data with a variety of views while caring about the ground plane and occlusions. Through extensive experiments, we verify that the proposed model and data augmentation robustly generalizes to diverse unseen views, robustly recovers the poses against heavy occlusions, and reliably generates more natural and smoother outputs. The effectiveness of our approach is verified not only by achieving the state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos. Demo videos are available at https://www.youtube.com/@potr3d.
    摘要 3D 姿态估计是计算机视觉中极具价值的任务,具有多种实际应用。其中,从单目视频中进行多人 3D 姿态估计(3DMPPE)尤其具有挑战性,目前仍是一片尚待探索的领域,距离真实场景应用还很遥远。我们指出现有方法存在三个未解决的问题:对训练中未见过的视角缺乏鲁棒性、易受遮挡影响、输出存在严重抖动。为此,我们提出 POTR-3D,它是首个面向 3DMPPE 的序列到序列 2D 到 3D 提升模型,并配合一种新颖的几何感知数据增强策略,能够在顾及地面与遮挡的同时生成具有多样视角的无限量数据。大量实验验证了所提模型与数据增强方法能够稳健地泛化到各种未见视角,在严重遮挡下鲁棒地恢复姿态,并可靠地生成更自然、更平滑的输出。我们的方法不仅在公开 benchmark 上取得了最先进的性能,在更具挑战性的真实场景视频上的定性结果也验证了其有效性。演示视频见 https://www.youtube.com/@potr3d。
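下面的 numpy 片段只是对"几何感知视角增强"这一思想的示意:对世界坐标系下的 3D 关节点施加随机偏航/俯仰的相机旋转并做针孔投影,从而得到新的训练视角。内参 f、cx、cy 以及相机前方的平移均为占位假设,论文中关于地面与遮挡的处理并未包含。

```python
import numpy as np

def random_view_projection(joints_3d, f=1000.0, cx=512.0, cy=512.0):
    """Sketch of geometry-aware view augmentation: rotate 3D joints
    (shape (persons, joints, 3), metres) by a random yaw/pitch camera and
    project them with a pinhole model to obtain 2D keypoints for training."""
    yaw, pitch = np.random.uniform(-np.pi, np.pi), np.random.uniform(-0.3, 0.3)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)], [0, np.sin(pitch), np.cos(pitch)]])
    cam = joints_3d @ (Rx @ Ry).T + np.array([0.0, 0.0, 5.0])   # push scene in front of camera
    uv = np.stack([f * cam[..., 0] / cam[..., 2] + cx,
                   f * cam[..., 1] / cam[..., 2] + cy], axis=-1)
    return uv, cam
```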

STDG: Semi-Teacher-Student Training Paradigram for Depth-guided One-stage Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2309.08179
  • repo_url: None
  • paper_authors: Xukun Zhou, Zhenbo Song, Jun He, Hongyan Liu, Zhaoxin Fan
  • for: 提高自主机器人系统对环境的理解能力,尤其是在背景复杂的场景下
  • methods: 提出一种深度引导的单阶段场景图生成方法,包括深度引导的 HHA 表示生成模块、深度引导的半教学网络学习模块以及深度引导的场景图生成模块(HHA 表示的简化示意见摘要后)
  • results: 与基线相比,该方法在单阶段场景图生成任务上显著提升了性能
    Abstract Scene Graph Generation is a critical enabler of environmental comprehension for autonomous robotic systems. Most of existing methods, however, are often thwarted by the intricate dynamics of background complexity, which limits their ability to fully decode the inherent topological information of the environment. Additionally, the wealth of contextual information encapsulated within depth cues is often left untapped, rendering existing approaches less effective. To address these shortcomings, we present STDG, an avant-garde Depth-Guided One-Stage Scene Graph Generation methodology. The innovative architecture of STDG is a triad of custom-built modules: The Depth Guided HHA Representation Generation Module, the Depth Guided Semi-Teaching Network Learning Module, and the Depth Guided Scene Graph Generation Module. This trifecta of modules synergistically harnesses depth information, covering all aspects from depth signal generation and depth feature utilization, to the final scene graph prediction. Importantly, this is achieved without imposing additional computational burden during the inference phase. Experimental results confirm that our method significantly enhances the performance of one-stage scene graph generation baselines.
    摘要 场景图生成是自主机器人系统理解环境的关键能力之一。然而,现有方法大多受制于复杂的背景动态,难以完全解读环境固有的拓扑信息;同时,深度线索中蕴含的丰富上下文信息往往未被利用,使得现有方法效果欠佳。为弥补这些不足,我们提出 STDG,一种新颖的深度引导单阶段场景图生成方法。STDG 由三个定制模块构成:深度引导的 HHA 表示生成模块、深度引导的半教学网络学习模块,以及深度引导的场景图生成模块。这三个模块协同利用深度信息,覆盖了从深度信号生成、深度特征利用到最终场景图预测的全部环节,且不会在推理阶段引入额外的计算负担。实验结果证实,该方法显著提升了单阶段场景图生成基线的性能。
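下面给出一个高度简化的 HHA 风格编码示意(numpy):由度量深度图得到水平视差、离地高度和表面法向与重力方向的夹角三个通道。真实的 HHA 需要估计重力方向,这里假设其与图像竖直方向一致,f 与 camera_height 亦为占位参数,仅供理解,并非论文实现。

```python
import numpy as np

def simple_hha(depth, f=525.0, camera_height=1.2):
    """Very simplified HHA-style encoding from a metric depth map:
    channel 1 = horizontal disparity, channel 2 = crude height above an
    assumed ground plane, channel 3 = angle of the surface normal w.r.t.
    an assumed gravity direction (the image's vertical axis)."""
    h, w = depth.shape
    disparity = 1.0 / np.clip(depth, 0.1, None)
    v = np.arange(h).reshape(-1, 1) - h / 2.0
    height = camera_height - depth * v / f                 # crude height above ground
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    normal = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(normal[..., 1], -1.0, 1.0)))  # angle with "up" axis
    return np.stack([disparity, height, angle], axis=-1)
```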

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

  • paper_url: http://arxiv.org/abs/2309.08167
  • repo_url: https://github.com/dun-research/drca
  • paper_authors: Rui Deng, Qian Wu, Yuke Li, Haoran Fu
  • for: 提高视频推理效率,应对各种场景中的快速变化和细化要求。
  • methods: 提出一种高效的视频表示网络,通过不同分辨率层次进行压缩和对齐,从早期网络阶段减少计算成本,保持时间相关性。
  • results: 实验结果显示,我们的方法在近重复视频检索上取得了最佳的效率-性能权衡,在动态视频分类上也取得了与最新方法相竞争的结果。代码:https://github.com/dun-research/DRCA。
    Abstract Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicit discard of spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be end-to-end optimized via the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code:https://github.com/dun-research/DRCA
    摘要 随着视频分析需求的增长,优化视频推理效率变得日益重要。现有一些方法通过显式丢弃空间或时间信息来获得高效率,但这在快速变化和细粒度的场景中会带来问题。为此,我们提出一种带有可微分分辨率压缩与对齐机制的高效视频表示网络:在网络早期阶段压缩非关键信息以降低计算成本,同时保持一致的时间相关性。具体而言,我们利用可微分的上下文感知压缩模块对显著帧与非显著帧特征进行编码,将特征精炼并更新为一个高/低分辨率混合的视频序列;随后引入新的分辨率对齐 transformer 层,在不同分辨率的帧特征之间捕获全局时间相关性,并通过在低分辨率非显著帧上使用更少的空间 token,使空间计算成本按二次方规律下降。借助可微分压缩模块,整个网络可以端到端优化。实验结果表明,我们的方法在近重复视频检索上取得了最佳的效率-性能权衡,在动态视频分类上也取得了与最新方法相竞争的结果。代码:https://github.com/dun-research/DRCA

A Ground Segmentation Method Based on Point Cloud Map for Unstructured Roads

  • paper_url: http://arxiv.org/abs/2309.08164
  • repo_url: None
  • paper_authors: Zixuan Li, Haiying Lin, Zhangyu Wang, Huazhi Li, Miao Yu, Jie Wang
  • for: 提高非结构化道路场景中的地面分割精度
  • methods: 基于点云地图的方法,包括感兴趣区域提取、点云地图与实时点云的位置关联,以及基于高斯分布的背景差分(背景差分的示意代码见摘要后)
  • results: 实验结果显示,地面点的正确分割率为 99.95%,运行时间为 26ms;与最新方法 Patchwork++ 相比,地面点分割平均精度提高 7.43%,运行时间增加 17ms。
    Abstract Ground segmentation, as the basic task of unmanned intelligent perception, provides an important support for the target detection task. Unstructured road scenes represented by open-pit mines have irregular boundary lines and uneven road surfaces, which lead to segmentation errors in current ground segmentation methods. To solve this problem, a ground segmentation method based on point cloud map is proposed, which involves three parts: region of interest extraction, point cloud registration and background subtraction. Firstly, establishing boundary semantic associations to obtain regions of interest in unstructured roads. Secondly, establishing the location association between point cloud map and the real-time point cloud of region of interest by semantics information. Thirdly, establishing a background model based on Gaussian distribution according to location association, and segments the ground in real-time point cloud by the background substraction method. Experimental results show that the correct segmentation rate of ground points is 99.95%, and the running time is 26ms. Compared with state of the art ground segmentation algorithm Patchwork++, the average accuracy of ground point segmentation is increased by 7.43%, and the running time is increased by 17ms. Furthermore, the proposed method is practically applied to unstructured road scenarios represented by open pit mines.
    摘要 地面分割作为无人智能感知的基础任务,为目标检测任务提供了重要支撑。以露天矿为代表的非结构化道路场景边界线不规则、路面起伏不平,导致现有地面分割方法出现分割错误。为解决该问题,本文提出一种基于点云地图的地面分割方法,包含三个部分:感兴趣区域提取、点云配准和背景差分。首先,建立边界语义关联,获取非结构化道路中的感兴趣区域;其次,利用语义信息建立点云地图与感兴趣区域实时点云之间的位置关联;最后,依据位置关联建立基于高斯分布的背景模型,并通过背景差分方法在实时点云中分割出地面。实验结果表明,地面点的正确分割率为 99.95%,运行时间为 26ms;与最新的地面分割算法 Patchwork++ 相比,地面点分割平均精度提高 7.43%,运行时间增加 17ms。此外,该方法已实际应用于以露天矿为代表的非结构化道路场景。
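下面的 numpy 片段示意"基于高斯分布的背景模型 + 背景差分"这一步:假设点云地图已离线统计出每个栅格单元地面高度的均值与标准差,实时点落在 k 倍标准差范围内即判为地面点。栅格尺寸 cell 与阈值 k 均为示意性假设,并非论文的具体实现。

```python
import numpy as np

def ground_mask(points, grid_mean, grid_std, cell=0.5, k=3.0):
    """Background-subtraction sketch: each map cell stores a Gaussian model
    (mean, std) of ground height built offline from the point cloud map; a
    real-time point (x, y, z) is labelled ground if its z lies within k
    standard deviations of its cell's model."""
    ix = np.clip((points[:, 0] / cell).astype(int), 0, grid_mean.shape[0] - 1)
    iy = np.clip((points[:, 1] / cell).astype(int), 0, grid_mean.shape[1] - 1)
    mu = grid_mean[ix, iy]
    sigma = np.maximum(grid_std[ix, iy], 1e-3)
    return np.abs(points[:, 2] - mu) < k * sigma    # True = ground point
```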

Cross-Modal Synthesis of Structural MRI and Functional Connectivity Networks via Conditional ViT-GANs

  • paper_url: http://arxiv.org/abs/2309.08160
  • repo_url: None
  • paper_authors: Yuda Bi, Anees Abrol, Jing Sui, Vince Calhoun
  • for: 本研究旨在利用条件视觉 Transformer 生成对抗网络(cViT-GAN),由结构磁共振成像(sMRI)合成功能网络连接(FNC)数据,并分析二者之间的结构-功能关联。
  • methods: 本研究以 sMRI 数据为输入,利用 cViT-GAN 为每个受试者合成 FNC 矩阵,并在此基础上得到组间差异 FNC 矩阵,其与真实 FNC 矩阵的 Pearson 相关系数达到 0.73。
  • results: FNC 可视化结果显示特定皮层下脑区之间存在显著相关性,表明模型能够捕捉细致的结构-功能关联;这一表现也优于 Pix2Pix 等基于 CNN 的条件 GAN 方法。
    Abstract The cross-modal synthesis between structural magnetic resonance imaging (sMRI) and functional network connectivity (FNC) is a relatively unexplored area in medical imaging, especially with respect to schizophrenia. This study employs conditional Vision Transformer Generative Adversarial Networks (cViT-GANs) to generate FNC data based on sMRI inputs. After training on a comprehensive dataset that included both individuals with schizophrenia and healthy control subjects, our cViT-GAN model effectively synthesized the FNC matrix for each subject, and then formed a group difference FNC matrix, obtaining a Pearson correlation of 0.73 with the actual FNC matrix. In addition, our FNC visualization results demonstrate significant correlations in particular subcortical brain regions, highlighting the model's capability of capturing detailed structural-functional associations. This performance distinguishes our model from conditional CNN-based GAN alternatives such as Pix2Pix. Our research is one of the first attempts to link sMRI and FNC synthesis, setting it apart from other cross-modal studies that concentrate on T1- and T2-weighted MR images or the fusion of MRI and CT scans.
    摘要 结构磁共振成像(sMRI)与功能网络连接(FNC)之间的跨模态合成在医学影像领域仍相对缺乏研究,在精神分裂症方面尤其如此。本研究采用条件视觉 Transformer 生成对抗网络(cViT-GAN),以 sMRI 为输入生成 FNC 数据。在包含精神分裂症患者与健康对照的完整数据集上训练后,cViT-GAN 模型有效地为每个受试者合成了 FNC 矩阵,进而得到组间差异 FNC 矩阵,其与真实 FNC 矩阵的 Pearson 相关系数达到 0.73。此外,FNC 可视化结果显示特定皮层下脑区之间存在显著相关性,体现了模型捕捉细致结构-功能关联的能力,这一性能使其区别于 Pix2Pix 等基于 CNN 的条件 GAN 方法。本研究是将 sMRI 与 FNC 合成联系起来的首批尝试之一,有别于其他聚焦 T1/T2 加权 MR 图像或 MRI 与 CT 融合的跨模态研究。

AdSEE: Investigating the Impact of Image Style Editing on Advertisement Attractiveness

  • paper_url: http://arxiv.org/abs/2309.08159
  • repo_url: https://github.com/liyaojiang1998/adsee
  • paper_authors: Liyao Jiang, Chenglin Li, Haolan Chen, Xiaodong Gao, Xinwang Zhong, Yang Qiu, Shani Ye, Di Niu
  • for: 这篇论文研究了对广告图像进行语义编辑是否会影响在线广告的点击率。
  • methods: 该论文使用基于 StyleGAN 的人脸语义编辑与反演,并在传统视觉与文本特征之外引入基于 GAN 的人脸隐空间表示来预测点击率(遗传搜索编辑方向的示意见摘要后)。
  • results: 基于收集的大规模数据集 QQ-AD(含 20,527 条在线广告)进行了大量离线测试,研究了不同语义方向及其编辑强度对点击率的影响;并设计了基于遗传算法的广告编辑器,高效搜索最优的编辑方向与强度。为期 5 天的在线 A/B 测试表明,经 AdSEE 编辑的样本相比原始广告对照组获得了更高的点击率。
    Abstract Online advertisements are important elements in e-commerce sites, social media platforms, and search engines. With the increasing popularity of mobile browsing, many online ads are displayed with visual information in the form of a cover image in addition to text descriptions to grab the attention of users. Various recent studies have focused on predicting the click rates of online advertisements aware of visual features or composing optimal advertisement elements to enhance visibility. In this paper, we propose Advertisement Style Editing and Attractiveness Enhancement (AdSEE), which explores whether semantic editing to ads images can affect or alter the popularity of online advertisements. We introduce StyleGAN-based facial semantic editing and inversion to ads images and train a click rate predictor attributing GAN-based face latent representations in addition to traditional visual and textual features to click rates. Through a large collected dataset named QQ-AD, containing 20,527 online ads, we perform extensive offline tests to study how different semantic directions and their edit coefficients may impact click rates. We further design a Genetic Advertisement Editor to efficiently search for the optimal edit directions and intensity given an input ad cover image to enhance its projected click rates. Online A/B tests performed over a period of 5 days have verified the increased click-through rates of AdSEE-edited samples as compared to a control group of original ads, verifying the relation between image styles and ad popularity. We open source the code for AdSEE research at https://github.com/LiyaoJiang1998/adsee.
    摘要 在线广告是电商网站、社交媒体平台和搜索引擎中的重要元素。随着移动端浏览的普及,许多在线广告除文字描述外还以封面图的形式呈现视觉信息,以吸引用户注意。近期不少研究聚焦于结合视觉特征预测在线广告的点击率,或通过组合最优广告元素来提升曝光效果。本文提出广告风格编辑与吸引力增强方法 AdSEE,探究对广告图像进行语义编辑是否会影响其受欢迎程度。我们将基于 StyleGAN 的人脸语义编辑与反演引入广告图像,并在传统视觉与文本特征之外,利用基于 GAN 的人脸隐空间表示训练点击率预测器。基于收集的 QQ-AD 数据集(含 20,527 条在线广告),我们进行了大量离线测试,研究不同语义方向及其编辑系数对点击率的影响;并进一步设计了遗传广告编辑器,针对输入的广告封面图高效搜索最优编辑方向与强度,以提升其预测点击率。为期 5 天的在线 A/B 测试验证了 AdSEE 编辑样本相对原始广告对照组的点击率提升,证实了图像风格与广告受欢迎程度之间的关联。AdSEE 的研究代码已开源:https://github.com/LiyaoJiang1998/adsee。
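下面的 Python 片段是对"遗传搜索编辑方向与强度"这一思想的草图:在若干候选语义方向与连续强度上迭代选择、变异,使一个点击率预测器的输出最大化。其中 ctr_predictor 为假设的可调用对象,种群规模、代数与变异幅度等参数亦为示意,并非 AdSEE 的官方实现。

```python
import numpy as np

def genetic_edit_search(latent, directions, ctr_predictor, pop=32, gens=20, sigma=0.3):
    """Genetic search over (edit direction index, intensity) pairs that
    maximises a predicted click-through rate. `latent` is a face latent code,
    `directions` has shape (n_dir, latent_dim), and `ctr_predictor(latent)`
    is a hypothetical callable returning a predicted CTR."""
    n_dir = directions.shape[0]
    genes = np.stack([np.random.randint(0, n_dir, pop),
                      np.random.uniform(-3.0, 3.0, pop)], axis=1)
    for _ in range(gens):
        edited = [latent + g[1] * directions[int(g[0])] for g in genes]
        fitness = np.array([ctr_predictor(z) for z in edited])
        top = genes[np.argsort(fitness)[-pop // 2:]]                   # keep the best half
        children = top.copy()
        children[:, 1] += np.random.normal(0.0, sigma, len(children))  # mutate intensity
        genes = np.concatenate([top, children], axis=0)
    scores = [ctr_predictor(latent + g[1] * directions[int(g[0])]) for g in genes]
    best = genes[int(np.argmax(scores))]
    return int(best[0]), float(best[1])
```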

Uncertainty-Aware Multi-View Visual Semantic Embedding

  • paper_url: http://arxiv.org/abs/2309.08154
  • repo_url: None
  • paper_authors: Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Xingguang Wang, Huayi Wu
  • for: This paper aims to improve image-text retrieval by leveraging semantic information and modeling uncertainty in multi-modal understanding.
  • methods: The proposed Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE) framework decomposes the overall image-text matching into multiple view-text matchings and uses an uncertainty-aware loss function (UALoss) to adaptively model the uncertainty in each view-text correspondence.
  • results: Experimental results on the Flicker30k and MS-COCO datasets demonstrate that UAMVSE outperforms state-of-the-art models.
    Abstract The key challenge in image-text retrieval is effectively leveraging semantic information to measure the similarity between vision and language data. However, using instance-level binary labels, where each image is paired with a single text, fails to capture multiple correspondences between different semantic units, leading to uncertainty in multi-modal semantic understanding. Although recent research has captured fine-grained information through more complex model structures or pre-training techniques, few studies have directly modeled uncertainty of correspondence to fully exploit binary labels. To address this issue, we propose an Uncertainty-Aware Multi-View Visual Semantic Embedding (UAMVSE)} framework that decomposes the overall image-text matching into multiple view-text matchings. Our framework introduce an uncertainty-aware loss function (UALoss) to compute the weighting of each view-text loss by adaptively modeling the uncertainty in each view-text correspondence. Different weightings guide the model to focus on different semantic information, enhancing the model's ability to comprehend the correspondence of images and texts. We also design an optimized image-text matching strategy by normalizing the similarity matrix to improve model performance. Experimental results on the Flicker30k and MS-COCO datasets demonstrate that UAMVSE outperforms state-of-the-art models.
    摘要 “图像文本检索的关键挑战是有效地利用semantic信息来衡量视觉和语言数据之间的相似性。然而,使用实例级binary标签,其中每个图像与一个文本相对应,无法捕捉不同semantic单元之间的多重匹配关系,从而导致多modal semantic理解的uncertainty。虽然latest research捕捉了细腻信息通过更复杂的模型结构或预训练技术,但很少研究直接模型匹配不确定性,以全面利用binary标签。为解决这个问题,我们提出了一种Uncertainty-Aware Multi-View Visual Semantic Embedding(UAMVSE) frameworks,它将整个图像文本匹配分解成多个视图文本匹配。我们的 frameworks introduce一种uncertainty-aware损失函数(UALoss)来计算每个视图文本损失的权重,通过自适应地模型不确定性来computing each view-text correspondence的uncertainty。不同的权重导引模型更好地理解图像和文本之间的匹配关系,提高模型对多modal semantic理解的能力。我们还设计了一种优化图像文本匹配策略,通过normalizing similarity Matrix来提高模型性能。实验结果表明,UAMVSE在Flicker30k和MS-COCO数据集上超过了当前状态的模型性能。”
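下面给出一个不确定性加权多视角损失的示意(PyTorch):为每个视角-文本匹配损失学习一个对数方差,自动降低高不确定性视角的权重。此处采用的是 Kendall 与 Gal 式的经典形式,仅作为对 UALoss 思路的假设性示意,并非论文的具体公式。

```python
import torch
import torch.nn as nn

class UncertaintyWeightedViewLoss(nn.Module):
    """Sketch of uncertainty-aware weighting of per-view matching losses:
    each view-text correspondence gets a learned log-variance, so views with
    higher uncertainty contribute less to the total loss."""
    def __init__(self, num_views):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_views))

    def forward(self, view_losses):              # view_losses: (num_views,) tensor
        precision = torch.exp(-self.log_var)     # adaptive per-view weights
        return (precision * view_losses + self.log_var).sum()
```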

DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions

  • paper_url: http://arxiv.org/abs/2309.08152
  • repo_url: None
  • paper_authors: Minsik Jeon, Junwon Seo, Jihong Min
  • for: 提高 object detection 方法在不良天气 condition 中的可靠性
  • methods: 使用 unsupervised domain adaptation 方法,分别处理 style 和 weather 两个领域的差异,以提高 object detection 的可靠性
  • results: 对比其他方法,本方法在不良天气 condition 下的 object detection 性能更高
    Abstract Despite the success of deep learning-based object detection methods in recent years, it is still challenging to make the object detector reliable in adverse weather conditions such as rain and snow. For the robust performance of object detectors, unsupervised domain adaptation has been utilized to adapt the detection network trained on clear weather images to adverse weather images. While previous methods do not explicitly address weather corruption during adaptation, the domain gap between clear and adverse weather can be decomposed into two factors with distinct characteristics: a style gap and a weather gap. In this paper, we present an unsupervised domain adaptation framework for object detection that can more effectively adapt to real-world environments with adverse weather conditions by addressing these two gaps separately. Our method resolves the style gap by concentrating on style-related information of high-level features using an attention module. Using self-supervised contrastive learning, our framework then reduces the weather gap and acquires instance features that are robust to weather corruption. Extensive experiments demonstrate that our method outperforms other methods for object detection in adverse weather conditions.
    摘要 尽管近年来基于深度学习的目标检测方法取得了成功,但在雨雪等恶劣天气条件下,目标检测器的可靠性仍然是一个挑战。为了增强检测器的鲁棒性,无监督领域自适应被用于将基于晴天图像训练的检测网络适配到恶劣天气图像。以往的方法并未在自适应过程中显式地处理天气造成的退化,而晴天与恶劣天气之间的领域差异其实可以分解为特性各异的两个因素:风格差异和天气差异。本文提出一种针对目标检测的无监督领域自适应框架,通过分别处理这两类差异,更有效地适应真实世界中的恶劣天气环境。我们的方法利用注意力模块聚焦于高层特征中与风格相关的信息来消除风格差异;随后利用自监督对比学习缩小天气差异,获得对天气退化鲁棒的实例特征。大量实验表明,在恶劣天气条件下,我们的方法优于其他目标检测方法。

Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs

  • paper_url: http://arxiv.org/abs/2309.08146
  • repo_url: https://github.com/awsaf49/synatt
  • paper_authors: Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Shaikh Anowarul Fattah, Mohammad Saquib
  • for: 本研究旨在识别合成语音并追溯其生成算法,以提升对伪造语音的检测与溯源能力。
  • methods: 本研究提出了一种新的归因策略:将语音轨迹转换为对数梅尔频谱图,使用 CNN 提取特征,并借助半监督与集成学习提升模型的鲁棒性与泛化能力(特征提取与分类的示意见摘要后)。
  • results: 在 ICASSP 2022 的 IEEE SP Cup 挑战中,本方法在强扰动数据集(Eval 2)上以 12-13% 的精度优势、在弱扰动数据集(Eval 1)上以 1-2% 的精度优势领先其他顶尖队伍。
    Abstract With the huge technological advances introduced by deep learning in audio & speech processing, many novel synthetic speech techniques achieved incredible realistic results. As these methods generate realistic fake human voices, they can be used in malicious acts such as people imitation, fake news, spreading, spoofing, media manipulations, etc. Hence, the ability to detect synthetic or natural speech has become an urgent necessity. Moreover, being able to tell which algorithm has been used to generate a synthetic speech track can be of preeminent importance to track down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator that is used to synthesize it. The proposed detector transforms the audio into log-mel spectrogram, extracts features using CNN, and classifies it between five known and unknown algorithms, utilizing semi-supervision and ensemble to improve its robustness and generalizability significantly. The proposed detector is validated on two evaluation datasets consisting of a total of 18,000 weakly perturbed (Eval 1) & 10,000 strongly perturbed (Eval 2) synthetic speeches. The proposed method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
    摘要 随着深度学习在音频与语音处理领域带来的巨大技术进步,许多新的合成语音技术已能生成高度逼真的假人声。这些方法可被用于人身冒充、假新闻传播、欺骗、媒体操纵等恶意行为,因此检测语音是合成还是自然的能力变得十分迫切;此外,判断某段合成语音由哪种算法生成,对追查源头也至关重要。本文提出一种新策略,将合成语音轨迹归因到生成它的算法:检测器先将音频转换为对数梅尔频谱图,用 CNN 提取特征,再在五种已知算法与未知算法之间进行分类,并利用半监督与集成学习显著提升鲁棒性与泛化能力。该检测器在两个评测数据集上得到验证,分别包含 18,000 条弱扰动(Eval 1)与 10,000 条强扰动(Eval 2)合成语音。在 ICASSP 2022 的 IEEE SP Cup 挑战中,所提方法在 Eval 2 上以 12-13%、在 Eval 1 上以 1-2% 的精度优势领先其他顶尖队伍。
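下面的 Python 片段示意"对数梅尔频谱图 + CNN 分类"这一流程:使用 librosa 提取特征、用一个极简的卷积网络输出 5 个已知算法加 1 个未知类别的得分。采样率、梅尔通道数、网络结构以及文件名 track.wav 均为示意性假设,论文中的集成与半监督部分未包含。

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=128):
    """Convert an audio track to a log-mel spectrogram, the input
    representation described in the paper (parameter values are assumptions)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# Minimal CNN head for attributing a track to one of 5 known generators
# plus an "unknown" class; the actual ensemble/semi-supervision is omitted.
classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 6),
)

spec = torch.tensor(log_mel("track.wav")).unsqueeze(0).unsqueeze(0)  # (1, 1, mel, time)
logits = classifier(spec.float())
```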

Multi-Scale Estimation for Omni-Directional Saliency Maps Using Learnable Equator Bias

  • paper_url: http://arxiv.org/abs/2309.08139
  • repo_url: https://github.com/islab-sophia/odisal
  • paper_authors: Takao Yamanaka, Tatsuya Suzuki, Taiki Nobutsune, Chenjunlin Wu
  • for: 本文旨在估计全向图像的显著性图(注视点的概率分布),以便在使用头戴式显示器时检测全向图像中的重要区域。
  • methods: 本文提出了一种新的全向显著性图估计模型:从全向图像中按不同方向与视场角提取相互重叠的 2D 平面图像,并用以仰角为条件的"赤道偏置"层替换 2D 显著性模型中的中心偏置层,同时通过多视场角的多尺度估计与逐像素注意力加权进行融合(赤道偏置层的示意见摘要后)。
  • results: 实验结果表明,所提方法提高了全向显著性图的估计精度。
    Abstract Omni-directional images have been used in wide range of applications. For the applications, it would be useful to estimate saliency maps representing probability distributions of gazing points with a head-mounted display, to detect important regions in the omni-directional images. This paper proposes a novel saliency-map estimation model for the omni-directional images by extracting overlapping 2-dimensional (2D) plane images from omni-directional images at various directions and angles of view. While 2D saliency maps tend to have high probability at the center of images (center bias), the high-probability region appears at horizontal directions in omni-directional saliency maps when a head-mounted display is used (equator bias). Therefore, the 2D saliency model with a center-bias layer was fine-tuned with an omni-directional dataset by replacing the center-bias layer to an equator-bias layer conditioned on the elevation angle for the extraction of the 2D plane image. The limited availability of omni-directional images in saliency datasets can be compensated by using the well-established 2D saliency model pretrained by a large number of training images with the ground truth of 2D saliency maps. In addition, this paper proposes a multi-scale estimation method by extracting 2D images in multiple angles of view to detect objects of various sizes with variable receptive fields. The saliency maps estimated from the multiple angles of view were integrated by using pixel-wise attention weights calculated in an integration layer for weighting the optimal scale to each object. The proposed method was evaluated using a publicly available dataset with evaluation metrics for omni-directional saliency maps. It was confirmed that the accuracy of the saliency maps was improved by the proposed method.
    摘要 全向图像已被应用于众多领域。对这些应用而言,估计表示注视点概率分布的显著性图,有助于在使用头戴式显示器时检测全向图像中的重要区域。本文提出一种新的全向图像显著性图估计模型:从全向图像中按不同方向与视场角提取相互重叠的 2D 平面图像。2D 显著性图通常在图像中心具有较高概率(中心偏置),而在使用头戴式显示器时,全向显著性图的高概率区域往往出现在水平方向(赤道偏置)。因此,我们将带有中心偏置层的 2D 显著性模型中的中心偏置层替换为以提取 2D 平面图像时的仰角为条件的赤道偏置层,并在全向数据集上进行微调。显著性数据集中全向图像数量有限的问题,可以通过使用在大量带有 2D 显著性真值的训练图像上预训练的成熟 2D 显著性模型来弥补。此外,本文还提出一种多尺度估计方法:以多个视场角提取 2D 图像,用可变感受野检测不同尺寸的目标;并在融合层中计算逐像素注意力权重,为每个目标加权选择最优尺度,从而融合多个视场角估计出的显著性图。所提方法在公开数据集上按全向显著性图的评价指标进行了评估,结果证实其提高了显著性图的精度。
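下面的 PyTorch 片段是"赤道偏置层"思想的草图:用一个小型 MLP 将每一行像素对应的仰角映射为可学习的加性偏置,使显著性在赤道附近得到增强。网络结构与具体的条件方式均为假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

class EquatorBias(nn.Module):
    """Sketch of an equator-bias layer replacing a center-bias layer: a small
    MLP maps the elevation angle of each image row to an additive bias, so
    saliency is boosted near the equator of the omni-directional image."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, saliency, elevation):       # saliency: (B,1,H,W), elevation: (H,) radians
        bias = self.mlp(elevation.unsqueeze(-1)).view(1, 1, -1, 1)  # per-row learnable bias
        return saliency + bias
```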

Let’s Roll: Synthetic Dataset Analysis for Pedestrian Detection Across Different Shutter Types

  • paper_url: http://arxiv.org/abs/2309.08136
  • repo_url: None
  • paper_authors: Yue Hu, Gourav Datta, Kira Beerel, Peter Beerel
  • for: 本研究旨在检验不同快门机制对机器学习目标检测模型的影响,以确定两种快门在检测精度上是否存在显著差异。
  • methods: 本研究使用 Unreal Engine 5 生成了成对的全局快门与卷帘快门合成数据,并据此训练和评估主流检测模型。
  • results: 结果表明,快门机制对检测精度的影响存在差异,尤其是在检测低速目标(如行人)时:在粗粒度检测指标(IoU=0.5 下的 mAP)上两种快门几乎没有差别,但在细粒度指标(IoU=0.5:0.95 下的 mAP)上差异显著。这表明许多目标检测应用中机器学习流水线可能无需显式的卷帘快门校正,但若要在无 ISP 的流水线中实现目标的精细定位,缓解卷帘快门效应可能还需进一步研究。
    Abstract Computer vision (CV) pipelines are typically evaluated on datasets processed by image signal processing (ISP) pipelines even though, for resource-constrained applications, an important research goal is to avoid as many ISP steps as possible. In particular, most CV datasets consist of global shutter (GS) images even though most cameras today use a rolling shutter (RS). This paper studies the impact of different shutter mechanisms on machine learning (ML) object detection models on a synthetic dataset that we generate using the advanced simulation capabilities of Unreal Engine 5 (UE5). In particular, we train and evaluate mainstream detection models with our synthetically-generated paired GS and RS datasets to ascertain whether there exists a significant difference in detection accuracy between these two shutter modalities, especially when capturing low-speed objects (e.g., pedestrians). The results of this emulation framework indicate the performance between them are remarkably congruent for coarse-grained detection (mean average precision (mAP) for IOU=0.5), but have significant differences for fine-grained measures of detection accuracy (mAP for IOU=0.5:0.95). This implies that ML pipelines might not need explicit correction for RS for many object detection applications, but mitigating RS effects in ISP-less ML pipelines that target fine-grained location of the objects may need additional research.
    摘要 计算机视觉(CV)流水线通常在经过图像信号处理(ISP)流水线处理的数据集上进行评估,然而对资源受限的应用而言,一个重要的研究目标是尽可能省去 ISP 步骤。特别地,大多数 CV 数据集由全局快门(GS)图像构成,而如今的大多数相机使用的是卷帘快门(RS)。本文利用 Unreal Engine 5(UE5)先进的仿真能力生成合成数据集,研究不同快门机制对机器学习(ML)目标检测模型的影响。具体而言,我们使用生成的成对 GS 与 RS 数据集训练并评估主流检测模型,以确定两种快门模式在检测精度上是否存在显著差异,尤其是在拍摄低速目标(如行人)时。该仿真框架的结果表明,在粗粒度检测指标(IoU=0.5 下的 mAP)上两者表现高度一致,但在细粒度检测精度指标(IoU=0.5:0.95 下的 mAP)上存在显著差异。这意味着在许多目标检测应用中,ML 流水线可能无需对 RS 进行显式校正;但对于以目标精细定位为目标的无 ISP 的 ML 流水线,缓解 RS 效应可能仍需进一步研究。

AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT

  • paper_url: http://arxiv.org/abs/2309.08134
  • repo_url: None
  • paper_authors: Fangbo Qin, Taogang Hou, Shan Lin, Kaiyuan Wang, Michael C. Yip, Shan Yu
  • for: 本研究旨在提出一种支持多目标类别、具备实例感知能力的单样本(one-shot)物体关键点提取方法(AnyOKP),用于实现灵活的以物体为中心的视觉感知。
  • methods: 该方法利用预训练视觉 Transformer(ViT)强大的表示能力,仅凭一张支持图像即可学习并在查询图像上获得关键点:先在支持图像与查询图像之间依据外观相似度搜索最佳原型对(BPPs),得到与实例无关的候选关键点,再将以候选关键点为顶点的图按边上特征分布划分为子图,每个子图对应一个物体实例(BPP 搜索的示意见摘要后)。
  • results: 在机械臂、移动机器人和手术机器人相机采集的真实物体图像上的实验表明,该方法具有跨类别的灵活性与实例感知能力,并对领域迁移和视角变化表现出显著的鲁棒性。
    Abstract Towards flexible object-centric visual perception, we propose a one-shot instance-aware object keypoint (OKP) extraction approach, AnyOKP, which leverages the powerful representation ability of pretrained vision transformer (ViT), and can obtain keypoints on multiple object instances of arbitrary category after learning from a support image. An off-the-shelf petrained ViT is directly deployed for generalizable and transferable feature extraction, which is followed by training-free feature enhancement. The best-prototype pairs (BPPs) are searched for in support and query images based on appearance similarity, to yield instance-unaware candidate keypoints.Then, the entire graph with all candidate keypoints as vertices are divided to sub-graphs according to the feature distributions on the graph edges. Finally, each sub-graph represents an object instance. AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot, which not only demonstrates the cross-category flexibility and instance awareness, but also show remarkable robustness to domain shift and viewpoint change.
    摘要 向flexible对象中心视觉进化,我们提出了一种一shot实例感知对象关键点(OKP)提取方法,AnyOKP,它利用预训练的视transformer(ViT)的强大表示能力,可以在学习支持图片后,从多个对象实例中提取关键点。 directly使用已经训练的ViT进行通用和转移的特征提取,然后通过无需训练的特征增强。在支持图片和查询图片中基于外观相似性搜索最佳原型对(BPPs),然后将整个图像的所有候选关键点作为顶点组织成图像,最终每个子图像表示一个对象实例。 AnyOKP在实际对象图像中进行了测试,不仅表现出了跨类弹性和实例意识,而且还具有了remarkable的领域变化和视点变化的Robustness。
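下面的 PyTorch 片段示意"最佳原型对(BPP)搜索"的核心步骤:对支持图像与查询图像的 ViT patch token 做余弦相似度匹配,选出相似度最高的查询 patch 作为与实例无关的候选关键点。此处假设 patch 特征已由预训练 ViT 提取,top_k 为示意参数,后续按边特征划分子图的步骤未包含,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def best_prototype_pairs(support_tokens, query_tokens, top_k=20):
    """Given ViT patch tokens from a support image (Ns, D) and a query image
    (Nq, D), pick the query patches with the highest appearance similarity to
    any support patch as instance-unaware candidate keypoints."""
    s = F.normalize(support_tokens, dim=-1)        # (Ns, D)
    q = F.normalize(query_tokens, dim=-1)          # (Nq, D)
    sim = q @ s.t()                                # (Nq, Ns) cosine similarities
    best_sim, best_support = sim.max(dim=1)        # best support prototype per query patch
    cand = best_sim.topk(min(top_k, q.shape[0])).indices
    return cand, best_support[cand]                # candidate patch ids + matched prototypes
```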

Increasing diversity of omni-directional images generated from single image using cGAN based on MLPMixer

  • paper_url: http://arxiv.org/abs/2309.08129
  • repo_url: https://github.com/islab-sophia/odigen-mlpmixer
  • paper_authors: Atsuya Nakata, Ryuto Miyazaki, Takao Yamanaka
  • for: 这个论文提出了一种生成全向图像的新方法,该方法可以将单张图像转换成全向图像。
  • methods: 该方法在条件生成对抗网络(cGAN)中使用多层感知机混合器(MLPMixer)取代传统的卷积神经网络(CNN);MLPMixer 能够在全图范围内更高效地传递信息,从而提高生成的全向图像的多样性和质量。
  • results: 该方法可以减少内存占用和计算成本,同时保持与传统方法相当的性能水平,并且可以生成更多样化的全向图像。
    Abstract This paper proposes a novel approach to generating omni-directional images from a single snapshot picture. The previous method has relied on the generative adversarial networks based on convolutional neural networks (CNN). Although this method has successfully generated omni-directional images, CNN has two drawbacks for this task. First, since a convolutional layer only processes a local area, it is difficult to propagate the information of an input snapshot picture embedded in the center of the omni-directional image to the edges of the image. Thus, the omni-directional images created by the CNN-based generator tend to have less diversity at the edges of the generated images, creating similar scene images. Second, the CNN-based model requires large video memory in graphics processing units due to the nature of the deep structure in CNN since shallow-layer networks only receives signals from a limited range of the receptive field. To solve these problems, MLPMixer-based method was proposed in this paper. The MLPMixer has been proposed as an alternative to the self-attention in the transformer, which captures long-range dependencies and contextual information. This enables to propagate information efficiently in the omni-directional image generation task. As a result, competitive performance has been achieved with reduced memory consumption and computational cost, in addition to increasing diversity of the generated omni-directional images.
    摘要 本文提出一种由单张快照图像生成全向图像的新方法。以往的方法依赖于基于卷积神经网络(CNN)的生成对抗网络。尽管这类方法能够成功生成全向图像,但 CNN 在该任务上存在两个缺点:其一,卷积层只处理局部区域,难以将嵌入在全向图像中心的输入快照信息传播到图像边缘,因此基于 CNN 的生成器生成的全向图像在边缘处多样性不足,容易出现相似的场景;其二,由于 CNN 的深层结构(浅层网络只能接收有限感受野内的信号),模型需要占用大量 GPU 显存。为解决这些问题,本文提出基于 MLPMixer 的方法。MLPMixer 最初被提出作为 transformer 中自注意力的替代方案,能够捕获长程依赖与上下文信息,从而在全向图像生成任务中高效地传递信息。结果表明,该方法在降低显存占用与计算成本的同时取得了有竞争力的性能,并提高了生成全向图像的多样性。

MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces

  • paper_url: http://arxiv.org/abs/2309.08113
  • repo_url: https://github.com/yinzhicun/metaf2n
  • paper_authors: Zhicun Yin, Ming Liu, Xiaoming Li, Hui Yang, Longan Xiao, Wangmeng Zuo
  • for: 提高低品质图像的超解析表现
  • methods: 利用 faces 内容为模型进行 fine-tuning,避免了低品质图像生成和优化不确定性
  • results: 在实验中,MetaF2N 可以从一次 fine-tuning 中获得良好的表现,并且可以适应不同的自然图像环境
    Abstract Due to their highly structured characteristics, faces are easier to recover than natural scenes for blind image super-resolution. Therefore, we can extract the degradation representation of an image from the low-quality and recovered face pairs. Using the degradation representation, realistic low-quality images can then be synthesized to fine-tune the super-resolution model for the real-world low-quality image. However, such a procedure is time-consuming and laborious, and the gaps between recovered faces and the ground-truths further increase the optimization uncertainty. To facilitate efficient model adaptation towards image-specific degradations, we propose a method dubbed MetaF2N, which leverages the contained Faces to fine-tune model parameters for adapting to the whole Natural image in a Meta-learning framework. The degradation extraction and low-quality image synthesis steps are thus circumvented in our MetaF2N, and it requires only one fine-tuning step to get decent performance. Considering the gaps between the recovered faces and ground-truths, we further deploy a MaskNet for adaptively predicting loss weights at different positions to reduce the impact of low-confidence areas. To evaluate our proposed MetaF2N, we have collected a real-world low-quality dataset with one or multiple faces in each image, and our MetaF2N achieves superior performance on both synthetic and real-world datasets. Source code, pre-trained models, and collected datasets are available at https://github.com/yinzhicun/MetaF2N.
    摘要 由于人脸具有高度结构化的特征,在盲图像超分辨率中,人脸比自然场景更容易复原。因此,我们可以从低质量图像与复原后的人脸对中提取图像的退化表示,再利用该退化表示合成逼真的低质量图像,对超分辨率模型进行微调,使其适应真实世界的低质量图像。然而,这一流程耗时费力,且复原人脸与真值之间的差距会进一步增大优化的不确定性。为了高效地使模型适应特定图像的退化,我们提出 MetaF2N 方法,在元学习框架下利用图像中包含的人脸来微调模型参数,从而适应整幅自然图像。这样便绕过了退化提取与低质量图像合成两个步骤,只需一次微调即可获得不错的性能。考虑到复原人脸与真值之间的差距,我们进一步部署 MaskNet,自适应地预测不同位置的损失权重,以降低低置信区域的影响。为评估 MetaF2N,我们收集了每幅含一张或多张人脸的真实低质量数据集,MetaF2N 在合成与真实数据集上均取得了优越的性能。源代码、预训练模型与收集的数据集见 https://github.com/yinzhicun/MetaF2N。

Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

  • paper_url: http://arxiv.org/abs/2309.08097
  • repo_url: None
  • paper_authors: Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng, Xinge You
  • for: Addressing the challenge of fine-grained visual categorization with limited data.
  • methods: Proposes a novel approach called the detail reinforcement diffusion model (DRDM), which leverages rich knowledge from large models for data augmentation and includes two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR).
  • results: Demonstrates improved performance for fine-grained visual recognition tasks through effective utilization of knowledge from large models, despite limited data availability.
    Abstract The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve the objective. However, when only a limited amount of samples is available, similar methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model~(DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components including discriminative semantic recombination (DSR) and spatial knowledge reference~(SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.
    摘要 细粒度视觉分类的挑战在于如何挖掘不同子类之间的细微差异并实现精准判别。以往的研究依赖大规模标注数据和预训练深度模型来实现这一目标,但当样本数量有限时,这类方法的效果会大打折扣。扩散模型因其出色的数据生成多样性而被广泛用于数据增强,然而细粒度图像对细节的高要求使现有方法难以直接套用。为此,我们提出细节强化扩散模型(DRDM),利用大模型中的丰富知识进行细粒度数据增强,包含两个关键组件:判别性语义重组(DSR)与空间知识参照(SKR)。DSR 从标签中挖掘隐式相似关系,重建标签与实例之间的语义映射,从而更好地判别不同子类间的细微差异;SKR 在特征空间中引入不同数据集的分布作为参照,聚合小样本细粒度任务中子类特征的高维分布,扩展决策边界。通过这两个关键组件,我们有效地利用大模型的知识来缓解数据稀缺问题,从而提升细粒度视觉识别任务的性能。大量实验证明了 DRDM 带来的一致性能提升。

hear-your-action: human action recognition by ultrasound active sensing

  • paper_url: http://arxiv.org/abs/2309.08087
  • repo_url: None
  • paper_authors: Risako Tanigawa, Yasunori Ishii
  • for: 本研究旨在提出一种保护隐私的动作识别方法,避免使用包含可见人脸和场景背景等隐私信息的视觉数据来获取动作识别结果。
  • methods: 本研究使用ultrasound active sensing技术进行动作识别,并创建了一个新的数据集来支持这种方法。研究人员计算了时间变化的声波反射波的强度特征值,并使用支持向量机和VGG进行分类。
  • results: 研究人员在同一个人和同一个环境下训练和测试时,达到了97.9%的准确率。此外,研究人员还对不同人进行训练和测试,并达到了89.5%的准确率。研究人员还进行了不同条件下的准确率分析和限制。
    Abstract Action recognition is a key technology for many industrial applications. Methods using visual information such as images are very popular. However, privacy issues prevent widespread usage due to the inclusion of private information, such as visible faces and scene backgrounds, which are not necessary for recognizing user action. In this paper, we propose a privacy-preserving action recognition by ultrasound active sensing. As action recognition from ultrasound active sensing in a non-invasive manner is not well investigated, we create a new dataset for action recognition and conduct a comparison of features for classification. We calculated feature values by focusing on the temporal variation of the amplitude of ultrasound reflected waves and performed classification using a support vector machine and VGG for eight fundamental action classes. We confirmed that our method achieved an accuracy of 97.9% when trained and evaluated on the same person and in the same environment. Additionally, our method achieved an accuracy of 89.5% even when trained and evaluated on different people. We also report the analyses of accuracies in various conditions and limitations.
    摘要 动作识别是许多工业应用中的关键技术,基于图像等视觉信息的方法非常流行。然而,视觉数据中包含可见人脸和场景背景等对识别用户动作并非必需的隐私信息,隐私问题因而限制了其广泛应用。本文提出一种基于超声主动感知的隐私保护动作识别方法。由于以非侵入方式利用超声主动感知进行动作识别尚缺乏研究,我们构建了一个新的动作识别数据集,并对用于分类的特征进行了比较:聚焦超声反射波幅值的时间变化计算特征值,并使用支持向量机和 VGG 对八个基本动作类别进行分类。当在同一人、同一环境下训练和评估时,我们的方法达到了 97.9% 的准确率;即使在不同人之间训练和评估,也达到了 89.5% 的准确率。我们还报告了不同条件下的准确率分析及方法的局限性。
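下面的 Python 片段示意"反射波幅值时间变化特征 + SVM 分类"这一流程:对预先切分、长度一致的回波片段提取帧级统计特征,再用 scikit-learn 的 SVC 训练分类器。其中 echo_clips、labels 为假设的已准备数据,帧长与特征定义亦为示意,并非论文的具体特征。

```python
import numpy as np
from sklearn.svm import SVC

def amplitude_features(echo, frame=256):
    """Summarise the temporal variation of the reflected-wave amplitude with
    simple frame-level statistics (mean, std, and frame-to-frame change)."""
    env = np.abs(echo)
    frames = env[: len(env) // frame * frame].reshape(-1, frame)
    mean = frames.mean(axis=1)
    return np.concatenate([mean, frames.std(axis=1), np.diff(mean, prepend=mean[0])])

# Hypothetical training data: pre-segmented, equal-length echo clips of the
# eight action classes, with `labels` holding the action ids (0..7).
X = np.stack([amplitude_features(clip) for clip in echo_clips])
clf = SVC(kernel="rbf").fit(X, labels)
```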