cs.CV - 2023-10-28

Deep Learning-based Compressed Domain Multimedia for Man and Machine: A Taxonomy and Application to Point Cloud Classification

  • paper_url: http://arxiv.org/abs/2310.18849
  • repo_url: None
  • paper_authors: Abdelrahman Seleem, André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira
  • for: To propose a taxonomy for designing deep learning-based compressed domain computer vision solutions that serve both human visualization and machine tasks, mitigating coding artifacts and avoiding prior decoding.
  • methods: The taxonomy is driven by the architecture and weights compatibility with an available spatio-temporal computer vision processor; it is instantiated for point cloud classification by designing compressed domain processors using the JPEG Pleno Point Cloud Coding standard under development and adaptations of the PointGrid classifier.
  • results: The designed compressed domain point cloud classification solutions significantly outperform the spatial-temporal domain benchmarks applied to decompressed data containing coding artifacts, and even surpass their performance on the original uncompressed data.
    Abstract In the current golden age of multimedia, human visualization is no longer the single main target, with the final consumer often being a machine which performs some processing or computer vision tasks. In both cases, deep learning plays a fundamental role in extracting features from the multimedia representation data, usually producing a compressed representation referred to as latent representation. The increasing development and adoption of deep learning-based solutions in a wide area of multimedia applications have opened an exciting new vision where a common compressed multimedia representation is used for both man and machine. The main benefits of this vision are two-fold: i) improved performance for the computer vision tasks, since the effects of coding artifacts are mitigated; and ii) reduced computational complexity, since prior decoding is not required. This paper proposes the first taxonomy for designing compressed domain computer vision solutions driven by the architecture and weights compatibility with an available spatio-temporal computer vision processor. The potential of the proposed taxonomy is demonstrated for the specific case of point cloud classification by designing novel compressed domain processors using the JPEG Pleno Point Cloud Coding standard under development and adaptations of the PointGrid classifier. Experimental results show that the designed compressed domain point cloud classification solutions can significantly outperform the spatial-temporal domain classification benchmarks when applied to the decompressed data, containing coding artifacts, and even surpass their performance when applied to the original uncompressed data.

INCODE: Implicit Neural Conditioning with Prior Knowledge Embeddings

  • paper_url: http://arxiv.org/abs/2310.18846
  • repo_url: https://github.com/xmindflow/INCODE
  • paper_authors: Amirhossein Kazerouni, Reza Azad, Alireza Hosseini, Dorit Merhof, Ulas Bagci
  • for: To improve the accuracy and flexibility of signal representation, addressing existing INRs' limitations in capturing fine-grained details and in robustness.
  • methods: Uses deep prior knowledge to adjust key parameters of the sinusoidal activation function, with a task-specific pre-trained model adapting task-specific parameters to optimize the representation process.
  • results: Achieves higher accuracy, quality, flexibility, and convergence speed across diverse signal representation tasks, handling complex problems such as audio, image, and 3D shape reconstruction, NeRFs, and inverse problems, and outperforming existing INRs on a range of challenges.
    Abstract Implicit Neural Representations (INRs) have revolutionized signal representation by leveraging neural networks to provide continuous and smooth representations of complex data. However, existing INRs face limitations in capturing fine-grained details, handling noise, and adapting to diverse signal types. To address these challenges, we introduce INCODE, a novel approach that enhances the control of the sinusoidal-based activation function in INRs using deep prior knowledge. INCODE comprises a harmonizer network and a composer network, where the harmonizer network dynamically adjusts key parameters of the activation function. Through a task-specific pre-trained model, INCODE adapts the task-specific parameters to optimize the representation process. Our approach not only excels in representation, but also extends its prowess to tackle complex tasks such as audio, image, and 3D shape reconstructions, as well as intricate challenges such as neural radiance fields (NeRFs), and inverse problems, including denoising, super-resolution, inpainting, and CT reconstruction. Through comprehensive experiments, INCODE demonstrates its superiority in terms of robustness, accuracy, quality, and convergence rate, broadening the scope of signal representation. Please visit the project's website for details on the proposed method and access to the code.
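    A minimal PyTorch sketch of the harmonizer/composer idea described above: a sinusoidal layer whose activation is modulated by parameters predicted from a prior feature. The (a, b, c, d) parameterization, the omega0 scaling, and the harmonizer MLP are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ModulatedSineLayer(nn.Module):
    """SIREN-style layer whose activation a*sin(b*omega0*Wx + c) + d is
    controlled by externally supplied parameters (a, b, c, d)."""
    def __init__(self, in_features, out_features, omega0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.omega0 = omega0

    def forward(self, x, a, b, c, d):
        return a * torch.sin(b * self.omega0 * self.linear(x) + c) + d

class Harmonizer(nn.Module):
    """Maps a prior feature vector (e.g. from a task-specific pre-trained
    encoder) to one (a, b, c, d) tuple per composer layer."""
    def __init__(self, prior_dim, n_layers, hidden=64):
        super().__init__()
        self.n_layers = n_layers
        self.mlp = nn.Sequential(nn.Linear(prior_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4 * n_layers))

    def forward(self, prior_feat):                  # prior_feat: (prior_dim,)
        p = self.mlp(prior_feat).view(self.n_layers, 4)
        # keep parameters near their neutral values a=1, b=1, c=0, d=0
        return 1 + p[:, 0], 1 + p[:, 1], p[:, 2], p[:, 3]
```

    A composer built from stacked ModulatedSineLayer instances would receive its i-th layer's (a[i], b[i], c[i], d[i]) from the harmonizer at every forward pass.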

Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.18840
  • repo_url: https://github.com/littlewhitesea/stitchdiffusion
  • paper_authors: Hai Wang, Xiaoyu Xiang, Yuchen Fan, Jing-Hao Xue
  • for: To propose a personalized text-to-image (T2I) diffusion approach for customizing 360-degree panoramas.
  • methods: A paired image-text dataset is curated for the task and used to fine-tune a pre-trained T2I diffusion model with LoRA. Since fine-tuning alone does not guarantee continuity between the leftmost and rightmost sides of the synthesized images, a key property of 360-degree panoramas, the proposed StitchDiffusion performs pre-denoising on a stitch block built from those two regions and applies global cropping to synthesize seamless panoramas.
  • results: The customized model combined with StitchDiffusion generates high-quality 360-degree panoramas and shows exceptional generalization to scenes unseen in the fine-tuning dataset.
    Abstract Personalized text-to-image (T2I) synthesis based on diffusion models has attracted significant attention in recent research. However, existing methods primarily concentrate on customizing subjects or styles, neglecting the exploration of global geometry. In this study, we propose an approach that focuses on the customization of 360-degree panoramas, which inherently possess global geometric properties, using a T2I diffusion model. To achieve this, we curate a paired image-text dataset specifically designed for the task and subsequently employ it to fine-tune a pre-trained T2I diffusion model with LoRA. Nevertheless, the fine-tuned model alone does not ensure the continuity between the leftmost and rightmost sides of the synthesized images, a crucial characteristic of 360-degree panoramas. To address this issue, we propose a method called StitchDiffusion. Specifically, we perform pre-denoising operations twice at each time step of the denoising process on the stitch block consisting of the leftmost and rightmost image regions. Furthermore, a global cropping is adopted to synthesize seamless 360-degree panoramas. Experimental results demonstrate the effectiveness of our customized model combined with the proposed StitchDiffusion in generating high-quality 360-degree panoramic images. Moreover, our customized model exhibits exceptional generalization ability in producing scenes unseen in the fine-tuning dataset. Code is available at https://github.com/littlewhitesea/StitchDiffusion.
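    The seam-continuity idea can be illustrated at the tensor level: before the regular denoising of the full panorama latent, a stitch block formed from the rightmost and leftmost latent columns is denoised jointly. The stitch width and the single pre-denoising pass shown here are simplifying assumptions (the paper performs the pre-denoising twice per time step inside a full latent-diffusion pipeline).

```python
import torch

def stitched_denoise_step(latent, denoise_fn, stitch_width=16):
    """One denoising step with an extra pre-denoising pass over the stitch
    block built from the two panorama edges, keeping them continuous.

    latent:     (B, C, H, W) latent tensor at the current time step
    denoise_fn: callable mapping a latent tensor to its denoised estimate
    """
    left = latent[..., :stitch_width]
    right = latent[..., -stitch_width:]
    stitch = torch.cat([right, left], dim=-1)   # right edge wraps onto the left edge
    stitch = denoise_fn(stitch)                 # pre-denoise the seam region
    latent = latent.clone()
    latent[..., -stitch_width:] = stitch[..., :stitch_width]
    latent[..., :stitch_width] = stitch[..., stitch_width:]
    return denoise_fn(latent)                   # then denoise the full panorama latent
```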

UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification

  • paper_url: http://arxiv.org/abs/2310.18812
  • repo_url: None
  • paper_authors: Jennifer Crawford, Haoli Yin, Luke McDermott, Daniel Cummings
  • for: To examine multimodal fusion for Re-Identification (ReID) and build a stronger baseline for re-identifying objects across diverse data streams.
  • methods: Compares unimodal and multimodal training, including various concatenation and late-fusion strategies, and analyzes the "modality laziness" effect induced by fusion.
  • results: Training modalities in isolation often yields better representations than prevailing late fusion; unimodal concatenation (UniCat) and other late-fusion ensembling of unimodal backbones, paired with best-known training techniques, exceed the current state of the art on several multimodal ReID benchmarks.
    Abstract Multimodal Re-Identification (ReID) is a popular retrieval task that aims to re-identify objects across diverse data streams, prompting many researchers to integrate multiple modalities into a unified representation. While such fusion promises a holistic view, our investigations shed light on potential pitfalls. We uncover that prevailing late-fusion techniques often produce suboptimal latent representations when compared to methods that train modalities in isolation. We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion, what others have termed modality laziness. We present a nuanced point-of-view that this relaxation can lead to certain modalities failing to fully harness available task-relevant information, and yet, offers a protective veil to noisy modalities, preventing them from overfitting to task-irrelevant data. Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembling of unimodal backbones, when paired with best-known training techniques, exceed the current state-of-the-art performance across several multimodal ReID benchmarks. By unveiling the double-edged sword of "modality laziness", we motivate future research in balancing local modality strengths with global representations.
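    A minimal sketch of the UniCat baseline, assuming each modality has its own PyTorch backbone producing an embedding; the ReID training recipe (losses, sampling, re-ranking) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniCat(nn.Module):
    """Late-fusion ReID baseline: every modality keeps its own backbone,
    trained in isolation, and the L2-normalised unimodal embeddings are
    simply concatenated into a single retrieval descriptor."""
    def __init__(self, backbones):
        super().__init__()
        # e.g. {"rgb": resnet_rgb, "nir": resnet_nir, "tir": resnet_tir}
        self.backbones = nn.ModuleDict(backbones)

    def forward(self, inputs):
        feats = [F.normalize(net(inputs[name]), dim=-1)
                 for name, net in self.backbones.items()]
        return torch.cat(feats, dim=-1)
```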

A Review on the Applications of Machine Learning for Tinnitus Diagnosis Using EEG Signals

  • paper_url: http://arxiv.org/abs/2310.18795
  • repo_url: None
  • paper_authors: Farzaneh Ramezani, Hamidreza Bolhasani
  • for: To review the use of machine learning for identifying or predicting tinnitus patients, supporting earlier diagnosis and treatment.
  • methods: The reviewed studies combine various data modalities and machine-learning techniques, with EEG signals as input data, to identify tinnitus characteristics and classify patients.
  • results: EEG-based approaches can identify and predict tinnitus patients, but results across studies vary and sometimes conflict, so further research is needed to better characterize tinnitus and improve prediction.
    Abstract Tinnitus is a prevalent hearing disorder that can be caused by various factors such as age, hearing loss, exposure to loud noises, ear infections or tumors, certain medications, head or neck injuries, and psychological conditions like anxiety and depression. While not every patient requires medical attention, about 20% of sufferers seek clinical intervention. Early diagnosis is crucial for effective treatment. New developments have been made in tinnitus detection to aid in early detection of this illness. Over the past few years, there has been a notable growth in the usage of electroencephalography (EEG) to study variations in oscillatory brain activity related to tinnitus. However, the results obtained from numerous studies vary greatly, leading to conflicting conclusions. Currently, clinicians rely solely on their expertise to identify individuals with tinnitus. Researchers in this field have incorporated various data modalities and machine-learning techniques to aid clinicians in identifying tinnitus characteristics and classifying people with tinnitus. The purpose of writing this article is to review articles that focus on using machine learning (ML) to identify or predict tinnitus patients using EEG signals as input data. We have evaluated 11 articles published between 2016 and 2023 using a systematic literature review (SLR) method. This article provides structured summaries of all the research reviewed and compares the significant aspects of each. Additionally, we performed statistical analyses to gain a deeper comprehension of the most recent research in this area. Almost all of the reviewed articles followed a five-step procedure to achieve the goal of tinnitus detection. Finally, we discuss the open issues and challenges in this method of tinnitus recognition or prediction and suggest future directions for research.

PrObeD: Proactive Object Detection Wrapper

  • paper_url: http://arxiv.org/abs/2310.18788
  • repo_url: None
  • paper_authors: Vishal Asnani, Abhinav Kumar, Suya You, Xiaoming Liu
  • for: To improve 2D object detection performance on both generic and camouflaged images.
  • methods: PrObeD, a proactive wrapper with an encoder-decoder architecture: the encoder generates an image-dependent signal (a template) that encrypts the input images, the decoder recovers this template from the encrypted images, and the detector is then fine-tuned on the encrypted images.
  • results: Experiments on the MS-COCO, CAMO, COD10K, and NC4K datasets show improved detection performance across different detectors after applying PrObeD.
    Abstract Previous research in $2D$ object detection focuses on various tasks, including detecting objects in generic and camouflaged images. These works are regarded as passive works for object detection as they take the input image as is. However, convergence to global minima is not guaranteed to be optimal in neural networks; therefore, we argue that the trained weights in the object detector are not optimal. To rectify this problem, we propose a wrapper based on proactive schemes, PrObeD, which enhances the performance of these object detectors by learning a signal. PrObeD consists of an encoder-decoder architecture, where the encoder network generates an image-dependent signal termed templates to encrypt the input images, and the decoder recovers this template from the encrypted images. We propose that learning the optimum template results in an object detector with an improved detection performance. The template acts as a mask to the input images to highlight semantics useful for the object detector. Finetuning the object detector with these encrypted images enhances the detection performance for both generic and camouflaged images. Our experiments on MS-COCO, CAMO, COD$10$K, and NC$4$K datasets show improvement over different detectors after applying PrObeD. Our models/codes are available at https://github.com/vishal3477/Proactive-Object-Detection.
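    A schematic of the proactive wrapper, assuming the template is bounded with a sigmoid and applied multiplicatively; the exact encryption operation and the template-recovery and detection losses are only indicated, not reproduced from the paper.

```python
import torch
import torch.nn as nn

class ProactiveWrapper(nn.Module):
    """Wraps an object detector with an encoder that produces an image-dependent
    template and a decoder trained to recover that template from the encrypted
    image; the detector is then fine-tuned on the encrypted images."""
    def __init__(self, encoder, decoder, detector):
        super().__init__()
        self.encoder, self.decoder, self.detector = encoder, decoder, detector

    def forward(self, images):
        template = torch.sigmoid(self.encoder(images))  # image-dependent signal
        encrypted = images * template                   # template acts as a mask on the input
        recovered = self.decoder(encrypted)             # target of a template-recovery loss
        detections = self.detector(encrypted)           # standard detection losses apply here
        return detections, template, recovered
```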

CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data

  • paper_url: http://arxiv.org/abs/2310.18773
  • repo_url: https://github.com/atr-dbi/cityrefer
  • paper_authors: Taiki Miyanishi, Fumiya Kitamori, Shuhei Kurita, Jungdae Lee, Motoaki Kawanabe, Nakamasa Inoue
  • for: City-scale 3D point clouds are a promising way to represent detailed and complicated outdoor structures, enabling appealing applications such as user-interactive navigation of autonomous vehicles and drones; however, text annotations for outdoor scenes are scarce.
  • methods: The CityRefer dataset is introduced, containing 35k natural language descriptions of 3D objects in SensatUrban city scenes and 5k landmark labels synchronized with OpenStreetMap, all manually verified; a baseline system learns to encode language descriptions, 3D object instances, and geographical landmark information to perform visual grounding.
  • results: To the authors' knowledge, CityRefer is the largest city-level visual grounding dataset for localizing specific 3D objects.
    Abstract City-scale 3D point cloud is a promising way to express detailed and complicated outdoor structures. It encompasses both the appearance and geometry features of segmented city components, including cars, streets, and buildings, that can be utilized for attractive applications such as user-interactive navigation of autonomous vehicles and drones. However, compared to the extensive text annotations available for images and indoor scenes, the scarcity of text annotations for outdoor scenes poses a significant challenge for achieving these applications. To tackle this problem, we introduce the CityRefer dataset for city-level visual grounding. The dataset consists of 35k natural language descriptions of 3D objects appearing in SensatUrban city scenes and 5k landmarks labels synchronizing with OpenStreetMap. To ensure the quality and accuracy of the dataset, all descriptions and labels in the CityRefer dataset are manually verified. We also have developed a baseline system that can learn encoded language descriptions, 3D object instances, and geographical information about the city's landmarks to perform visual grounding on the CityRefer dataset. To the best of our knowledge, the CityRefer dataset is the largest city-level visual grounding dataset for localizing specific 3D objects.

Online Multi-view Anomaly Detection with Disentangled Product-of-Experts Modeling

  • paper_url: http://arxiv.org/abs/2310.18728
  • repo_url: https://github.com/cshaowang/dPoE
  • paper_authors: Hao Wang, Zhi-Qi Cheng, Jingdong Sun, Xin Yang, Xiao Wu, Hongyang Chen, Yan Yang
  • for: To detect anomalies in multi-view data, addressing the shortcomings of existing methods that only handle two views or type-specific anomalies, suffer from fusion disentanglement issues, and do not support online detection after deployment.
  • methods: Combines multi-view learning, disentangled representation learning, and generative modeling: a Product-of-Experts (PoE) layer for multi-view data, a Total Correction (TC) discriminator for disentangling view-common and view-specific representations, and a joint loss function wrapping up all components.
  • results: Extensive experiments on six real-world datasets show that the proposed dPoE markedly outperforms the baselines.
    Abstract Multi-view or even multi-modal data is appealing yet challenging for real-world applications. Detecting anomalies in multi-view data is a prominent recent research topic. However, most of the existing methods 1) are only suitable for two views or type-specific anomalies, 2) suffer from the issue of fusion disentanglement, and 3) do not support online detection after model deployment. To address these challenges, our main ideas in this paper are three-fold: multi-view learning, disentangled representation learning, and generative model. To this end, we propose dPoE, a novel multi-view variational autoencoder model that involves (1) a Product-of-Experts (PoE) layer in tackling multi-view data, (2) a Total Correction (TC) discriminator in disentangling view-common and view-specific representations, and (3) a joint loss function in wrapping up all components. In addition, we devise theoretical information bounds to control both view-common and view-specific representations. Extensive experiments on six real-world datasets demonstrate that the proposed dPoE outperforms baselines markedly.
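    For Gaussian per-view posteriors, a Product-of-Experts layer has a closed form: precisions add and the joint mean is the precision-weighted average of the per-view means. The sketch below includes an optional standard-normal prior expert, which is a common choice but an assumption here.

```python
import torch

def product_of_experts(mus, logvars, include_prior=True):
    """Combine per-view Gaussian posteriors q_v(z | x_v) into one joint Gaussian.

    mus, logvars: lists of (B, D) tensors, one pair per view.
    """
    mu = torch.stack(mus)                      # (V, B, D)
    logvar = torch.stack(logvars)
    if include_prior:                          # optional N(0, I) prior expert
        mu = torch.cat([mu, torch.zeros_like(mu[:1])])
        logvar = torch.cat([logvar, torch.zeros_like(logvar[:1])])
    precision = torch.exp(-logvar)             # 1 / sigma^2 per expert
    joint_var = 1.0 / precision.sum(dim=0)
    joint_mu = joint_var * (mu * precision).sum(dim=0)
    return joint_mu, torch.log(joint_var)
```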

Audio-Visual Instance Segmentation

  • paper_url: http://arxiv.org/abs/2310.18709
  • repo_url: None
  • paper_authors: Ruohao Guo, Yaru Chen, Yanyu Qi, Wenzhen Yue, Dantong Niu, Xianghua Ying
  • for: To propose a new multi-modal task, audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos.
  • methods: Constructs the first audio-visual instance segmentation benchmark (AVISeg) and presents a simple baseline that adds an audio branch and a cross-modal fusion module to Mask2Former to locate all sounding objects.
  • results: The method is evaluated with two backbones on AVISeg and performs well; the authors believe AVIS will inspire the community towards a more comprehensive multi-modal understanding.
    Abstract In this paper, we propose a new multi-modal task, namely audio-visual instance segmentation (AVIS), in which the goal is to identify, segment, and track individual sounding object instances in audible videos, simultaneously. To our knowledge, it is the first time that instance segmentation has been extended into the audio-visual domain. To better facilitate this research, we construct the first audio-visual instance segmentation benchmark (AVISeg). Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 seconds from YouTube and public audio-visual datasets, where 117 videos have been annotated by using an interactive semi-automatic labeling tool based on the Segment Anything Model (SAM). In addition, we present a simple baseline model for the AVIS task. Our new model introduces an audio branch and a cross-modal fusion module to Mask2Former to locate all sounding objects. Finally, we evaluate the proposed method using two backbones on AVISeg. We believe that AVIS will inspire the community towards a more comprehensive multi-modal understanding.

Triplet Attention Transformer for Spatiotemporal Predictive Learning

  • paper_url: http://arxiv.org/abs/2310.18698
  • repo_url: None
  • paper_authors: Xuesong Nie, Xi Chen, Haoyuan Jin, Zhihang Zhu, Yunfeng Yan, Donglian Qi
  • for: To predict future sequences from historical sequences, improving prediction quality while maintaining computational efficiency.
  • methods: A triplet attention transformer whose Triplet Attention Module (TAM) replaces traditional recurrent units, capturing both inter-frame dynamics and intra-frame static features via self-attention in the temporal, spatial, and channel dimensions.
  • results: Achieves state-of-the-art performance across scenarios including moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture, surpassing existing recurrent-based and recurrent-free methods.
    Abstract Spatiotemporal predictive learning offers a self-supervised learning paradigm that enables models to learn both spatial and temporal patterns by predicting future sequences based on historical sequences. Mainstream methods are dominated by recurrent units, yet they are limited by their lack of parallelization and often underperform in real-world scenarios. To improve prediction quality while maintaining computational efficiency, we propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features. Specifically, the model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in temporal, spatial, and channel dimensions. In this configuration: (i) temporal tokens contain abstract representations of inter-frame, facilitating the capture of inherent temporal dependencies; (ii) spatial and channel attention combine to refine the intra-frame representation by performing fine-grained interactions across spatial and channel dimensions. Alternating temporal, spatial, and channel-level attention allows our approach to learn more complex short- and long-range spatiotemporal dependencies. Extensive experiments demonstrate performance surpassing existing recurrent-based and recurrent-free methods, achieving state-of-the-art under multi-scenario examination including moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture.
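    A bare-bones sketch of attention applied in turn along the temporal, spatial, and channel axes of a (B, T, C, H, W) feature tensor. Layer norms, MLPs, and the exact TAM block design are omitted, and the head count must divide both the channel and spatial embedding sizes; these details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TripletAttention(nn.Module):
    """Self-attention applied sequentially along the temporal, spatial, and
    channel axes of a (B, T, C, H, W) feature tensor, with residual adds.
    `heads` must divide both `channels` and `spatial_size` (= H * W)."""
    def __init__(self, channels, spatial_size, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.channel = nn.MultiheadAttention(spatial_size, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        # temporal tokens: one token per frame at every spatial position
        t = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        t = self.temporal(t, t, t)[0].reshape(B, H, W, T, C)
        x = t.permute(0, 3, 4, 1, 2) + x
        # spatial tokens: one token per pixel within every frame
        s = x.permute(0, 1, 3, 4, 2).reshape(B * T, H * W, C)
        s = self.spatial(s, s, s)[0].reshape(B, T, H, W, C)
        x = s.permute(0, 1, 4, 2, 3) + x
        # channel tokens: one token per channel, embedded by its H*W response map
        c = x.reshape(B * T, C, H * W)
        c = self.channel(c, c, c)[0].reshape(B, T, C, H, W)
        return c + x
```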

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision

  • paper_url: http://arxiv.org/abs/2310.18689
  • repo_url: None
  • paper_authors: Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorgpour, Amirhossein Kazerouni, Islem Rekik, Dorit Merhof
  • for: This paper provides a comprehensive overview of foundation models in the domain of medical imaging, with a focus on their applications, opportunities, and future directions.
  • methods: The paper classifies foundation models within the medical domain based on training strategies, imaging modalities, specific organs of interest, and algorithms integral to these models.
  • results: The paper discusses the practical use cases of some selected approaches and addresses the challenges and research pathways associated with foundational models in medical imaging, including interpretability, data management, computational requirements, and contextual comprehension.
    Abstract Foundation models, large-scale, pre-trained deep-learning models adapted to a wide range of downstream tasks, have gained significant interest lately in various deep-learning problems undergoing a paradigm shift with the rise of these models. Trained on large-scale datasets to bridge the gap between different modalities, foundation models facilitate contextual reasoning, generalization, and prompt capabilities at test time. The predictions of these models can be adjusted for new tasks by augmenting the model input with task-specific hints called prompts without requiring extensive labeled data and retraining. Capitalizing on the advances in computer vision, medical imaging has also marked a growing interest in these models. To assist researchers in navigating this direction, this survey intends to provide a comprehensive overview of foundation models in the domain of medical imaging. Specifically, we initiate our exploration by providing an exposition of the fundamental concepts forming the basis of foundation models. Subsequently, we offer a methodical taxonomy of foundation models within the medical domain, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models. Furthermore, we emphasize the practical use case of some selected approaches and then discuss the opportunities, applications, and future directions of these large-scale pre-trained models for analyzing medical images. In the same vein, we address the prevailing challenges and research pathways associated with foundational models in medical imaging. These encompass the areas of interpretability, data management, computational requirements, and the nuanced issue of contextual comprehension.

Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

  • paper_url: http://arxiv.org/abs/2310.18676
  • repo_url: None
  • paper_authors: Pourya Shamsolmoali, Jocelyn Chanussot, Huiyu Zhou, Yue Lu
  • for: Efficient object detection in optical remote sensing imagery, using knowledge distillation (KD) to obtain lightweight models while maintaining accuracy.
  • methods: Attention-based Feature Distillation (AFD) distills both local and global information from the teacher detector: a multi-instance attention mechanism distinguishes background from foreground elements, and attention global distillation reconstructs the relationships between pixels and passes them from teacher to student.
  • results: Experiments on two public aerial image benchmarks show that AFD attains the performance of other state-of-the-art models while remaining efficient.
    Abstract Efficient object detection methods have recently received great attention in remote sensing. Although deep convolutional networks often have excellent detection accuracy, their deployment on resource-limited edge devices is difficult. Knowledge distillation (KD) is a strategy for addressing this issue since it makes models lightweight while maintaining accuracy. However, existing KD methods for object detection have encountered two constraints. First, they discard potentially important background information and only distill nearby foreground regions. Second, they only rely on the global context, which limits the student detector's ability to acquire local information from the teacher detector. To address the aforementioned challenges, we propose Attention-based Feature Distillation (AFD), a new KD approach that distills both local and global information from the teacher detector. To enhance local distillation, we introduce a multi-instance attention mechanism that effectively distinguishes between background and foreground elements. This approach prompts the student detector to focus on the pertinent channels and pixels, as identified by the teacher detector. Local distillation lacks global information, thus attention global distillation is proposed to reconstruct the relationship between various pixels and pass it from teacher to student detector. The performance of AFD is evaluated on two public aerial image benchmarks, and the evaluation results demonstrate that AFD in object detection can attain the performance of other state-of-the-art models while being efficient.
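    A simplified sketch of attention-guided feature distillation: a teacher-derived spatial attention map re-weights the per-pixel feature error so informative, foreground-like regions dominate the loss. The softmax/temperature weighting is an assumption; the paper's multi-instance attention and global relation distillation are more elaborate.

```python
import torch.nn.functional as F

def attention_guided_distill_loss(student_feat, teacher_feat, temperature=0.5):
    """Feature-distillation loss weighted by a teacher-derived spatial
    attention map; both features are (B, C, H, W) tensors."""
    attn = teacher_feat.abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    B, _, H, W = attn.shape
    attn = F.softmax(attn.view(B, -1) / temperature, dim=-1)
    attn = attn.view(B, 1, H, W) * (H * W)                # average weight ~ 1
    return (attn * (student_feat - teacher_feat) ** 2).mean()
```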

Foundation Models for Generalist Geospatial Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2310.18660
  • repo_url: None
  • paper_authors: Johannes Jakubik, Sujit Roy, C. E. Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, Daiki Kimura, Naomi Simumba, Linsong Chu, S. Karthik Mukkavilli, Devyani Lambhate, Kamal Das, Ranjini Bangalore, Dario Oliveira, Michal Muszynski, Kumar Ankur, Muthukumaran Ramasubramanian, Iksha Gurung, Sam Khallaghi, Hanxi Li, Michael Cecil, Maryam Ahmadi, Fatemeh Kordi, Hamed Alemohammad, Manil Maskey, Raghu Ganti, Kommy Weldemariam, Rahul Ramachandran
  • for: To develop highly adaptable and reusable artificial intelligence (AI) foundation models with significant impact on Earth science and remote sensing.
  • methods: Foundation models are pre-trained on large unlabeled geospatial data through self-supervision and then fine-tuned for downstream tasks with small labeled datasets.
  • results: The proposed framework efficiently pre-trains and fine-tunes a geospatial foundation model that performs well on multiple Earth observation tasks, for example those based on multispectral satellite imagery.
    Abstract Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.

Med-DANet V2: A Flexible Dynamic Architecture for Efficient Medical Volumetric Segmentation

  • paper_url: http://arxiv.org/abs/2310.18656
  • repo_url: None
  • paper_authors: Haoran Shen, Yifu Zhang, Wenxuan Wang, Chen Chen, Jing Liu, Shanshan Song, Jiangyun Li
  • for: To improve the computational efficiency of 3D medical image segmentation.
  • methods: Dynamic inference based on slice-wise complexity: for each slice, a Decision Network and Crop Position Network select an important foreground region for segmentation, and a stage-wise quantization selector is inserted into the segmentation model (e.g. U-Net) for dynamic architecture adaptation.
  • results: Experiments on BraTS 2019 and 2020 achieve comparable or better performance than previous state-of-the-art methods with much less model complexity; compared with Med-DANet and TransBTS, the framework improves model efficiency by up to nearly 4.1 and 17.3 times with comparable segmentation results on BraTS 2019.
    Abstract Recent works have shown that the computational efficiency of 3D medical image (e.g. CT and MRI) segmentation can be impressively improved by dynamic inference based on slice-wise complexity. As a pioneering work, a dynamic architecture network for medical volumetric segmentation (i.e. Med-DANet) has achieved a favorable accuracy and efficiency trade-off by dynamically selecting a suitable 2D candidate model from the pre-defined model bank for different slices. However, the issues of incomplete data analysis, high training costs, and the two-stage pipeline in Med-DANet require further improvement. To this end, this paper further explores a unified formulation of the dynamic inference framework from the perspective of both the data itself and the model structure. For each slice of the input volume, our proposed method dynamically selects an important foreground region for segmentation based on the policy generated by our Decision Network and Crop Position Network. Besides, we propose to insert a stage-wise quantization selector to the employed segmentation model (e.g. U-Net) for dynamic architecture adapting. Extensive experiments on BraTS 2019 and 2020 show that our method achieves comparable or better performance than previous state-of-the-art methods with much less model complexity. Compared with previous methods Med-DANet and TransBTS with dynamic and static architecture respectively, our framework improves the model efficiency by up to nearly 4.1 and 17.3 times with comparable segmentation results on BraTS 2019.

Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing

  • paper_url: http://arxiv.org/abs/2310.18653
  • repo_url: https://github.com/zhu-xlab/fgmae
  • paper_authors: Yi Wang, Hugo Hernández Hernández, Conrad M Albrecht, Xiao Xiang Zhu
  • for: To study self-supervised pre-training of vision transformers for remote sensing with masked image modelling.
  • methods: Uses a Masked AutoEncoder (MAE) with spectral and spatial remote sensing image features as improved reconstruction targets: a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and HOG for SAR images.
  • results: Experiments on three downstream tasks show that the Feature Guided Masked Autoencoder (FG-MAE) improves semantic understanding of multispectral and SAR imagery, with a particular boost for SAR, and inherits MAE's scalability.
    Abstract Self-supervised learning guided by masked image modelling, such as Masked AutoEncoder (MAE), has attracted wide attention for pretraining vision transformers in remote sensing. However, MAE tends to excessively focus on pixel details, thereby limiting the model's capacity for semantic understanding, in particular for noisy SAR images. In this paper, we explore spectral and spatial remote sensing image features as improved MAE-reconstruction targets. We first conduct a study on reconstructing various image features, all performing comparably well or better than raw pixels. Based on such observations, we propose Feature Guided Masked Autoencoder (FG-MAE): reconstructing a combination of Histograms of Oriented Graidents (HOG) and Normalized Difference Indices (NDI) for multispectral images, and reconstructing HOG for SAR images. Experimental results on three downstream tasks illustrate the effectiveness of FG-MAE with a particular boost for SAR imagery. Furthermore, we demonstrate the well-inherited scalability of FG-MAE and release a first series of pretrained vision transformers for medium resolution SAR and multispectral images.
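    A sketch of building per-patch HOG reconstruction targets with scikit-image for a single-band (e.g. SAR) image; the HOG hyper-parameters, the 16-pixel patch size, and the masking/loss wiring are assumptions, and the multispectral variant would add NDI features as additional targets.

```python
import numpy as np
import torch
from skimage.feature import hog

def hog_targets(images, patch=16):
    """Per-patch HOG descriptors used as MAE reconstruction targets in place of
    raw pixels. `images` is a sequence of single-band (H, W) numpy arrays."""
    all_targets = []
    for img in images:
        H, W = img.shape
        feats = []
        for i in range(0, H, patch):
            for j in range(0, W, patch):
                feats.append(hog(img[i:i + patch, j:j + patch],
                                 orientations=9, pixels_per_cell=(8, 8),
                                 cells_per_block=(1, 1), feature_vector=True))
        all_targets.append(np.stack(feats))
    return torch.tensor(np.stack(all_targets), dtype=torch.float32)  # (B, N, D)

# The MAE objective is then an MSE between the decoder output and these targets,
# computed only on the masked patches.
```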

Local-Global Self-Supervised Visual Representation Learning

  • paper_url: http://arxiv.org/abs/2310.18651
  • repo_url: https://github.com/alijavidani/local_global_representation_learning
  • paper_authors: Ali Javidani, Mohammad Amin Sadeghi, Babak Nadjar Araabi
  • for: To investigate incorporating patch-level feature learning into existing self-supervised methods to improve the quality of the learned visual representations.
  • methods: A simple yet effective patch-matching algorithm finds corresponding patches across augmented views of an image; the augmented views are fed into a self-supervised framework with a Vision Transformer (ViT) backbone, producing both image-level and patch-level representations whose distances are jointly minimized.
  • results: Pre-trained on small, medium, and large-scale datasets, the method outperforms state-of-the-art image-level representation learning methods on image classification and downstream tasks.
    Abstract Self-supervised representation learning methods mainly focus on image-level instance discrimination. This study explores the potential benefits of incorporating patch-level discrimination into existing methods to enhance the quality of learned representations by simultaneously looking at local and global visual features. Towards this idea, we present a straightforward yet effective patch-matching algorithm that can find the corresponding patches across the augmented views of an image. The augmented views are subsequently fed into a self-supervised learning framework employing Vision Transformer (ViT) as its backbone. The result is the generation of both image-level and patch-level representations. Leveraging the proposed patch-matching algorithm, the model minimizes the representation distance between not only the CLS tokens but also the corresponding patches. As a result, the model gains a more comprehensive understanding of both the entirety of the image as well as its finer details. We pretrain the proposed method on small, medium, and large-scale datasets. It is shown that our approach could outperform state-of-the-art image-level representation learning methods on both image classification and downstream tasks. Keywords: Self-Supervised Learning; Visual Representations; Local-Global Representation Learning; Patch-Wise Representation Learning; Vision Transformer (ViT)
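    One way to realize the patch-matching step, assuming the crop boxes of the two augmented views are known in original-image coordinates and each view is tokenized on a regular grid; handling flips and other geometric augmentations is left out.

```python
def match_patches(box_a, box_b, grid=14):
    """Return (i, j) pairs of patch indices from view A and view B whose patch
    centres come from (approximately) the same original-image location.

    box_a, box_b: crop boxes (x0, y0, x1, y1) of the two augmented views,
                  expressed in original-image coordinates.
    grid:         number of patches per side after tokenization.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    pairs = []
    for i in range(grid * grid):
        r, c = divmod(i, grid)
        # centre of patch i of view A, mapped back to original-image coordinates
        x = ax0 + (c + 0.5) / grid * (ax1 - ax0)
        y = ay0 + (r + 0.5) / grid * (ay1 - ay0)
        if bx0 <= x < bx1 and by0 <= y < by1:
            bc = int((x - bx0) / (bx1 - bx0) * grid)   # nearest patch of view B
            br = int((y - by0) / (by1 - by0) * grid)
            pairs.append((i, br * grid + bc))
    return pairs
```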

Switching Temporary Teachers for Semi-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.18640
  • repo_url: https://github.com/naver-ai/dual-teacher
  • paper_authors: Jaemin Na, Jung-Woo Ha, Hyung Jin Chang, Dongyoon Han, Wonjun Hwang
  • for: To improve semi-supervised semantic segmentation by alleviating the coupling between teacher and student models.
  • methods: A dual temporary teacher approach: the two teachers work in shifts, periodically taking turns generating pseudo-labels to train the student, which keeps the teacher and student from becoming excessively close.
  • results: Achieves competitive performance on the PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter training times than state-of-the-art methods; the approach is model-agnostic and compatible with both CNN- and Transformer-based models.
    Abstract The teacher-student framework, prevalent in semi-supervised semantic segmentation, mainly employs the exponential moving average (EMA) to update a single teacher's weights based on the student's. However, EMA updates raise a problem in that the weights of the teacher and student are getting coupled, causing a potential performance bottleneck. Furthermore, this problem may become more severe when training with more complicated labels such as segmentation masks but with few annotated data. This paper introduces Dual Teacher, a simple yet effective approach that employs dual temporary teachers aiming to alleviate the coupling problem for the student. The temporary teachers work in shifts and are progressively improved, so consistently prevent the teacher and student from becoming excessively close. Specifically, the temporary teachers periodically take turns generating pseudo-labels to train a student model and maintain the distinct characteristics of the student model for each epoch. Consequently, Dual Teacher achieves competitive performance on the PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter training times than state-of-the-art methods. Moreover, we demonstrate that our approach is model-agnostic and compatible with both CNN- and Transformer-based models. Code is available at \url{https://github.com/naver-ai/dual-teacher}.
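    A compact sketch of the shift mechanism: two frozen copies of the student act as temporary teachers, and only the active one produces pseudo-labels and is updated from the student by EMA during its shift. The momentum value and the per-epoch switching granularity are assumptions.

```python
import copy
import torch

class DualTemporaryTeachers:
    """Two temporary teachers that take turns, one per epoch, generating
    pseudo-labels and being updated from the student by EMA."""
    def __init__(self, student, momentum=0.99):
        self.teachers = [copy.deepcopy(student), copy.deepcopy(student)]
        for teacher in self.teachers:
            for p in teacher.parameters():
                p.requires_grad_(False)
        self.momentum = momentum

    def active(self, epoch):
        return self.teachers[epoch % 2]          # teachers work in shifts

    @torch.no_grad()
    def pseudo_labels(self, images, epoch):
        return self.active(epoch)(images).argmax(dim=1)

    @torch.no_grad()
    def ema_update(self, student, epoch):
        teacher = self.active(epoch)
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(self.momentum).add_(s.detach(), alpha=1.0 - self.momentum)
```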

Towards Plastic and Stable Exemplar-Free Incremental Learning: A Dual-Learner Framework with Cumulative Parameter Averaging

  • paper_url: http://arxiv.org/abs/2310.18639
  • repo_url: None
  • paper_authors: Wenju Sun, Qingyong Li, Wen Wang, Yangli-ao Geng
  • for: To address the plasticity-stability dilemma in incremental learning, especially in the exemplar-free setting where old-task samples cannot be accessed while learning a new task.
  • methods: A Dual-Learner framework with Cumulative Parameter Averaging (DLCPA): a plastic learner acquires new-task knowledge (aided by a self-supervised loss), a stable learner accumulates all learned knowledge via cumulative parameter averaging, and task-specific classifiers are optimized to align with the stable learner.
  • results: Experiments on CIFAR-100 and Tiny-ImageNet show that DLCPA outperforms several state-of-the-art exemplar-free baselines in both Task-IL and Class-IL settings.
    Abstract The dilemma between plasticity and stability presents a significant challenge in Incremental Learning (IL), especially in the exemplar-free scenario where accessing old-task samples is strictly prohibited during the learning of a new task. A straightforward solution to this issue is learning and storing an independent model for each task, known as Single Task Learning (STL). Despite the linear growth in model storage with the number of tasks in STL, we empirically discover that averaging these model parameters can potentially preserve knowledge across all tasks. Inspired by this observation, we propose a Dual-Learner framework with Cumulative Parameter Averaging (DLCPA). DLCPA employs a dual-learner design: a plastic learner focused on acquiring new-task knowledge and a stable learner responsible for accumulating all learned knowledge. The knowledge from the plastic learner is transferred to the stable learner via cumulative parameter averaging. Additionally, several task-specific classifiers work in cooperation with the stable learner to yield the final prediction. Specifically, when learning a new task, these modules are updated in a cyclic manner: i) the plastic learner is initially optimized using a self-supervised loss besides the supervised loss to enhance the feature extraction robustness; ii) the stable learner is then updated with respect to the plastic learner in a cumulative parameter averaging manner to maintain its task-wise generalization; iii) the task-specific classifier is accordingly optimized to align with the stable learner. Experimental results on CIFAR-100 and Tiny-ImageNet show that DLCPA outperforms several state-of-the-art exemplar-free baselines in both Task-IL and Class-IL settings.
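    The cumulative parameter averaging step has a simple closed form, sketched below; the plastic learner's self-supervised training and the task-specific classifiers are omitted.

```python
import torch

@torch.no_grad()
def cumulative_average_update(stable, plastic, tasks_seen):
    """After finishing task number `tasks_seen` (1-indexed), fold the plastic
    learner's weights into the stable learner as a running average over all
    tasks: stable <- (t - 1)/t * stable + 1/t * plastic."""
    t = float(tasks_seen)
    for s_param, p_param in zip(stable.parameters(), plastic.parameters()):
        s_param.mul_((t - 1.0) / t).add_(p_param, alpha=1.0 / t)
```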

ODM3D: Alleviating Foreground Sparsity for Enhanced Semi-Supervised Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.18620
  • repo_url: None
  • paper_authors: Weijia Zhang, Dongnan Liu, Chao Ma, Weidong Cai
  • for: To improve monocular 3D object detection (M3OD) for autonomous driving.
  • methods: Semi-supervised learning that injects LiDAR-domain knowledge into a monocular detector via cross-modal knowledge distillation at various levels, using a BEV occupancy guidance mask to address foreground sparsity and a cross-modal object-wise data augmentation strategy for RGB-LiDAR joint learning.
  • results: Ranks 1st on both the KITTI validation and test benchmarks in BEV and 3D detection metrics, significantly surpassing all existing monocular methods, supervised or semi-supervised.
    Abstract Monocular 3D object detection (M3OD) is a significant yet inherently challenging task in autonomous driving due to absence of implicit depth cues in a single RGB image. In this paper, we strive to boost currently underperforming monocular 3D object detectors by leveraging an abundance of unlabelled data via semi-supervised learning. Our proposed ODM3D framework entails cross-modal knowledge distillation at various levels to inject LiDAR-domain knowledge into a monocular detector during training. By identifying foreground sparsity as the main culprit behind existing methods' suboptimal training, we exploit the precise localisation information embedded in LiDAR points to enable more foreground-attentive and efficient distillation via the proposed BEV occupancy guidance mask, leading to notably improved knowledge transfer and M3OD performance. Besides, motivated by insights into why existing cross-modal GT-sampling techniques fail on our task at hand, we further design a novel cross-modal object-wise data augmentation strategy for effective RGB-LiDAR joint learning. Our method ranks 1st in both KITTI validation and test benchmarks, significantly surpassing all existing monocular methods, supervised or semi-supervised, on both BEV and 3D detection metrics.

Domain Generalisation via Risk Distribution Matching

  • paper_url: http://arxiv.org/abs/2310.18598
  • repo_url: https://github.com/nktoan/risk-distribution-matching
  • paper_authors: Toan Nguyen, Kien Do, Bao Duong, Thin Nguyen
  • for: To address domain generalisation (DG) by characterising domains with their risk distributions and thereby achieving domain invariance.
  • methods: Risk Distribution Matching (RDM) minimises the divergence between risk distributions across training domains using the maximum mean discrepancy (MMD) distance; for efficiency, only the worst-case domain's risk distribution is aligned with the aggregated distribution from all domains, requiring a single MMD computation.
  • results: Extensive experiments on standard benchmark datasets show that RDM generalises better than state-of-the-art DG methods while being computationally more efficient.
    Abstract We propose a novel approach for domain generalisation (DG) leveraging risk distributions to characterise domains, thereby achieving domain invariance. In our findings, risk distributions effectively highlight differences between training domains and reveal their inherent complexities. In testing, we may observe similar, or potentially intensifying in magnitude, divergences between risk distributions. Hence, we propose a compelling proposition: Minimising the divergences between risk distributions across training domains leads to robust invariance for DG. The key rationale behind this concept is that a model, trained on domain-invariant or stable features, may consistently produce similar risk distributions across various domains. Building upon this idea, we propose Risk Distribution Matching (RDM). Using the maximum mean discrepancy (MMD) distance, RDM aims to minimise the variance of risk distributions across training domains. However, when the number of domains increases, the direct optimisation of variance leads to linear growth in MMD computations, resulting in inefficiency. Instead, we propose an approximation that requires only one MMD computation, by aligning just two distributions: that of the worst-case domain and the aggregated distribution from all domains. Notably, this method empirically outperforms optimising distributional variance while being computationally more efficient. Unlike conventional DG matching algorithms, RDM stands out for its enhanced efficacy by concentrating on scalar risk distributions, sidestepping the pitfalls of high-dimensional challenges seen in feature or gradient matching. Our extensive experiments on standard benchmark datasets demonstrate that RDM shows superior generalisation capability over state-of-the-art DG methods.
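    Because risks are scalars (per-sample losses), the matching term reduces to an MMD between one-dimensional samples. A sketch with an RBF kernel follows; the kernel choice and bandwidth are assumptions.

```python
import torch

def mmd_rbf(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of scalar risks."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    k = lambda a, b: torch.exp(-(a - b.t()) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def rdm_penalty(domain_risks):
    """Align the risk distribution of the worst-case (highest mean risk) domain
    with the risks pooled over all training domains, so only one MMD
    computation is needed per step."""
    worst = max(domain_risks, key=lambda r: r.mean())
    pooled = torch.cat([r.reshape(-1) for r in domain_risks])
    return mmd_rbf(worst, pooled)
```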

This Looks Like Those: Illuminating Prototypical Concepts Using Multiple Visualizations

  • paper_url: http://arxiv.org/abs/2310.18589
  • repo_url: https://github.com/henrymachiyu/this-looks-like-those_protoconcepts
  • paper_authors: Chiyu Ma, Brandon Zhao, Chaofan Chen, Cynthia Rudin
  • for: To propose an interpretable image classification method combining deep learning and case-based reasoning.
  • methods: Learns prototypical concepts that are visualized with multiple image patches and uses them for interpretable "this looks like those" classification.
  • results: Experiments show the approach can be applied as a modification to a wide range of existing prototype-based image classification networks while achieving comparable accuracy on standard benchmark datasets.
    Abstract We present ProtoConcepts, a method for interpretable image classification combining deep learning and case-based reasoning using prototypical parts. Existing work in prototype-based image classification uses a ``this looks like that'' reasoning process, which dissects a test image by finding prototypical parts and combining evidence from these prototypes to make a final classification. However, all of the existing prototypical part-based image classifiers provide only one-to-one comparisons, where a single training image patch serves as a prototype to compare with a part of our test image. With these single-image comparisons, it can often be difficult to identify the underlying concept being compared (e.g., ``is it comparing the color or the shape?''). Our proposed method modifies the architecture of prototype-based networks to instead learn prototypical concepts which are visualized using multiple image patches. Having multiple visualizations of the same prototype allows us to more easily identify the concept captured by that prototype (e.g., ``the test image and the related training patches are all the same shade of blue''), and allows our model to create richer, more interpretable visual explanations. Our experiments show that our ``this looks like those'' reasoning process can be applied as a modification to a wide range of existing prototypical image classification networks while achieving comparable accuracy on benchmark datasets.

Self-Supervised Multi-Modality Learning for Multi-Label Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2310.18583
  • repo_url: https://github.com/dylan-h-wang/skin-sm3
  • paper_authors: Hao Wang, Euijoon Ahn, Lei Bi, Jinman Kim
  • for: 该研究旨在利用自监督学习和多模态特征,提高多标签皮肤病变分类的准确性。
  • methods: 该算法通过最大化配对的皮肤镜(dermoscopic)图像与临床图像之间的相似性实现多模态学习,并通过聚类分析生成代表七个属性的替代伪多标签(surrogate pseudo-multi-labels),同时引入标签关系感知模块细化伪标签嵌入。
  • results: 在七点(seven-point)皮肤病变基准数据集上的实验结果表明,该算法优于其他最先进的自监督学习方法,并能够准确地识别多种皮肤病变属性。
    Abstract The clinical diagnosis of skin lesions involves the analysis of dermoscopic and clinical modalities. Dermoscopic images provide a detailed view of the surface structures, whereas clinical images offer complementary macroscopic information. The visual diagnosis of melanoma is also based on the seven-point checklist, which involves identifying different visual attributes. Recently, supervised learning approaches such as convolutional neural networks (CNNs) have shown strong performance using both dermoscopic and clinical modalities (multi-modality). The seven visual attributes in the checklist are also used to further improve the diagnosis. The performance of these approaches, however, still relies on the availability of large-scale labeled data. Acquiring an annotated dataset is an expensive and time-consuming task, even more so when annotating multiple attributes. To overcome this limitation, we propose a self-supervised learning (SSL) algorithm for multi-modality skin lesion classification. Our algorithm enables multi-modality learning by maximizing the similarities between paired dermoscopic and clinical images from different views. In addition, we generate surrogate pseudo-multi-labels that represent the seven attributes via clustering analysis. We also propose a label-relation-aware module to refine each pseudo-label embedding and capture the interrelationships between pseudo-multi-labels. We validated the effectiveness of our algorithm on the well-benchmarked seven-point skin lesion dataset. Our results show that our algorithm achieves better performance than other state-of-the-art SSL counterparts.
    摘要 皮肤病变的临床诊断需要同时分析皮肤镜(dermoscopic)和临床两种模态。皮肤镜图像提供表面结构的细致视图,而临床图像则提供互补的宏观信息。黑色素瘤的视觉诊断还依赖七点检查表,需要识别多种视觉属性。近来,卷积神经网络(CNN)等监督学习方法在同时利用皮肤镜与临床模态(多模态)时表现出色,七种视觉属性也被用来进一步改进诊断。然而,这些方法的性能仍依赖于大规模标注数据;获取标注数据集昂贵且耗时,标注多个属性时尤甚。为克服这一限制,我们提出了一种用于多模态皮肤病变分类的自监督学习(SSL)算法。该算法通过最大化来自不同视角的配对皮肤镜图像与临床图像之间的相似性来实现多模态学习。此外,我们通过聚类分析生成代表七个属性的替代伪多标签,并提出标签关系感知模块,用于细化每个伪标签嵌入并捕捉伪多标签之间的相互关系。我们在公认的七点皮肤病变基准数据集上验证了算法的有效性,结果显示其性能优于其他最先进的SSL方法。
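The paired-modality similarity maximization can be pictured as a symmetric contrastive objective between dermoscopic and clinical embeddings, as in this hedged sketch; the InfoNCE form, temperature, and projection shapes are assumptions rather than the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def paired_modality_loss(derm_emb, clin_emb, temperature=0.1):
    # derm_emb, clin_emb: (B, D) projected embeddings of paired dermoscopic and clinical
    # images, aligned by row (same lesion). Pulls paired embeddings together and pushes
    # non-paired ones apart, symmetrically over both modalities.
    derm = F.normalize(derm_emb, dim=1)
    clin = F.normalize(clin_emb, dim=1)
    logits = derm @ clin.T / temperature                  # (B, B) cross-modality similarities
    targets = torch.arange(derm.size(0), device=derm.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

The surrogate pseudo-multi-labels would then be obtained by clustering such embeddings once per checklist attribute, with the label-relation-aware module refining the resulting pseudo-label embeddings.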

MultiScale Spectral-Spatial Convolutional Transformer for Hyperspectral Image Classification

  • paper_url: http://arxiv.org/abs/2310.18550
  • repo_url: None
  • paper_authors: Zhiqiang Gong, Xian Zhou, Wen Yao
  • for: 该论文面向高光谱图像分类,提出了一种能够同时捕捉光谱和空间信息的新架构 MultiscaleFormer。
  • methods: 该方法以多尺度空间块作为 token 构建空间 Transformer,生成每个像素的多尺度光谱-空间表示,并使用改进的光谱-空间 CAF 模块融合跨层的光谱与空间信息。
  • results: 在常用的真实数据集上,该方法的高光谱图像分类性能优于其他架构。
    Abstract Due to its powerful ability to capture global information, the Transformer has become an alternative to CNNs for hyperspectral image classification. However, the general Transformer mainly considers global spectral information while ignoring the multiscale spatial information of the hyperspectral image. In this paper, we propose a multiscale spectral-spatial convolutional Transformer (MultiscaleFormer) for hyperspectral image classification. First, the developed method utilizes multiscale spatial patches as tokens to formulate the spatial Transformer and generates a multiscale spatial representation of each band in each pixel. Second, the spatial representations of all the bands in a given pixel are utilized as tokens to formulate the spectral Transformer and generate the multiscale spectral-spatial representation of each pixel. Besides, a modified spectral-spatial CAF module is constructed in the MultiFormer to fuse cross-layer spectral and spatial information. Therefore, the proposed MultiFormer can capture multiscale spectral-spatial information and provides better performance than most other architectures for hyperspectral image classification. Experiments are conducted on commonly used real-world datasets, and the comparison results show the superiority of the proposed method.
    摘要 由于Transformer具有捕捉全局信息的强大能力,它已成为高光谱图像分类中CNN的替代架构。然而,一般的Transformer主要考虑全局光谱信息,而忽略了高光谱图像的多尺度空间信息。本文提出了一种用于高光谱图像分类的多尺度光谱-空间卷积Transformer(MultiscaleFormer)。首先,该方法以多尺度空间块作为token构建空间Transformer,为每个像素的每个波段生成多尺度空间表示;其次,将同一像素中所有波段的空间表示作为token构建光谱Transformer,生成该像素的多尺度光谱-空间表示。此外,MultiFormer中还构建了改进的光谱-空间CAF模块,以融合跨层的光谱与空间信息。因此,所提出的MultiFormer能够捕捉多尺度光谱-空间信息,在高光谱图像分类任务上优于大多数其他架构。我们在常用的真实数据集上进行了实验,比较结果表明了该方法的优越性。
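To make the multiscale tokenization concrete, the sketch below builds tokens for one spectral band from local context at several spatial scales and passes them through a small Transformer encoder; the scales, pooling scheme, and dimensions are illustrative assumptions, not the exact MultiscaleFormer design:

```python
import torch
import torch.nn as nn

class MultiscaleSpatialTokens(nn.Module):
    # Sketch: per-pixel tokens from local spatial context at several scales of one band.
    def __init__(self, scales=(3, 5, 7), dim=64):
        super().__init__()
        self.scales = scales
        self.pools = nn.ModuleList([nn.AvgPool2d(s, stride=1, padding=s // 2) for s in scales])
        self.projs = nn.ModuleList([nn.Linear(1, dim) for _ in scales])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, band):                              # band: (B, 1, H, W), one spectral band
        B, _, H, W = band.shape
        tokens = []
        for pool, proj in zip(self.pools, self.projs):
            ctx = pool(band).flatten(2).transpose(1, 2)   # (B, H*W, 1) local mean at one scale
            tokens.append(proj(ctx))                      # (B, H*W, dim)
        tokens = torch.stack(tokens, dim=2)               # (B, H*W, n_scales, dim)
        tokens = tokens.view(B * H * W, len(self.scales), -1)
        out = self.encoder(tokens)                        # spatial Transformer over scale tokens
        return out.mean(dim=1).view(B, H, W, -1)          # per-pixel multiscale representation
```

The per-pixel outputs across all bands would then serve as the token sequence for the spectral Transformer that produces the final spectral-spatial representation of each pixel.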

MEDAVET: Traffic Vehicle Anomaly Detection Mechanism based on spatial and temporal structures in vehicle traffic

  • paper_url: http://arxiv.org/abs/2310.18548
  • repo_url: None
  • paper_authors: Ana Rosalía Huamán Reyna, Alex Josué Flórez Farfán, Geraldo Pereira Rocha Filho, Sandra Sampaio, Robson de Grande, Luis Hideo Vasconcelos Nakamura, Rodolfo Ipolito Meneguette
  • for: 这篇论文旨在利用计算机视觉对车辆跟踪进行建模,以检测公路上的交通异常。
  • methods: 该方法使用计算机视觉技术进行车辆检测与跟踪,利用二分图和凸包(Convex Hull)算法划定运动区域;异常检测使用两种数据结构,其中 QuadTree 用于分组长时间停驶的车辆,另一种结构用于处理被遮挡的车辆。
  • results: 实验结果显示,该方法在 Track4 测试集上取得了 85.7% 的 F1 分数和 25.432 的均方误差。
    Abstract Currently, computer vision systems help us with tasks that would be dull for humans, such as surveillance and vehicle tracking. An important part of this analysis is identifying traffic anomalies. An anomaly tells us that something unusual has happened, in this case on the highway. This paper aims to model vehicle tracking using computer vision to detect traffic anomalies on a highway. We develop the steps of detection, tracking, and analysis of traffic: the detection of vehicles from video of urban traffic, and the tracking of vehicles using a bipartite graph and the Convex Hull algorithm to delimit moving areas. Finally, for anomaly detection we use two data structures to detect the beginning and end of the anomaly: the first is a QuadTree that groups vehicles stopped on the road for a long time, and the second handles vehicles that are occluded. Experimental results show that our method performs acceptably on the Track4 test set, with an F1 score of 85.7% and a mean squared error of 25.432.
    摘要 如今,计算机视觉系统可以帮助我们完成一些对人类来说枯燥的任务,例如监控和车辆跟踪。这类分析的一个重要组成部分是识别交通异常:异常意味着发生了不寻常的事情,在本文中即指公路上的异常情况。本文旨在利用计算机视觉对车辆跟踪进行建模,以检测公路上的交通异常。我们开发了交通检测、跟踪与分析的各个步骤:从城市交通视频中检测车辆,使用二分图进行车辆跟踪,并用凸包(Convex Hull)算法划定运动区域;在异常检测中,我们使用两种数据结构来确定异常的起止:其一是QuadTree,用于聚合在道路上长时间停驶的车辆;其二用于处理被遮挡的车辆。实验结果表明,我们的方法在Track4测试集上取得了可接受的结果,F1分数为85.7%,均方误差为25.432。
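The tracking and moving-area steps can be approximated with standard tools, as in the sketch below: bipartite association of detections between frames via minimum-cost matching, and a convex hull over a track's centroids to delimit its moving area; the distance threshold and data layout are assumptions, and the QuadTree/occlusion structures used for anomaly detection are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import ConvexHull

def associate_detections(prev_centroids, curr_centroids, max_dist=50.0):
    # Frame-to-frame association as bipartite matching on centroid distance (pixels).
    cost = np.linalg.norm(prev_centroids[:, None, :] - curr_centroids[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)              # minimum-cost bipartite matching
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

def moving_area(track_points):
    # Delimit the moving area of a track as the convex hull of its centroid history.
    hull = ConvexHull(track_points)
    return track_points[hull.vertices]                    # hull polygon vertices

# toy usage
prev = np.array([[10.0, 20.0], [100.0, 40.0]])
curr = np.array([[12.0, 22.0], [103.0, 41.0]])
print(associate_detections(prev, curr))                   # -> [(0, 0), (1, 1)]
```

Anomaly boundaries would then be derived from structures built on top of these tracks, such as the QuadTree grouping of long-stopped vehicles described in the abstract.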