cs.CV - 2023-09-03

MAP: Domain Generalization via Meta-Learning on Anatomy-Consistent Pseudo-Modalities

  • paper_url: http://arxiv.org/abs/2309.01286
  • repo_url: None
  • paper_authors: Dewei Hu, Hao Li, Han Liu, Xing Yao, Jiacheng Wang, Ipek Oguz
  • for: Improving the clinical applicability of deep models by strengthening their generalization to unseen domains.
  • methods: Meta-learning on Anatomy-consistent Pseudo-modalities (MAP), which improves generalizability by learning structural features. A feature extraction network first generates three distinct pseudo-modalities; using an episodic learning paradigm, one pseudo-modality is selected as the meta-train set and meta-testing is performed on a continuous augmented image space generated through Dirichlet mixup of the remaining pseudo-modalities. Two additional losses capture shape information so the model focuses on morphological features.
  • results: Evaluated on seven public datasets spanning different retinal imaging modalities, MAP shows substantially better generalizability.
    Abstract Deep models suffer from limited generalization capability to unseen domains, which has severely hindered their clinical applicability. Specifically for the retinal vessel segmentation task, although the model is supposed to learn the anatomy of the target, it can be distracted by confounding factors like intensity and contrast. We propose Meta learning on Anatomy-consistent Pseudo-modalities (MAP), a method that improves model generalizability by learning structural features. We first leverage a feature extraction network to generate three distinct pseudo-modalities that share the vessel structure of the original image. Next, we use the episodic learning paradigm by selecting one of the pseudo-modalities as the meta-train dataset, and perform meta-testing on a continuous augmented image space generated through Dirichlet mixup of the remaining pseudo-modalities. Further, we introduce two loss functions that facilitate the model's focus on shape information by clustering the latent vectors obtained from images featuring identical vasculature. We evaluate our model on seven public datasets of various retinal imaging modalities and we conclude that MAP has substantially better generalizability. Our code is publically available at https://github.com/DeweiHu/MAP.
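
Below is a minimal sketch of the Dirichlet mixup step described in the abstract: convex weights drawn from a Dirichlet distribution blend anatomy-consistent pseudo-modalities into one augmented image. The function name, the `alpha` value, and the use of NumPy are illustrative assumptions, not the released MAP code.

```python
import numpy as np

def dirichlet_mixup(pseudo_modalities, alpha=1.0, rng=None):
    """Mix anatomy-consistent pseudo-modalities with Dirichlet-sampled weights.

    pseudo_modalities: list of arrays of identical shape (H, W) or (H, W, C),
    all sharing the same vessel structure. `alpha` controls how uniform the
    sampled convex weights are (a hypothetical choice; the paper's setting may differ).
    """
    rng = rng or np.random.default_rng()
    stack = np.stack(pseudo_modalities, axis=0).astype(np.float32)  # (K, H, W[, C])
    weights = rng.dirichlet([alpha] * stack.shape[0])               # convex weights summing to 1
    # Weighted sum along the modality axis yields one augmented image.
    mixed = np.tensordot(weights, stack, axes=(0, 0))
    return mixed

# Example: three pseudo-modalities of a 256x256 fundus image
imgs = [np.random.rand(256, 256).astype(np.float32) for _ in range(3)]
augmented = dirichlet_mixup(imgs, alpha=0.5)
```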

FOR-instance: a UAV laser scanning benchmark dataset for semantic and instance segmentation of individual trees

  • paper_url: http://arxiv.org/abs/2309.01279
  • repo_url: None
  • paper_authors: Stefano Puliti, Grant Pearse, Peter Surový, Luke Wallace, Markus Hollaus, Maciej Wielgosz, Rasmus Astrup
  • for: Advancing instance and semantic segmentation techniques for dense airborne laser scanning data.
  • methods: UAV-based laser scanning data are manually annotated into individual tree instances and different semantic classes (e.g., stem, woody branches, live branches, terrain, low vegetation).
  • results: The paper provides a standardized, ML-ready benchmarking dataset that supports the development of instance and semantic segmentation methods and is adaptable to diverse deep learning frameworks and segmentation strategies.
    Abstract The FOR-instance dataset (available at https://doi.org/10.5281/zenodo.8287792) addresses the challenge of accurate individual tree segmentation from laser scanning data, crucial for understanding forest ecosystems and sustainable management. Despite the growing need for detailed tree data, automating segmentation and tracking scientific progress remains difficult. Existing methodologies often overfit small datasets and lack comparability, limiting their applicability. Amid the progress triggered by the emergence of deep learning methodologies, standardized benchmarking assumes paramount importance in these research domains. This data paper introduces a benchmarking dataset for dense airborne laser scanning data, aimed at advancing instance and semantic segmentation techniques and promoting progress in 3D forest scene segmentation. The FOR-instance dataset comprises five curated and ML-ready UAV-based laser scanning data collections from diverse global locations, representing various forest types. The laser scanning data were manually annotated into individual trees (instances) and different semantic classes (e.g. stem, woody branches, live branches, terrain, low vegetation). The dataset is divided into development and test subsets, enabling method advancement and evaluation, with specific guidelines for utilization. It supports instance and semantic segmentation, offering adaptability to deep learning frameworks and diverse segmentation strategies, while the inclusion of diameter at breast height data expands its utility to the measurement of a classic tree variable. In conclusion, the FOR-instance dataset contributes to filling a gap in the 3D forest research, enhancing the development and benchmarking of segmentation algorithms for dense airborne laser scanning data.

Diffusion Models with Deterministic Normalizing Flow Priors

  • paper_url: http://arxiv.org/abs/2309.01274
  • repo_url: https://github.com/mohsenzand/dinof
  • paper_authors: Mohsen Zand, Ali Etemad, Michael Greenspan
  • for: Faster sampling and higher sample quality in diffusion-based generative modeling.
  • methods: Combines normalizing flows with diffusion models: a deterministic, invertible flow provides the prior at an intermediate diffusion step, so fewer stochastic denoising steps are needed.
  • results: Strong FID and Inception scores on standard image generation datasets (e.g., FID 2.01 and Inception score 9.96 on unconditional CIFAR10; FID 7.11 on CelebA-HQ-256).
    Abstract For faster sampling and higher sample quality, we propose DiNof ($\textbf{Di}$ffusion with $\textbf{No}$rmalizing $\textbf{f}$low priors), a technique that makes use of normalizing flows and diffusion models. We use normalizing flows to parameterize the noisy data at any arbitrary step of the diffusion process and utilize it as the prior in the reverse diffusion process. More specifically, the forward noising process turns a data distribution into partially noisy data, which are subsequently transformed into a Gaussian distribution by a nonlinear process. The backward denoising procedure begins with a prior created by sampling from the Gaussian distribution and applying the invertible normalizing flow transformations deterministically. To generate the data distribution, the prior then undergoes the remaining diffusion stochastic denoising procedure. Through the reduction of the number of total diffusion steps, we are able to speed up both the forward and backward processes. More importantly, we improve the expressive power of diffusion models by employing both deterministic and stochastic mappings. Experiments on standard image generation datasets demonstrate the advantage of the proposed method over existing approaches. On the unconditional CIFAR10 dataset, for example, we achieve an FID of 2.01 and an Inception score of 9.96. Our method also demonstrates competitive performance on CelebA-HQ-256 dataset as it obtains an FID score of 7.11. Code is available at https://github.com/MohsenZand/DiNof.
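
The sampling procedure, as we read it from the abstract, can be sketched as: draw Gaussian noise, map it deterministically through the inverse normalizing flow to the partially noisy state at an intermediate step, then run only the remaining stochastic reverse-diffusion steps. The callables and toy placeholders below are assumptions standing in for a trained flow and denoiser, not the released DiNof code.

```python
import torch

@torch.no_grad()
def sample_with_flow_prior(flow_inverse, reverse_step, shape, t_start, device="cpu"):
    """Sketch of DiNof-style sampling (our reading of the abstract).

    flow_inverse: callable mapping Gaussian noise z -> x_{t_start}, the partially
                  noisy state at intermediate step t_start (assumed deterministic).
    reverse_step: callable (x_t, t) -> x_{t-1}, one stochastic reverse-diffusion step.
    """
    z = torch.randn(shape, device=device)        # Gaussian prior sample
    x = flow_inverse(z)                          # deterministic map to the intermediate step
    for t in range(t_start, 0, -1):              # only the remaining diffusion steps
        x = reverse_step(x, t)
    return x

# Toy placeholders standing in for a trained flow and denoiser (assumptions).
flow_inverse = lambda z: 0.9 * z
reverse_step = lambda x, t: x - 0.01 * x
sample = sample_with_flow_prior(flow_inverse, reverse_step, (1, 3, 32, 32), t_start=100)
```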

SOAR: Scene-debiasing Open-set Action Recognition

  • paper_url: http://arxiv.org/abs/2309.01265
  • repo_url: https://github.com/yhZhai/SOAR
  • paper_authors: Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, Gang Hua
  • for: mitigating the risk of utilizing spurious clues in open-set action recognition
  • methods: adversarial scene reconstruction module, adaptive adversarial scene classification module
  • results: better mitigation of scene bias; outperforms state-of-the-art methods
    Abstract Deep learning models have a risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade the open-set action recognition performance when the testing samples have different scene distributions from the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that the current open-set action recognizers are biased toward the scene, and our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and the ablation studies confirm the effectiveness of our proposed modules.
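
One common way to realize an adversarial scene-classification objective like the one described above is a gradient reversal layer: the scene head is trained normally while the backbone receives negated gradients, pushing it toward scene-invariant features. The sketch below illustrates that generic mechanism under this assumption; the paper's adaptive adversarial module may be implemented differently.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SceneAdversary(nn.Module):
    """Scene classifier trained through reversed gradients so the backbone learns
    features that confuse scene prediction (i.e., scene-invariant features)."""
    def __init__(self, feat_dim, num_scene_types, lam=1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Linear(feat_dim, num_scene_types)

    def forward(self, video_features):
        reversed_feat = GradReverse.apply(video_features, self.lam)
        return self.head(reversed_feat)

# Usage: scene_logits = SceneAdversary(2048, 365)(features); add a CE loss on scene labels.
```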

Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2309.01262
  • repo_url: None
  • paper_authors: Hyeongju Choi, Apoorva Beedu, Irfan Essa
  • for: Improving human activity recognition (HAR) systems for daily-life applications with self-supervised learning, reducing the cost and difficulty of obtaining annotated data.
  • methods: A self-supervised multimodal contrastive learning method with a hard negative sampling loss for skeleton and IMU data pairs, using activities recorded by camera and IMU sensors.
  • results: Learns strong feature representations on two benchmark datasets (UTD-MHAD and MMAct), including in limited-data settings, and outperforms existing state-of-the-art methods on downstream activity recognition.
    Abstract Human Activity Recognition (HAR) systems have been extensively studied by the vision and ubiquitous computing communities due to their practical applications in daily life, such as smart homes, surveillance, and health monitoring. Typically, this process is supervised in nature and the development of such systems requires access to large quantities of annotated data. However, the higher costs and challenges associated with obtaining good quality annotations have rendered the application of self-supervised methods an attractive option and contrastive learning comprises one such method. However, a major component of successful contrastive learning is the selection of good positive and negative samples. Although positive samples are directly obtainable, sampling good negative samples remain a challenge. As human activities can be recorded by several modalities like camera and IMU sensors, we propose a hard negative sampling method for multimodal HAR with a hard negative sampling loss for skeleton and IMU data pairs. We exploit hard negatives that have different labels from the anchor but are projected nearby in the latent space using an adjustable concentration parameter. Through extensive experiments on two benchmark datasets: UTD-MHAD and MMAct, we demonstrate the robustness of our approach forlearning strong feature representation for HAR tasks, and on the limited data setting. We further show that our model outperforms all other state-of-the-art methods for UTD-MHAD dataset, and self-supervised methods for MMAct: Cross session, even when uni-modal data are used during downstream activity recognition.
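
A generic sketch of similarity-weighted hard-negative sampling in a contrastive loss, where a concentration parameter `beta` up-weights negatives that lie close to the anchor in the latent space. The tensor shapes, temperature, and exact weighting form are assumptions for illustration; this is not the paper's released loss.

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(anchor, positive, negatives, tau=0.1, beta=1.0):
    """Contrastive loss that re-weights negatives by their similarity to the anchor.

    anchor, positive: (B, D) embeddings from two modalities (e.g. skeleton and IMU).
    negatives: (B, N, D) embeddings whose labels differ from the anchor's.
    beta is the concentration parameter: larger beta puts more weight on negatives
    projected near the anchor (the "hard" ones).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1) / tau                      # (B,)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives) / tau    # (B, N)

    # Importance weights concentrate the negative term on hard (nearby) negatives.
    weights = torch.softmax(beta * neg_sim, dim=-1)
    neg_term = (weights * neg_sim.exp()).sum(-1) * neg_sim.size(-1)  # reweighted sum

    loss = -(pos_sim.exp() / (pos_sim.exp() + neg_term)).log()
    return loss.mean()
```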

S2RF: Semantically Stylized Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.01252
  • repo_url: None
  • paper_authors: Dishani Lahiri, Neeraj Panse, Moneish Kumar
  • for: Transferring the style of arbitrary image(s) onto objects within a 3D scene.
  • methods: A novel nearest neighborhood-based loss that allows flexible 3D scene reconstruction and customizable stylization while preserving multi-view consistency.
  • results: The method produces customizable, stylized scene renderings from arbitrary viewpoints while capturing intricate style details and maintaining multi-view consistency.
    Abstract We present our method for transferring style from any arbitrary image(s) to object(s) within a 3D scene. Our primary objective is to offer more control in 3D scene stylization, facilitating the creation of customizable and stylized scene images from arbitrary viewpoints. To achieve this, we propose a novel approach that incorporates nearest neighborhood-based loss, allowing for flexible 3D scene reconstruction while effectively capturing intricate style details and ensuring multi-view consistency.
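
One plausible form of a nearest-neighbour based style loss is sketched below: each feature of the rendered view is pulled toward its closest style-image feature in cosine distance. The feature source (e.g., a frozen VGG or CLIP encoder) and the specific distance are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def nearest_neighbor_feature_loss(rendered_feats, style_feats):
    """Nearest-neighbour feature matching loss (a sketch of one plausible form).

    rendered_feats: (N, D) features extracted from a rendered view of the 3D scene.
    style_feats:    (M, D) features extracted from the style image(s).
    Each rendered feature is matched to its closest style feature, which transfers
    local style detail without requiring spatial alignment.
    """
    r = F.normalize(rendered_feats, dim=-1)
    s = F.normalize(style_feats, dim=-1)
    cos_dist = 1.0 - r @ s.t()              # (N, M) pairwise cosine distances
    nn_dist, _ = cos_dist.min(dim=1)        # distance to the nearest style feature
    return nn_dist.mean()

# Usage with random stand-ins for encoder features (an assumption for illustration):
loss = nearest_neighbor_feature_loss(torch.randn(1024, 256), torch.randn(4096, 256))
```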

Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning

  • paper_url: http://arxiv.org/abs/2309.01246
  • repo_url: https://github.com/yhZhai/WSCL
  • paper_authors: Yuanhao Zhai, Tianyu Luan, David Doermann, Junsong Yuan
  • for: Weakly-supervised image manipulation detection, which requires only image-level binary labels (authentic or tampered) for training, so that more training images can be leveraged and new manipulation techniques can be adapted to quickly.
  • methods: A weakly-supervised self-consistency learning (WSCL) scheme that exploits two consistency properties: multi-source consistency (MSC), which uses content-agnostic information and enables cross-source learning via online pseudo-label generation and refinement, and inter-patch consistency (IPC), which reasons over global pair-wise patch relationships to discover the complete manipulated region.
  • results: Although weakly supervised, the method achieves performance competitive with its fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, along with reasonable manipulation localization.
    Abstract As advanced image manipulation techniques emerge, detecting the manipulation becomes increasingly important. Despite the success of recent learning-based approaches for image manipulation detection, they typically require expensive pixel-level annotations to train, while exhibiting degraded performance when testing on images that are differently manipulated compared with training images. To address these limitations, we propose weakly-supervised image manipulation detection, such that only binary image-level labels (authentic or tampered with) are required for training purpose. Such a weakly-supervised setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques. To improve the generalization ability, we propose weakly-supervised self-consistency learning (WSCL) to leverage the weakly annotated images. Specifically, two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits different content-agnostic information and enables cross-source learning via an online pseudo label generation and refinement process. IPC performs global pair-wise patch-patch relationship reasoning to discover a complete region of manipulation. Extensive experiments validate that our WSCL, even though is weakly supervised, exhibits competitive performance compared with fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, as well as reasonable manipulation localization ability.

BodySLAM++: Fast and Tightly-Coupled Visual-Inertial Camera and Human Motion Tracking

  • paper_url: http://arxiv.org/abs/2309.01236
  • repo_url: None
  • paper_authors: Dorian F. Henning, Christopher Choi, Simon Schaefer, Stefan Leutenegger
  • for: Robust, fast, and accurate human state estimation (6D pose and posture), particularly for real-world applications that require real-time operation.
  • methods: A tightly-coupled visual-inertial framework that extends the existing visual-inertial state estimator OKVIS2 to solve the dual task of estimating camera and human states simultaneously.
  • results: Improves human and camera state estimation accuracy over baseline methods (by 26% and 12%, respectively) and runs in real time at 15+ frames per second on an Intel i7-class CPU.
    Abstract Robust, fast, and accurate human state - 6D pose and posture - estimation remains a challenging problem. For real-world applications, the ability to estimate the human state in real-time is highly desirable. In this paper, we present BodySLAM++, a fast, efficient, and accurate human and camera state estimation framework relying on visual-inertial data. BodySLAM++ extends an existing visual-inertial state estimation framework, OKVIS2, to solve the dual task of estimating camera and human states simultaneously. Our system improves the accuracy of both human and camera state estimation with respect to baseline methods by 26% and 12%, respectively, and achieves real-time performance at 15+ frames per second on an Intel i7-model CPU. Experiments were conducted on a custom dataset containing both ground truth human and camera poses collected with an indoor motion tracking system.

Generalizability and Application of the Skin Reflectance Estimate Based on Dichromatic Separation (SREDS)

  • paper_url: http://arxiv.org/abs/2309.01235
  • repo_url: https://github.com/josephdrahos/sreds
  • paper_authors: Joseph Drahos, Richard Plesh, Keivan Bahmani, Mahesh Banavar, Stephanie Schuckers
  • for: Providing a reliable skin tone metric to help characterize and reduce skin-tone-related performance differences in face recognition systems.
  • methods: Further analysis and evaluation of the Skin Reflectance Estimate based on Dichromatic Separation (SREDS) against other skin tone metrics.
  • results: SREDS consistently yields a skin tone metric with lower within-subject variability and can serve as an alternative to self-reported race labels with minimal drop in performance; an open-source SREDS implementation is provided for the research community.
    Abstract Face recognition (FR) systems have become widely used and readily available in recent history. However, differential performance between certain demographics has been identified within popular FR models. Skin tone differences between demographics can be one of the factors contributing to the differential performance observed in face recognition models. Skin tone metrics provide an alternative to self-reported race labels when such labels are lacking or completely not available e.g. large-scale face recognition datasets. In this work, we provide a further analysis of the generalizability of the Skin Reflectance Estimate based on Dichromatic Separation (SREDS) against other skin tone metrics and provide a use case for substituting race labels for SREDS scores in a privacy-preserving learning solution. Our findings suggest that SREDS consistently creates a skin tone metric with lower variability within each subject and SREDS values can be utilized as an alternative to the self-reported race labels at minimal drop in performance. Finally, we provide a publicly available and open-source implementation of SREDS to help the research community. Available at https://github.com/JosephDrahos/SREDS

Spectral Adversarial MixUp for Few-Shot Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.01207
  • repo_url: https://github.com/RPIDIAL/SAMix
  • paper_authors: Jiajin Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, Pingkun Yan
  • for: Addressing few-shot unsupervised domain adaptation (FSUDA) in clinical applications, where only a limited number of unlabeled target-domain samples are available for training.
  • methods: A spectral sensitivity map that characterizes a model's generalization weaknesses in the frequency domain, combined with sensitivity-guided spectral adversarial mixup (SAMix) to generate target-style images that suppress that sensitivity.
  • results: Rigorous evaluation on multiple tasks and public datasets shows that the method effectively improves model generalizability in the target domain.
    Abstract Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model training. In this paper, we propose a novel method for Few-Shot Unsupervised Domain Adaptation (FSUDA), where only a limited number of unlabeled target domain samples are available for training. To accomplish this challenging task, first, a spectral sensitivity map is introduced to characterize the generalization weaknesses of models in the frequency domain. We then developed a Sensitivity-guided Spectral Adversarial MixUp (SAMix) method to generate target-style images to effectively suppresses the model sensitivity, which leads to improved model generalizability in the target domain. We demonstrated the proposed method and rigorously evaluated its performance on multiple tasks using several public datasets.
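
A plain Fourier amplitude mixup sketch illustrates the spectral side of the idea: mix the amplitude spectrum of a source image with that of a target-domain image while keeping the source phase, which carries most of the structure. SAMix additionally guides the mixing with a learned spectral sensitivity map and an adversarial objective, both omitted here as they are not reconstructed from the abstract.

```python
import numpy as np

def spectral_amplitude_mixup(source, target, lam=0.5):
    """Mix the Fourier amplitude of a source image with a target-domain image,
    keeping the source phase. This is a generic amplitude-mixup sketch, not the
    sensitivity-guided adversarial version described in the paper."""
    fs = np.fft.fft2(source.astype(np.float32))
    ft = np.fft.fft2(target.astype(np.float32))
    amp = (1 - lam) * np.abs(fs) + lam * np.abs(ft)   # mixed amplitude spectrum
    phase = np.angle(fs)                              # source phase kept intact
    mixed = np.fft.ifft2(amp * np.exp(1j * phase)).real
    return mixed

# Example on two random grayscale images standing in for source/target domains.
src, tgt = np.random.rand(128, 128), np.random.rand(128, 128)
target_style = spectral_amplitude_mixup(src, tgt, lam=0.3)
```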

MAGMA: Music Aligned Generative Motion Autodecoder

  • paper_url: http://arxiv.org/abs/2309.01202
  • repo_url: None
  • paper_authors: Sohan Anisetty, Amit Raj, James Hays
  • for: Mapping music to dance with spatial and temporal coherence while staying continually synchronized with the music's progression.
  • methods: A Vector Quantized-Variational Autoencoder (VQ-VAE) distills motion into primitives and a Transformer decoder learns their correct sequencing; the importance of music representations is assessed by comparing naive Librosa features with deep audio representations, and relative versus absolute positional encodings are compared for long sequence generation.
  • results: State-of-the-art results on music-to-motion generation benchmarks, real-time generation of considerably longer motion sequences, seamless chaining of multiple sequences, and easy customization to meet style requirements.
    Abstract Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with a continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer decoder to learn the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine the effect on generated motion quality when generating arbitrarily long sequence lengths. Our proposed approach achieve state-of-the-art results in music-to-motion generation benchmarks and enables the real-time generation of considerably longer motion sequences, the ability to chain multiple motion sequences seamlessly, and easy customization of motion sequences to meet style requirements.

Holistic Dynamic Frequency Transformer for Image Fusion and Exposure Correction

  • paper_url: http://arxiv.org/abs/2309.01183
  • repo_url: None
  • paper_authors: Xiaoke Shang, Gehui Li, Zhiying Jiang, Shaomin Zhang, Nai Ding, Jinyuan Liu
  • for: Improving image quality by correcting exposure-related issues.
  • methods: Frequency-domain processing replaces conventional spatial correlation computation: Holistic Frequency Attention and a Dynamic Frequency Feed-Forward Network form a U-shaped Holistic Dynamic Frequency Transformer that extracts global information and dynamically selects important frequency bands; a Laplacian pyramid decomposes images into frequency bands, each recovered by a dedicated restorer.
  • results: State-of-the-art results on mainstream datasets, unifying low-light enhancement, exposure correction, and multi-exposure fusion in one framework.
    Abstract The correction of exposure-related issues is a pivotal component in enhancing the quality of images, offering substantial implications for various computer vision tasks. Historically, most methodologies have predominantly utilized spatial domain recovery, offering limited consideration to the potentialities of the frequency domain. Additionally, there has been a lack of a unified perspective towards low-light enhancement, exposure correction, and multi-exposure fusion, complicating and impeding the optimization of image processing. In response to these challenges, this paper proposes a novel methodology that leverages the frequency domain to improve and unify the handling of exposure correction tasks. Our method introduces Holistic Frequency Attention and Dynamic Frequency Feed-Forward Network, which replace conventional correlation computation in the spatial-domain. They form a foundational building block that facilitates a U-shaped Holistic Dynamic Frequency Transformer as a filter to extract global information and dynamically select important frequency bands for image restoration. Complementing this, we employ a Laplacian pyramid to decompose images into distinct frequency bands, followed by multiple restorers, each tuned to recover specific frequency-band information. The pyramid fusion allows a more detailed and nuanced image restoration process. Ultimately, our structure unifies the three tasks of low-light enhancement, exposure correction, and multi-exposure fusion, enabling comprehensive treatment of all classical exposure errors. Benchmarking on mainstream datasets for these tasks, our proposed method achieves state-of-the-art results, paving the way for more sophisticated and unified solutions in exposure correction.
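
A small sketch of the Laplacian-pyramid stage: the image is split into band-pass levels plus a low-frequency residual, each band can be handed to its own restorer, and summing the bands reconstructs the image. The Gaussian-difference construction (without downsampling) and the identity restorers are simplifying assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_pyramid(img, levels=3, sigma=2.0):
    """Decompose an image into band-pass (Laplacian) levels plus a low-frequency residual."""
    bands, current = [], img.astype(np.float32)
    for _ in range(levels):
        blurred = gaussian_filter(current, sigma)
        bands.append(current - blurred)   # detail retained at this frequency band
        current = blurred
    bands.append(current)                 # final low-frequency residual
    return bands

def reconstruct(bands, restorers=None):
    """Sum the (optionally restored) bands back into an image. `restorers` mirrors the
    idea of one restorer per frequency band (identity placeholders here)."""
    restorers = restorers or [lambda b: b] * len(bands)
    return sum(r(b) for r, b in zip(restorers, bands))

img = np.random.rand(256, 256)
bands = laplacian_pyramid(img, levels=3)
out = reconstruct(bands)                  # recovers img up to floating-point error
```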

Deep Unfolding Convolutional Dictionary Model for Multi-Contrast MRI Super-resolution and Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01171
  • repo_url: https://github.com/lpcccc-cv/mc-cdic
  • paper_authors: Pengcheng Lei, Faming Fang, Guixu Zhang, Ming Xu
  • for: A deep learning-based multi-contrast MRI super-resolution and reconstruction method that exploits the complementary information across contrasts.
  • methods: A multi-contrast convolutional dictionary (MC-CDic) model built on an observation model of the multi-contrast images, optimized with a proximal gradient algorithm whose iterations are unrolled into a deep network; the proximal operators are replaced by learnable ResNets and multi-scale dictionaries are introduced.
  • results: MC-CDic outperforms existing state-of-the-art methods on multi-contrast MRI super-resolution and reconstruction tasks.
    Abstract Magnetic resonance imaging (MRI) tasks often involve multiple contrasts. Recently, numerous deep learning-based multi-contrast MRI super-resolution (SR) and reconstruction methods have been proposed to explore the complementary information from the multi-contrast images. However, these methods either construct parameter-sharing networks or manually design fusion rules, failing to accurately model the correlations between multi-contrast images and lacking certain interpretations. In this paper, we propose a multi-contrast convolutional dictionary (MC-CDic) model under the guidance of the optimization algorithm with a well-designed data fidelity term. Specifically, we bulid an observation model for the multi-contrast MR images to explicitly model the multi-contrast images as common features and unique features. In this way, only the useful information in the reference image can be transferred to the target image, while the inconsistent information will be ignored. We employ the proximal gradient algorithm to optimize the model and unroll the iterative steps into a deep CDic model. Especially, the proximal operators are replaced by learnable ResNet. In addition, multi-scale dictionaries are introduced to further improve the model performance. We test our MC-CDic model on multi-contrast MRI SR and reconstruction tasks. Experimental results demonstrate the superior performance of the proposed MC-CDic model against existing SOTA methods. Code is available at https://github.com/lpcccc-cv/MC-CDic.
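
A generic deep-unfolding sketch in the spirit of the proximal gradient unrolling described above: alternate a gradient step on a quadratic data-fidelity term with a small learnable network acting as the proximal operator. The operators, network width, step size, and initialization are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class UnrolledProxGrad(nn.Module):
    """Unrolled proximal-gradient sketch: gradient step on ||A x - y||^2, then a
    learned prox (a tiny CNN standing in for the ResNet prox in the paper)."""
    def __init__(self, n_iters=5, channels=1):
        super().__init__()
        self.n_iters = n_iters
        self.step = nn.Parameter(torch.tensor(0.1))
        self.prox = nn.Sequential(                      # learnable proximal operator
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, y, A, At):
        """y: measurements; A/At: forward operator and its adjoint (callables)."""
        x = At(y)                                       # simple initialization
        for _ in range(self.n_iters):
            grad = At(A(x) - y)                         # gradient of the fidelity term
            x = self.prox(x - self.step * grad)         # learned prox replaces soft-thresholding
        return x

# Usage with identity operators as placeholders for the degradation/undersampling:
net = UnrolledProxGrad()
recon = net(torch.randn(1, 1, 64, 64), A=lambda x: x, At=lambda x: x)
```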

An Asynchronous Linear Filter Architecture for Hybrid Event-Frame Cameras

  • paper_url: http://arxiv.org/abs/2309.01159
  • repo_url: https://github.com/ziweiwwang/event-asynchronous-filter
  • paper_authors: Ziwei Wang, Yonhon Ng, Cedric Scheerlinck, Robert Mahony
  • for: A fusion framework for event and frame camera data, targeting HDR video reconstruction and spatial convolution.
  • methods: An asynchronous linear filter architecture that fuses event and frame camera data, exploiting the complementary advantages of both sensor modalities.
  • results: On public datasets, the method outperforms other state-of-the-art approaches in absolute intensity error (69.4% reduction) and image similarity indexes (average 35.5% improvement); the framework also supports image convolution with linear spatial kernels.
    Abstract Event cameras are ideally suited to capture High Dynamic Range (HDR) visual information without blur but provide poor imaging capability for static or slowly varying scenes. Conversely, conventional image sensors measure absolute intensity of slowly changing scenes effectively but do poorly on HDR or quickly changing scenes. In this paper, we present an asynchronous linear filter architecture, fusing event and frame camera data, for HDR video reconstruction and spatial convolution that exploits the advantages of both sensor modalities. The key idea is the introduction of a state that directly encodes the integrated or convolved image information and that is updated asynchronously as each event or each frame arrives from the camera. The state can be read-off as-often-as and whenever required to feed into subsequent vision modules for real-time robotic systems. Our experimental results are evaluated on both publicly available datasets with challenging lighting conditions and fast motions, along with a new dataset with HDR reference that we provide. The proposed AKF pipeline outperforms other state-of-the-art methods in both absolute intensity error (69.4% reduction) and image similarity indexes (average 35.5% improvement). We also demonstrate the integration of image convolution with linear spatial kernels Gaussian, Sobel, and Laplacian as an application of our architecture.
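
The asynchronous-state idea can be sketched as a per-pixel complementary-style filter: events add their contrast step to a log-intensity state immediately, frames provide a low-frequency reference that the state decays toward, and the state can be read out at any time. The gains and exact update law below are illustrative assumptions, not the paper's tuned filter.

```python
import numpy as np

class AsyncLogIntensityFilter:
    """Per-pixel asynchronous linear filter sketch for event-frame fusion."""
    def __init__(self, height, width, alpha=2.0):
        self.L = np.zeros((height, width), dtype=np.float32)      # log-intensity state
        self.frame = np.zeros_like(self.L)                        # latest frame reference
        self.last_t = np.zeros_like(self.L)
        self.alpha = alpha                                        # decay gain (assumed)

    def _decay(self, y, x, t):
        dt = t - self.last_t[y, x]
        w = np.exp(-self.alpha * dt)
        self.L[y, x] = w * self.L[y, x] + (1.0 - w) * self.frame[y, x]
        self.last_t[y, x] = t

    def on_event(self, x, y, polarity, t, contrast=0.1):
        self._decay(y, x, t)                                      # catch up, then add the event step
        self.L[y, x] += contrast if polarity > 0 else -contrast

    def on_frame(self, log_frame, t):
        dt = t - self.last_t                                      # decay every pixel to the old frame
        w = np.exp(-self.alpha * dt)
        self.L = w * self.L + (1.0 - w) * self.frame
        self.frame = log_frame.astype(np.float32)                 # new low-frequency reference
        self.last_t[:] = t

    def read(self):
        return self.L                                             # readable as often as needed
```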

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.01155
  • repo_url: None
  • paper_authors: Cheng Shi, Sibei Yang
  • for: Improving the performance of pre-trained vision-language models on downstream tasks, especially image recognition.
  • methods: Uses synthetic text images as visual prompts, reformulating classification as visual prompt selection and resolving the chicken-and-egg problem between adding class-wise visual prompts and predicting the class first.
  • results: Without any trainable visual prompt parameters, the method consistently outperforms state-of-the-art approaches on 16 datasets for few-shot learning, base-to-new generalization, and domain generalization.
    Abstract Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks. For example, providing the prompt ``Let's think step by step" improved GPT-3's reasoning accuracy to 63% on MutiArith while prompting ``a photo of" filled with a class name enables CLIP to achieve $80$\% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analyzing what constitutes a good visual prompt specifically for image recognition is limited. In addition, existing visual prompt tuning methods' generalization ability is worse than text-only prompting tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To achieve that, we propose our LoGoPrompt, which reformulates the classification objective to the visual prompt selection and addresses the chicken-and-egg challenge of first adding synthetic text images as class-wise visual prompts or predicting the class first. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.

EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment

  • paper_url: http://arxiv.org/abs/2309.01151
  • repo_url: None
  • paper_authors: Cheng Shi, Sibei Yang
  • for: Improving open-vocabulary object detection, where the detector is trained on base categories yet must detect novel categories.
  • methods: Early Dense Alignment (EDA), which uses object-level supervision to learn dense-level rather than object-level alignment with CLIP's zero-shot recognition ability, preserving generalizable fine-grained local semantics and avoiding overfitting to base categories.
  • results: Improves novel box AP50 on COCO by +8.4% and rare mask AP on LVIS by +3.9% over competing approaches under the same strict setting, without external training resources.
    Abstract Vision-language models such as CLIP have boosted the performance of open-vocabulary object detection, where the detector is trained on base categories but required to detect novel categories. Existing methods leverage CLIP's strong zero-shot recognition ability to align object-level embeddings with textual embeddings of categories. However, we observe that using CLIP for object-level alignment results in overfitting to base categories, i.e., novel categories most similar to base categories have particularly poor performance as they are recognized as similar base categories. In this paper, we first identify that the loss of critical fine-grained local image semantics hinders existing methods from attaining strong base-to-novel generalization. Then, we propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction. In EDA, we use object-level supervision to learn the dense-level rather than object-level alignment to maintain the local fine-grained semantics. Extensive experiments demonstrate our superior performance to competing approaches under the same strict setting and without using external training resources, i.e., improving the +8.4% novel box AP50 on COCO and +3.9% rare mask AP on LVIS.

VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

  • paper_url: http://arxiv.org/abs/2309.01141
  • repo_url: None
  • paper_authors: Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
  • for: Zero-shot visual grounding that requires no fine-tuning and no additional training data.
  • methods: VGDiffZero, a simple yet effective zero-shot visual grounding framework built on pre-trained text-to-image diffusion models, with a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal.
  • results: Strong zero-shot visual grounding performance on RefCOCO, RefCOCO+, and RefCOCOg.
    Abstract Large-scale text-to-image diffusion models have shown impressive capabilities across various generative tasks, enabled by strong vision-language alignment obtained through pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.

RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model

  • paper_url: http://arxiv.org/abs/2309.02455
  • repo_url: None
  • paper_authors: Ahmad Sebaq, Mohamed ElHelw
  • for: This paper is written for remote sensing tasks that require high-quality, detailed satellite images for accurate analysis and decision-making.
  • methods: The paper proposes an innovative and lightweight approach that employs two-stage diffusion models to generate high-resolution satellite images purely based on text prompts. The approach consists of two interconnected diffusion models: a Low-Resolution Generation Diffusion Model (LR-GDM) and a Super-Resolution Diffusion Model (SRDM).
  • results: The approach outperforms existing state-of-the-art (SoTA) models in generating satellite images with realistic geographical features, weather conditions, and land structures, while achieving remarkable super-resolution results for increased spatial precision.
    Abstract Satellite imagery generation and super-resolution are pivotal tasks in remote sensing, demanding high-quality, detailed images for accurate analysis and decision-making. In this paper, we propose an innovative and lightweight approach that employs two-stage diffusion models to gradually generate high-resolution Satellite images purely based on text prompts. Our innovative pipeline comprises two interconnected diffusion models: a Low-Resolution Generation Diffusion Model (LR-GDM) that generates low-resolution images from text and a Super-Resolution Diffusion Model (SRDM) conditionally produced. The LR-GDM effectively synthesizes low-resolution by (computing the correlations of the text embedding and the image embedding in a shared latent space), capturing the essential content and layout of the desired scenes. Subsequently, the SRDM takes the generated low-resolution image and its corresponding text prompts and efficiently produces the high-resolution counterparts, infusing fine-grained spatial details and enhancing visual fidelity. Experiments are conducted on the commonly used dataset, Remote Sensing Image Captioning Dataset (RSICD). Our results demonstrate that our approach outperforms existing state-of-the-art (SoTA) models in generating satellite images with realistic geographical features, weather conditions, and land structures while achieving remarkable super-resolution results for increased spatial precision.

Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion

  • paper_url: http://arxiv.org/abs/2309.01113
  • repo_url: None
  • paper_authors: Guanyao Wu, Hongming Fu, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: Addressing the limitations of multi-exposure image fusion, such as color distortions and exposure artifacts, to achieve more authentic image representation.
  • methods: A Hybrid-Supervised Dual-Search approach (HSDS-MEF) that uses a bi-level optimization search scheme to automatically design both the network structure and the loss function, guided by a hybrid supervised contrast constraint.
  • results: State-of-the-art performance against various competitive schemes, with 10.61% and 4.38% improvements in Visual Information Fidelity (VIF) for general and no-reference scenarios, respectively, while producing results with high contrast, rich details, and vivid colors.
    Abstract Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. In addressing these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions. More specifically, we harnesses a unique dual research mechanism rooted in a novel weighted structure refinement architecture search. Besides, a hybrid supervised contrast constraint seamlessly guides and integrates with searching process, facilitating a more adaptive and comprehensive search for optimal loss functions. We realize the state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) for general and no-reference scenarios, respectively, while providing results with high contrast, rich details and colors.

ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.01111
  • repo_url: https://github.com/duyooho/arsdm
  • paper_authors: Yuhao Du, Yuncheng Jiang, Shuangyi Tan, Xusheng Wu, Qi Dou, Zhen Li, Guanbin Li, Xiang Wan
  • for: Assisting clinical diagnosis and treatment by improving polyp segmentation and detection accuracy despite scarce annotated data.
  • methods: An Adaptive Refinement Semantic Diffusion Model (ArSDM) that generates colonoscopy images conditioned on ground-truth segmentation masks, adjusts the diffusion loss for each input according to the polyp/background size ratio, and refines training with a pre-trained segmentation model.
  • results: Extensive experiments on segmentation and detection tasks show that the generated data significantly boost the performance of baseline methods.
    Abstract Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment. However, as medical image annotation is labour- and resource-intensive, the scarcity of annotated data limits the effectiveness and generalization of existing methods. Although recent research has focused on data generation and augmentation to address this issue, the quality of the generated data remains a challenge, which limits the contribution to the performance of subsequent tasks. Inspired by the superiority of diffusion models in fitting data distributions and generating high-quality data, in this paper, we propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks. Specifically, ArSDM utilizes the ground-truth segmentation mask as a prior condition during training and adjusts the diffusion loss for each input according to the polyp/background size ratio. Furthermore, ArSDM incorporates a pre-trained segmentation model to refine the training process by reducing the difference between the ground-truth mask and the prediction mask. Extensive experiments on segmentation and detection tasks demonstrate the generated data by ArSDM could significantly boost the performance of baseline methods.
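
A hedged sketch of the size-adaptive loss idea: the per-pixel noise-prediction loss is re-weighted so that pixels inside small polyps count more, in inverse proportion to the polyp/background area ratio. The precise weighting rule is our assumption based on the abstract, not the released implementation.

```python
import torch
import torch.nn.functional as F

def size_adaptive_diffusion_loss(noise_pred, noise, mask, eps=1e-6):
    """Pixel-wise diffusion (noise-prediction) loss re-weighted by lesion size.

    mask: (B, 1, H, W) binary polyp mask used as the conditioning prior.
    Polyp pixels are up-weighted inversely to the polyp area fraction, so small
    lesions are not drowned out by the background (an assumed weighting rule).
    """
    ratio = mask.flatten(1).mean(dim=1).clamp(min=eps)          # polyp area fraction per image
    weight = torch.where(mask > 0.5,
                         (1.0 / ratio).view(-1, 1, 1, 1),       # boost polyp pixels
                         torch.ones_like(mask))
    per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
    return (weight * per_pixel).mean()
```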

AdvMono3D: Advanced Monocular 3D Object Detection with Depth-Aware Robust Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.01106
  • repo_url: None
  • paper_authors: Xingyuan Li, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: Strengthening the adversarial robustness of monocular 3D object detection models against attacks targeting them.
  • methods: A depth-aware robust adversarial training method (DART3D) that pairs an iterative attack (IDP) degrading the 2D and 3D perception of 3D detectors with an uncertainty-based residual learning scheme for adversarial training.
  • results: Extensive experiments on the KITTI 3D dataset show that, under attack, DART3D surpasses direct adversarial training (the most popular approach), improving car-category $AP_{R40}$ by 4.415%, 4.112%, and 3.195% on the Easy, Moderate, and Hard settings, respectively.
    Abstract Monocular 3D object detection plays a pivotal role in the field of autonomous driving and numerous deep learning-based methods have made significant breakthroughs in this area. Despite the advancements in detection accuracy and efficiency, these models tend to fail when faced with such attacks, rendering them ineffective. Therefore, bolstering the adversarial robustness of 3D detection models has become a crucial issue that demands immediate attention and innovative solutions. To mitigate this issue, we propose a depth-aware robust adversarial training method for monocular 3D object detection, dubbed DART3D. Specifically, we first design an adversarial attack that iteratively degrades the 2D and 3D perception capabilities of 3D object detection models(IDP), serves as the foundation for our subsequent defense mechanism. In response to this attack, we propose an uncertainty-based residual learning method for adversarial training. Our adversarial training approach capitalizes on the inherent uncertainty, enabling the model to significantly improve its robustness against adversarial attacks. We conducted extensive experiments on the KITTI 3D datasets, demonstrating that DART3D surpasses direct adversarial training (the most popular approach) under attacks in 3D object detection $AP_{R40}$ of car category for the Easy, Moderate, and Hard settings, with improvements of 4.415%, 4.112%, and 3.195%, respectively.
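
A standard PGD loop sketches the kind of iterative perturbation attack the IDP module builds on: ascend a perception loss under an L-infinity budget. The combined 2D/3D objective is abstracted into a placeholder `loss_fn`, which is an assumption; the paper's attack and budget may differ.

```python
import torch

def pgd_attack(model, images, loss_fn, eps=4/255, alpha=1/255, steps=10):
    """Generic projected gradient descent (PGD) attack sketch.

    loss_fn(model, x) should return a scalar perception loss to maximize
    (a placeholder for a combined 2D/3D objective)."""
    x_adv = images.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model, x_adv)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                   # ascend the loss
            x_adv = images + (x_adv - images).clamp(-eps, eps)    # project to the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                         # keep a valid image
    return x_adv.detach()
```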

Turn Fake into Real: Adversarial Head Turn Attacks Against Deepfake Detection

  • paper_url: http://arxiv.org/abs/2309.01104
  • repo_url: None
  • paper_authors: Weijie Wang, Zhengyu Zhao, Nicu Sebe, Bruno Lepri
  • for: Evaluating the robustness of deepfake detectors to realistic facial changes.
  • methods: Adversarial head turn (AdvHeat), the first 3D adversarial face-view attack against deepfake detectors, based on face view synthesis from a single-view fake image.
  • results: Extensive experiments on various detectors validate their vulnerability: in a realistic black-box scenario, AdvHeat with a simple random search achieves a 96.8% attack success rate using 360 search steps, and the step budget can be reduced to 50 when additional query access is allowed.
    Abstract Malicious use of deepfakes leads to serious public concerns and reduces people's trust in digital media. Although effective deepfake detectors have been proposed, they are substantially vulnerable to adversarial attacks. To evaluate the detector's robustness, recent studies have explored various attacks. However, all existing attacks are limited to 2D image perturbations, which are hard to translate into real-world facial changes. In this paper, we propose adversarial head turn (AdvHeat), the first attempt at 3D adversarial face views against deepfake detectors, based on face view synthesis from a single-view fake image. Extensive experiments validate the vulnerability of various detectors to AdvHeat in realistic, black-box scenarios. For example, AdvHeat based on a simple random search yields a high attack success rate of 96.8% with 360 searching steps. When additional query access is allowed, we can further reduce the step budget to 50. Additional analyses demonstrate that AdvHeat is better than conventional attacks on both the cross-detector transferability and robustness to defenses. The adversarial images generated by AdvHeat are also shown to have natural looks. Our code, including that for generating a multi-view dataset consisting of 360 synthetic views for each of 1000 IDs from FaceForensics++, is available at https://github.com/twowwj/AdvHeaT.
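
The black-box random search can be sketched as: repeatedly sample head-turn parameters, synthesize the corresponding face view from the single fake image, and keep the view that most lowers the detector's fake score. The view synthesizer, angle ranges, and the 0.5 decision threshold below are assumptions for illustration.

```python
import numpy as np

def advheat_random_search(fake_image, synthesize_view, detector, steps=360, rng=None):
    """Black-box random search over head-turn parameters (a sketch of the idea).

    synthesize_view(image, yaw, pitch): novel face-view synthesis from a single fake
        image (placeholder for a face view generator; an assumption here).
    detector(image): returns the probability that the image is fake.
    """
    rng = rng or np.random.default_rng()
    best_view, best_score = fake_image, detector(fake_image)
    for _ in range(steps):
        yaw = rng.uniform(-45.0, 45.0)          # candidate head-turn angles (assumed range)
        pitch = rng.uniform(-15.0, 15.0)
        view = synthesize_view(fake_image, yaw, pitch)
        score = detector(view)
        if score < best_score:                  # keep the least "fake-looking" view
            best_view, best_score = view, score
        if best_score < 0.5:                    # detector now judges the view as real
            break
    return best_view, best_score
```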
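
A minimal sketch of the random-search idea described above: sample candidate head-turn angles, synthesize the corresponding face view, and stop when the detector is fooled. The `synthesize_view` and `detector` callables and the yaw range are hypothetical stand-ins, not the AdvHeat API.

```python
import random

def random_search_attack(fake_image, synthesize_view, detector,
                         max_steps=360, yaw_range=(-45.0, 45.0)):
    """Query-based black-box attack over head-turn (yaw) angles."""
    for step in range(max_steps):
        yaw = random.uniform(*yaw_range)           # candidate head-turn angle
        candidate = synthesize_view(fake_image, yaw)
        if detector(candidate) == "real":          # detector fooled
            return candidate, yaw, step + 1
    return None, None, max_steps                   # attack failed within the budget
```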

Dual Adversarial Resilience for Collaborating Robust Underwater Image Enhancement and Perception

  • paper_url: http://arxiv.org/abs/2309.01102
  • repo_url: None
  • paper_authors: Zengxi Zhang, Zhiying Jiang, Zeru Shi, Jinyuan Liu, Risheng Liu
  • for: Improving the visibility and correcting the color deviations of underwater images, while also improving the accuracy of subsequent perception tasks.
  • methods: Proposes a collaborative adversarial resilience network (CARNet) that combines an invertible network, a synchronized attack training strategy with both visual-driven and perception-driven attacks, and an attack pattern discriminator to improve the robustness of image enhancement and detection.
  • results: Experiments show that the method produces visually appealing enhanced images and achieves detection mAP 6.71% higher on average than state-of-the-art methods.
    Abstract Due to the uneven scattering and absorption of different light wavelengths in aquatic environments, underwater images suffer from low visibility and clear color deviations. With the advancement of autonomous underwater vehicles, extensive research has been conducted on learning-based underwater enhancement algorithms. These works can generate visually pleasing enhanced images and mitigate the adverse effects of degraded images on subsequent perception tasks. However, learning-based methods are susceptible to the inherent fragility of adversarial attacks, causing significant disruption in results. In this work, we introduce a collaborative adversarial resilience network, dubbed CARNet, for underwater image enhancement and subsequent detection tasks. Concretely, we first introduce an invertible network with strong perturbation-perceptual abilities to isolate attacks from underwater images, preventing interference with image enhancement and perceptual tasks. Furthermore, we propose a synchronized attack training strategy with both visual-driven and perception-driven attacks enabling the network to discern and remove various types of attacks. Additionally, we incorporate an attack pattern discriminator to heighten the robustness of the network against different attacks. Extensive experiments demonstrate that the proposed method outputs visually appealing enhanced images and achieves, on average, 6.71% higher detection mAP than state-of-the-art methods.
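
A rough sketch of the synchronized-attack idea, assuming single-step gradient perturbations: one perturbation is driven by the enhancement (visual) loss and another by the downstream detection (perception) loss, and both are fed back into training. This is an assumption about the mechanism rather than the CARNet code; the invertible network and attack pattern discriminator are omitted, and the loss functions and ground truths are placeholders.

```python
import torch

def synchronized_attacks(enhancer, detector, images, gt_clean, gt_boxes,
                         enh_loss_fn, det_loss_fn, eps=2/255):
    """Return one visual-driven and one perception-driven adversarial input."""
    images = images.clone().requires_grad_(True)
    enhanced = enhancer(images)
    enh_loss = enh_loss_fn(enhanced, gt_clean)             # visual-driven objective
    det_loss = det_loss_fn(detector(enhanced), gt_boxes)   # perception-driven objective
    g_vis = torch.autograd.grad(enh_loss, images, retain_graph=True)[0]
    g_per = torch.autograd.grad(det_loss, images)[0]
    visual_adv = (images + eps * g_vis.sign()).clamp(0, 1).detach()
    percep_adv = (images + eps * g_per.sign()).clamp(0, 1).detach()
    return visual_adv, percep_adv
```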

Enhancing Infrared Small Target Detection Robustness with Bi-Level Adversarial Framework

  • paper_url: http://arxiv.org/abs/2309.01099
  • repo_url: None
  • paper_authors: Zhu Liu, Zihang Chen, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
  • for: Improving the robustness of small infrared target detection against blurred and cluttered backgrounds.
  • methods: Proposes a bi-level adversarial framework: learnable corruption generation maximizes the losses as the lower-level objective, while robustness promotion of the detector is the upper-level objective; a hierarchical reinforced learning strategy discovers the most detrimental corruptions and balances robustness against accuracy.
  • results: Improves IoU by 21.96% across a wide range of corruptions and by 4.97% on the general benchmark.
    Abstract The detection of small infrared targets against blurred and cluttered backgrounds has remained an enduring challenge. In recent years, learning-based schemes have become the mainstream methodology to establish the mapping directly. However, these methods are susceptible to the inherent complexities of changing backgrounds and real-world disturbances, leading to unreliable and compromised target estimations. In this work, we propose a bi-level adversarial framework to promote the robustness of detection in the presence of distinct corruptions. We first propose a bi-level optimization formulation to introduce dynamic adversarial learning. Specifically, it is composed of the learnable generation of corruptions to maximize the losses as the lower-level objective and the robustness promotion of detectors as the upper-level one. We also provide a hierarchical reinforced learning strategy to discover the most detrimental corruptions and balance the performance between robustness and accuracy. To better disentangle the corruptions from salient features, we also propose a spatial-frequency interaction network for target detection. Extensive experiments demonstrate that our scheme remarkably improves IoU by 21.96% across a wide array of corruptions and notably improves IoU by 4.97% on the general benchmark. The source code is available at https://github.com/LiuZhu-CV/BALISTD.
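
As a simplified illustration of the bi-level loop described above (a sketch under strong assumptions, not the released BALISTD code): the lower level searches a candidate set of corruption functions for the one that maximizes the detector loss, and the upper level updates the detector on that worst-case corrupted batch.

```python
import torch

def worst_corruption(model, images, masks, loss_fn, corruptions):
    """Lower level: pick the corruption callable (from a candidate set) that hurts most."""
    with torch.no_grad():
        losses = [loss_fn(model(c(images)), masks) for c in corruptions]
    return corruptions[int(torch.stack(losses).argmax())]

def bilevel_step(model, optimizer, images, masks, loss_fn, corruptions):
    """Upper level: promote robustness on the worst-case corrupted inputs."""
    corrupt = worst_corruption(model, images, masks, loss_fn, corruptions)
    optimizer.zero_grad()
    loss = loss_fn(model(corrupt(images)), masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```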

CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

  • paper_url: http://arxiv.org/abs/2309.01093
  • repo_url: None
  • paper_authors: Jiajin Tang, Ge Zheng, Jingyi Yu, Sibei Yang
  • for: Detecting object instances in an image that can afford a given task, where the challenge is that the object categories suitable for a task are too diverse to be covered by the closed vocabulary of traditional object detection.
  • methods: Reasons about fundamental affordances rather than object categories, extracting affordance knowledge from large language models with multi-level chain-of-thought (MLCoT) prompting and conditioning the detector on this knowledge (CoTDet).
  • results: CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why detected objects afford the task.
    Abstract Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge lies in object categories available for the task being too diverse to be limited to a closed set of object vocabulary for traditional object detection. Simply mapping categories and visual features of common objects to the task cannot address the challenge. In this paper, we propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task. Moreover, we propose a novel multi-level chain-of-thought prompting (MLCoT) to extract the affordance knowledge from large language models, which contains multi-level reasoning steps from task to object examples to essential visual attributes with rationales. Furthermore, to fully exploit knowledge to benefit object recognition and localization, we propose a knowledge-conditional detection framework, namely CoTDet. It conditions the detector from the knowledge to generate object queries and regress boxes. Experimental results demonstrate that our CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why objects are detected to afford the task.
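
A schematic sketch of multi-level chain-of-thought prompting for affordance knowledge. The prompt wording and the `llm` callable are illustrative assumptions, not the exact prompts used by CoTDet.

```python
def affordance_knowledge(task: str, llm) -> dict:
    """Three reasoning levels: task -> example objects -> shared attributes -> prompts."""
    # Level 1: task -> example objects that can afford it
    objects = llm(f"List common objects a person could use to {task}.")
    # Level 2: objects -> shared functional/visual attributes, with rationale
    affordance = llm(
        f"Objects such as {objects} can all be used to {task}. "
        "What visual attributes do they share that make this possible? Explain why."
    )
    # Level 3: distill the attributes into short phrases usable to condition a detector
    attributes = llm(f"Summarize these attributes as short visual phrases: {affordance}")
    return {"objects": objects, "rationale": affordance, "attributes": attributes}
```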

Face Clustering for Connection Discovery from Event Images

  • paper_url: http://arxiv.org/abs/2309.01092
  • repo_url: None
  • paper_authors: Ming Cheung
  • for: Discovering social connections from event images so that social-graph-based applications become possible without access to online social graphs.
  • methods: Proposes a face clustering method that discovers social connections from the co-occurrence of faces in event images, without requiring the identities of the participants.
  • results: On over 40,000 faces from more than 3,000 participants, faces are clustered with an F1 score of 80% and social graphs can be constructed.
    Abstract Social graphs are very useful for many applications, such as recommendations and community detection. However, they are only accessible to big social network operators due to both data availability and privacy concerns. Event images also capture the interactions among the participants, from which social connections can be discovered to form a social graph. Unlike online social graphs, social connections carried by event images can be extracted without user inputs, and hence many social graph-based applications become possible, even without access to online social graphs. This paper proposes a system to discover social connections from event images. By utilizing the social information from event images, such as co-occurrence, a face clustering method is proposed and implemented, and connections can be discovered without the identities of the event participants. By collecting over 40000 faces from over 3000 participants, it is shown that the faces can be well clustered with an F1 score of 80%, and social graphs can be constructed. Utilizing offline event images may create a long-term impact on social network analytics.
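
A minimal sketch of one plausible pipeline for the idea above (assumed, not the paper's implementation): cluster face embeddings so that each cluster stands for one person, then connect clusters whose members co-occur in the same event photo; edge weights count co-occurrences.

```python
from collections import defaultdict
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def build_social_graph(face_embeddings: np.ndarray, image_ids: list):
    """face_embeddings: (N, D) array of face features; image_ids: photo id per face."""
    labels = DBSCAN(eps=0.6, min_samples=2, metric="cosine").fit_predict(face_embeddings)
    faces_per_image = defaultdict(set)
    for label, img in zip(labels, image_ids):
        if label != -1:                       # ignore unclustered (noise) faces
            faces_per_image[img].add(label)
    edges = defaultdict(int)
    for people in faces_per_image.values():
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1                # edge weight = number of co-occurrences
    return labels, dict(edges)
```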

MILA: Memory-Based Instance-Level Adaptation for Cross-Domain Object Detection

  • paper_url: http://arxiv.org/abs/2309.01086
  • repo_url: None
  • paper_authors: Onkar Krishna, Hiroki Ohashi, Saptarshi Sinha
  • for: Addressing cross-domain object detection, where the key difficulty is finding suitable source instances to align with target instances.
  • methods: Uses adversarial learning to align features at both the image and instance levels; a memory module stores the pooled features of all labeled source instances by class, and a simple yet effective retrieval module returns the most similar source instance for each target instance.
  • results: Significantly outperforms non-memory-based methods across a variety of domain shift scenarios.
    Abstract Cross-domain object detection is challenging, and it involves aligning labeled source and unlabeled target domains. Previous approaches have used adversarial training to align features at both image-level and instance-level. At the instance level, finding a suitable source sample that aligns with a target sample is crucial. A source sample is considered suitable if it differs from the target sample only in domain, without differences in unimportant characteristics such as orientation and color, which can hinder the model's focus on aligning the domain difference. However, existing instance-level feature alignment methods struggle to find suitable source instances because their search scope is limited to mini-batches. Mini-batches are often so small in size that they do not always contain suitable source instances. The insufficient diversity of mini-batches becomes problematic particularly when the target instances have high intra-class variance. To address this issue, we propose a memory-based instance-level domain adaptation framework. Our method aligns a target instance with the most similar source instance of the same category retrieved from a memory storage. Specifically, we introduce a memory module that dynamically stores the pooled features of all labeled source instances, categorized by their labels. Additionally, we introduce a simple yet effective memory retrieval module that retrieves a set of matching memory slots for target instances. Our experiments on various domain shift scenarios demonstrate that our approach outperforms existing non-memory-based methods significantly.
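
A simplified sketch of a class-wise feature memory with similarity-based retrieval. This is an assumption about the mechanism rather than the authors' code; in the paper the memory stores pooled features of labeled source instances, categorized by label.

```python
import torch
import torch.nn.functional as F

class InstanceMemory:
    def __init__(self):
        self.bank = {}                                     # class id -> list of features

    def store(self, feats: torch.Tensor, labels: torch.Tensor):
        """Add pooled source instance features, grouped by class label."""
        for f, y in zip(feats.detach(), labels.tolist()):
            self.bank.setdefault(y, []).append(f)

    def retrieve(self, query: torch.Tensor, label: int) -> torch.Tensor:
        """Return the stored source feature of class `label` most similar to `query`."""
        slots = torch.stack(self.bank[label])              # (M, D)
        sims = F.cosine_similarity(slots, query.unsqueeze(0))
        return slots[sims.argmax()]
```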

Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning

  • paper_url: http://arxiv.org/abs/2309.01083
  • repo_url: None
  • paper_authors: Haiyang Yu, Xiaocong Wang, Bin Li, Xiangyang Xue
  • for: Improving the accuracy and generalizability of Chinese text recognition.
  • methods: Pre-trains a CLIP-like model by aligning printed Chinese character images with Ideographic Description Sequences (IDS), then uses the learned representations to supervise the text recognition model.
  • results: Performs best in Chinese character recognition and outperforms previous methods in most scenarios of the Chinese text recognition benchmark, recognizing zero-shot characters without fine-tuning.
    Abstract Scene text recognition has been studied for decades due to its broad applications. However, despite Chinese characters possessing different characteristics from Latin characters, such as complex inner structures and large categories, few methods have been proposed for Chinese Text Recognition (CTR). Particularly, the characteristic of large categories poses challenges in dealing with zero-shot and few-shot Chinese characters. In this paper, inspired by the way humans recognize Chinese texts, we propose a two-stage framework for CTR. Firstly, we pre-train a CLIP-like model through aligning printed character images and Ideographic Description Sequences (IDS). This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character. Subsequently, the learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition through image-IDS matching. To evaluate the effectiveness of the proposed method, we conduct extensive experiments on both Chinese character recognition (CCR) and CTR. The experimental results demonstrate that the proposed method performs best in CCR and outperforms previous methods in most scenarios of the CTR benchmark. It is worth noting that the proposed method can recognize zero-shot Chinese characters in text images without fine-tuning, whereas previous methods require fine-tuning when new classes appear. The code is available at https://github.com/FudanVI/FudanOCR/tree/main/image-ids-CTR.
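
A rough sketch of CLIP-style contrastive alignment between printed character images and their IDS descriptions. The encoders, tokenization, and temperature are assumptions; the released FudanOCR code should be consulted for the actual training objective.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_encoder, ids_encoder, images, ids_tokens, temperature=0.07):
    """Symmetric InfoNCE: each character image should match its own IDS and vice versa."""
    img = F.normalize(image_encoder(images), dim=-1)       # (B, D)
    txt = F.normalize(ids_encoder(ids_tokens), dim=-1)     # (B, D)
    logits = img @ txt.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(images.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```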

Orientation-Independent Chinese Text Recognition in Scene Images

  • paper_url: http://arxiv.org/abs/2309.01081
  • repo_url: None
  • paper_authors: Haiyang Yu, Xiaocong Wang, Bin Li, Xiangyang Xue
  • for: Improving the accuracy of Chinese text recognition in natural scenes, where both horizontal and vertical texts occur.
  • methods: Introduces a Character Image Reconstruction Network (CIRN) that disentangles content and orientation information to extract orientation-independent visual features, enabling robust recognition of both horizontal and vertical text.
  • results: Improves performance on a scene benchmark for Chinese text recognition and yields a 45.63% improvement on the newly collected Vertical Chinese Text Recognition (VCTR) dataset when CIRN is added to the baseline model.
    Abstract Scene text recognition (STR) has attracted much attention due to its broad applications. The previous works pay more attention to dealing with the recognition of Latin text images with complex backgrounds by introducing language models or other auxiliary networks. Different from Latin texts, many vertical Chinese texts exist in natural scenes, which brings difficulties to current state-of-the-art STR methods. In this paper, we take the first attempt to extract orientation-independent visual features by disentangling content and orientation information of text images, thus recognizing both horizontal and vertical texts robustly in natural scenes. Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information. We conduct experiments on a scene dataset for benchmarking Chinese text recognition, and the results demonstrate that the proposed method can indeed improve performance through disentangling content and orientation information. To further validate the effectiveness of our method, we additionally collect a Vertical Chinese Text Recognition (VCTR) dataset. The experimental results show that the proposed method achieves 45.63% improvement on VCTR when introducing CIRN to the baseline model.
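
A highly simplified sketch of the disentanglement idea (an assumption, not the paper's CIRN): separate encoders produce content and orientation codes, and a decoder must reconstruct the canonical printed character from the content code alone, pushing orientation information out of the content features.

```python
import torch.nn as nn
import torch.nn.functional as F

class CharReconstruction(nn.Module):
    def __init__(self, content_enc, orient_enc, decoder):
        super().__init__()
        self.content_enc, self.orient_enc, self.decoder = content_enc, orient_enc, decoder

    def forward(self, scene_char):
        content = self.content_enc(scene_char)     # intended: orientation-independent features
        orient = self.orient_enc(scene_char)       # nuisance factors (rotation, layout)
        recon = self.decoder(content)              # reconstruct the printed character
        return recon, content, orient

def reconstruction_loss(recon, printed_char):
    """Supervise the decoder with the canonical printed character image."""
    return F.l1_loss(recon, printed_char)
```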

Robust Adversarial Defense by Tensor Factorization

  • paper_url: http://arxiv.org/abs/2309.01077
  • repo_url: None
  • paper_authors: Manish Bhattarai, Mehmet Cagri Kaymak, Ryan Barron, Ben Nebgen, Kim Rasmussen, Boian Alexandrov
  • for: Defending machine learning models against adversarial attacks, which can severely degrade or even break model performance.
  • methods: Combines tensorization of the input data with low-rank decomposition and tensorization of the neural network parameters to form a robust defense.
  • results: Maintains robust accuracy even under the strongest known auto-attacks, matching and exceeding existing defenses that rely on tensor factorizations.
    Abstract As machine learning techniques become increasingly prevalent in data analysis, the threat of adversarial attacks has surged, necessitating robust defense mechanisms. Among these defenses, methods exploiting low-rank approximations for input data preprocessing and neural network (NN) parameter factorization have shown potential. Our work advances this field further by integrating the tensorization of input data with low-rank decomposition and tensorization of NN parameters to enhance adversarial defense. The proposed approach demonstrates significant defense capabilities, maintaining robust accuracy even when subjected to the strongest known auto-attacks. Evaluations against leading-edge robust performance benchmarks reveal that our results not only hold their ground against the best defensive methods available but also exceed all current defense strategies that rely on tensor factorizations. This study underscores the potential of integrating tensorization and low-rank decomposition as a robust defense against adversarial attacks in machine learning.
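
As a small illustration of low-rank input preprocessing, one ingredient of the defense family discussed above: project each image channel onto its top-k singular components before classification. The full method also tensorizes and factorizes the network parameters, which is not shown here, and the rank is a placeholder.

```python
import numpy as np

def low_rank_channel(x: np.ndarray, k: int) -> np.ndarray:
    """x: (H, W) single channel; return its rank-k approximation via truncated SVD."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def low_rank_preprocess(image: np.ndarray, k: int = 20) -> np.ndarray:
    """image: (H, W, C); apply the rank-k approximation channel-wise."""
    return np.stack([low_rank_channel(image[..., c], k)
                     for c in range(image.shape[-1])], axis=-1)
```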

Muti-Stage Hierarchical Food Classification

  • paper_url: http://arxiv.org/abs/2309.01075
  • repo_url: None
  • paper_authors: Xinyue Pan, Jiangpeng He, Fengqing Zhu
  • for: Improving the accuracy of food image classification so that nutritional composition information can be derived from captured food images.
  • methods: Proposes a multi-stage hierarchical framework that iteratively clusters and merges food items during training, enabling the deep model to extract image features that are discriminative across labels.
  • results: Achieves promising results on the VFN-nutrient dataset compared with existing work, for both food type and food item classification.
    Abstract Food image classification serves as a fundamental and critical step in image-based dietary assessment, facilitating nutrient intake analysis from captured food images. However, existing works in food classification predominantly focus on predicting 'food types', which do not contain direct nutritional composition information. This limitation arises from the inherent discrepancies in nutrition databases, which are tasked with associating each 'food item' with its respective information. Therefore, in this work we aim to classify food items to align with a nutrition database. To this end, we first introduce the VFN-nutrient dataset by annotating each food image in VFN with a food item that includes nutritional composition information. Such annotation of food items, being more discriminative than food types, creates a hierarchical structure within the dataset. However, since the food item annotations are solely based on nutritional composition information, they do not always show visual relations with each other, which poses significant challenges when applying deep learning-based techniques for classification. To address this issue, we then propose a multi-stage hierarchical framework for food item classification by iteratively clustering and merging food items during the training process, which allows the deep model to extract image features that are discriminative across labels. Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.
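
A rough sketch of one clustering-and-merging stage under assumed mechanics (not the paper's exact procedure): food items with similar visual prototypes are merged into coarser labels for the current training stage, and the grouping can be recomputed with updated prototypes before the next, finer stage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merge_food_items(class_prototypes: np.ndarray, n_groups: int) -> dict:
    """class_prototypes: (num_items, D) mean feature per food item.
    Returns a mapping from food item id to its merged (coarser) label."""
    groups = AgglomerativeClustering(n_clusters=n_groups).fit_predict(class_prototypes)
    return {item: int(g) for item, g in enumerate(groups)}

# Usage idea: relabel the training set with the merged labels for this stage, train,
# recompute prototypes with the updated model, and repeat with a larger n_groups.
```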

Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding

  • paper_url: http://arxiv.org/abs/2309.01073
  • repo_url: https://github.com/ChengShiest/REP-ERU
  • paper_authors: Cheng Shi, Sibei Yang
  • for: Embodied reference understanding, where a receiver must locate the target object referred to by both the language and the gesture of the sender in a shared physical environment.
  • methods: Proposes REasoning from your Perspective (REP), which tackles spatial and visual perspective-taking by modeling the relations between the receiver and the sender and between the sender and the objects via view rotation and relation reasoning.
  • results: Consistently surpasses all state-of-the-art algorithms by a large margin, e.g., +5.22% absolute accuracy in terms of Prec0.5 on YouRefIt.
    Abstract Embodied Reference Understanding studies the reference understanding in an embodied fashion, where a receiver is required to locate a target object referred to by both language and gesture of the sender in a shared physical environment. Its main challenge lies in how to make the receiver with the egocentric view access spatial and visual information relative to the sender to judge how objects are oriented around and seen from the sender, i.e., spatial and visual perspective-taking. In this paper, we propose a REasoning from your Perspective (REP) method to tackle the challenge by modeling relations between the receiver and the sender and the sender and the objects via the proposed novel view rotation and relation reasoning. Specifically, view rotation first rotates the receiver to the position of the sender by constructing an embodied 3D coordinate system with the position of the sender as the origin. Then, it changes the orientation of the receiver to the orientation of the sender by encoding the body orientation and gesture of the sender. Relation reasoning models the nonverbal and verbal relations between the sender and the objects by multi-modal cooperative reasoning in gesture, language, visual content, and spatial position. Experiment results demonstrate the effectiveness of REP, which consistently surpasses all existing state-of-the-art algorithms by a large margin, i.e., +5.22% absolute accuracy in terms of Prec0.5 on YouRefIt.
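
A small geometry-only sketch of the view-rotation step: object positions observed from the receiver's egocentric view are re-expressed in a frame whose origin is the sender and whose orientation follows the sender's body/gesture direction. The inputs are hypothetical; the REP model learns this jointly with relation reasoning over gesture, language, and visual content.

```python
import numpy as np

def to_sender_frame(points: np.ndarray, sender_pos: np.ndarray, sender_yaw: float):
    """points: (N, 2) positions on the ground plane in the receiver's frame;
    sender_pos: (2,) sender position; sender_yaw: sender heading in radians."""
    c, s = np.cos(-sender_yaw), np.sin(-sender_yaw)
    rot = np.array([[c, -s], [s, c]])             # rotation by -yaw
    return (points - sender_pos) @ rot.T          # translate to sender, then rotate
```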

Channel Attention Separable Convolution Network for Skin Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2309.01072
  • repo_url: None
  • paper_authors: Changlu Guo, Jiangyan Dai, Marton Szemenyei, Yugen Yi
  • for: Early diagnosis of malignant skin lesions by making lesion segmentation more accurate and efficient.
  • methods: A new network, the Channel Attention Separable Convolution Network (CASCN), built on U-Net, DenseNet, separable convolution, channel attention, and Atrous Spatial Pyramid Pooling (ASPP).
  • results: Evaluated on the PH2 dataset; without excessive pre-/post-processing, CASCN achieves state-of-the-art performance with a Dice similarity coefficient of 0.9461 and an accuracy of 0.9645.
    Abstract Skin cancer is a frequently occurring cancer in the human population, and it is very important to be able to diagnose malignant tumors in the body early. Lesion segmentation is crucial for monitoring the morphological changes of skin lesions, extracting features to localize and identify diseases to assist doctors in early diagnosis. Manual segmentation of dermoscopic images is error-prone and time-consuming, thus there is a pressing demand for precise and automated segmentation algorithms. Inspired by advanced mechanisms such as U-Net, DenseNet, Separable Convolution, Channel Attention, and Atrous Spatial Pyramid Pooling (ASPP), we propose a novel network called Channel Attention Separable Convolution Network (CASCN) for skin lesion segmentation. The proposed CASCN is evaluated on the PH2 dataset with limited images. Without excessive pre-/post-processing of images, CASCN achieves state-of-the-art performance on the PH2 dataset with a Dice similarity coefficient of 0.9461 and accuracy of 0.9645.
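
A minimal PyTorch sketch of the two named building blocks, a depthwise separable convolution followed by squeeze-and-excitation style channel attention, under common definitions. This is a generic block for illustration, not the exact CASCN architecture.

```python
import torch
import torch.nn as nn

class ChannelAttentionSepConv(nn.Module):
    def __init__(self, in_ch, out_ch, reduction=8):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.att = nn.Sequential(                 # squeeze-and-excitation attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        x = torch.relu(self.bn(self.pointwise(self.depthwise(x))))
        return x * self.att(x)                    # reweight channels
```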

Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm

  • paper_url: http://arxiv.org/abs/2309.01063
  • repo_url: None
  • paper_authors: Yintai Ma, Diego Klabjan
  • for: A semi-supervised deep learning algorithm for retrieving similar 2D and 3D videos based on visual content.
  • methods: Combines deep convolutional and recurrent neural networks with bi-directional dynamic time warping as the similarity measure over sequences of clip embeddings.
  • results: Tested on multiple public datasets and outperforms state-of-the-art deep learning models on video retrieval tasks.
    Abstract This paper presents a novel semi-supervised deep learning algorithm for retrieving similar 2D and 3D videos based on visual content. The proposed approach combines the power of deep convolutional and recurrent neural networks with dynamic time warping as a similarity measure. The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip based on its graphical frames and contents. We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network. We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method. This approach is tested on multiple public datasets, including CC\_WEB\_VIDEO, Youtube-8m, S3DIS, and Synthia, and showed good results compared to state-of-the-art. The algorithm effectively solves video retrieval tasks and outperforms the benchmarked state-of-the-art deep learning model.
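
For reference, a standard dynamic-time-warping distance between two sequences of clip embeddings with a cosine cost; the paper uses a bi-directional DTW variant, which is not reproduced here.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (n, D) and b: (m, D) sequences of clip embedding vectors."""
    def cost(x, y):                                 # cosine distance between two clips
        return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(a[i - 1], b[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```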

Efficient Curriculum based Continual Learning with Informative Subset Selection for Remote Sensing Scene Classification

  • paper_url: http://arxiv.org/abs/2309.01050
  • repo_url: None
  • paper_authors: S Divakar Bhat, Biplab Banerjee, Subhasis Chaudhuri, Avik Bhattacharya
  • for: Class incremental learning (CIL) for land-cover classification from optical remote sensing images, where new classes are discovered sequentially at different geographical locations.
  • methods: A replay-memory-based CIL framework that learns a curriculum for new classes based on their similarity to the old classes, and selects highly confident samples from old streams when building the replay memory to reduce the effect of noise.
  • results: Experiments on NWPU-RESISC45, PatternNet, and EuroSAT show improved CIL performance and a better stability-plasticity trade-off than prior methods.
    Abstract We tackle the problem of class incremental learning (CIL) in the realm of landcover classification from optical remote sensing (RS) images in this paper. The paradigm of CIL has recently gained much prominence given the fact that data are generally obtained in a sequential manner for real-world phenomena. However, CIL has not been extensively considered yet in the domain of RS, even though satellites tend to discover new classes at different geographical locations over time. With this motivation, we propose a novel CIL framework inspired by the recent success of replay-memory based approaches and tackling two of their shortcomings. In order to reduce the effect of catastrophic forgetting of the old classes when a new stream arrives, we learn a curriculum of the new classes based on their similarity with the old classes. This is found to limit the degree of forgetting substantially. Next, while constructing the replay memory, instead of randomly selecting samples from the old streams, we propose a sample selection strategy which ensures the selection of highly confident samples so as to reduce the effects of noise. We observe a sharp improvement in the CIL performance with the proposed components. Experimental results on the benchmark NWPU-RESISC45, PatternNet, and EuroSAT datasets confirm that our method offers an improved stability-plasticity trade-off compared to the literature.
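
A brief sketch of confidence-based replay selection under an assumed criterion (not necessarily the authors' exact rule): for each old class, keep the samples the current model classifies most confidently, and store those in the replay memory.

```python
import torch
import torch.nn.functional as F

def select_replay(model, images, labels, per_class: int):
    """Return indices of the most confidently classified samples, per class."""
    with torch.no_grad():
        conf = F.softmax(model(images), dim=1).max(dim=1).values
    keep = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        top = idx[conf[idx].argsort(descending=True)[:per_class]]
        keep.append(top)
    return torch.cat(keep)          # indices of samples to store in the replay memory
```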