2023-08-13

cs.CV

Modified Topological Image Preprocessing for Skin Lesion Classifications

paper_url: http://arxiv.org/abs/2308.06796
repo_url: None
paper_authors: Hong Cheng, Rebekah Leamons, Ahmad Al Shami
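for: The paper proposes a modified topological data analysis model for preprocessing and enhancing skin images.
methods: The modified topological data analysis method preprocesses the HAM10000 images; deep convolutional neural network and Vision Transformer models are then trained on both the original and the preprocessed dataset.
results: Experiments show that preprocessing skin images with the modified topological data analysis method improves performance.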

Abstract
This paper proposes a modified Topological Data Analysis model for skin image preprocessing and enhancement. The skin lesion dataset HAM10000 is used, with the intention of identifying the important objects in relevant regions of the images. To evaluate both the original dataset and the preprocessed dataset, Deep Convolutional Neural Network and Vision Transformer models were trained on each. After training, the experimental results demonstrate that the images preprocessed using the Modified Topological Data Analysis consistently perform better.

PV-SSD: A Projection and Voxel-based Double Branch Single-Stage 3D Object Detector

paper_url: http://arxiv.org/abs/2308.06791
repo_url: None
paper_authors: Yongxin Shao, Aihong Tan, Zhetao Sun, Enhui Zheng, Tianhong Yan
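for: Improve the accuracy of 3D object detection and classification from LIDAR data for autonomous driving.
methods: Proposes PV-SSD, a double-branch feature extraction method that fuses voxel features with projected features to reduce the information lost during projection.
results: The method performs well compared with previous work. The paper also contributes: 1) a voxel feature extraction method with variable receptive fields; 2) a weight-based feature point sampling method; 3) the MSSFA module, built on the SSFA module.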

Abstract
LIDAR-based 3D object detection and classification is crucial for autonomous driving. However, inference in real-time from extremely sparse 3D data poses a formidable challenge. To address this issue, a common approach is to project point clouds onto a bird's-eye or perspective view, effectively converting them into an image-like data format. However, this excessive compression of point cloud data often leads to the loss of information. This paper proposes a 3D object detector based on voxel and projection double branch feature extraction (PV-SSD) to address the problem of information loss. We add voxel features input containing rich local semantic information, which is fully fused with the projected features in the feature extraction stage to reduce the local information loss caused by projection. A good performance is achieved compared to the previous work. In addition, this paper makes the following contributions: 1) a voxel feature extraction method with variable receptive fields is proposed; 2) a feature point sampling method by weight sampling is used to filter out the feature points that are more conducive to the detection task; 3) the MSSFA module is proposed based on the SSFA module. To verify the effectiveness of our method, we designed comparison experiments.

RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks

paper_url: http://arxiv.org/abs/2308.06787
repo_url: None
paper_authors: Yufei Guo, Xiaode Liu, Yuanpei Chen, Liwen Zhang, Weihang Peng, Yuhan Zhang, Xuhui Huang, Zhe Ma
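for: The paper addresses the quantization error problem in spiking neural networks and proposes a simple, intuitive training method to reduce its impact.
methods: The method is a regularization term called the regularizing membrane potential loss (RMP-Loss), which minimizes the impact of quantization error. It is very simple to implement and makes training the network straightforward.
results: Experiments show that training with RMP-Loss effectively reduces the quantization error problem and outperforms other known methods across different network architectures and datasets.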

Abstract
Spiking Neural Networks (SNNs), as one of the biology-inspired models, have received much attention recently. They can significantly reduce energy consumption because they quantize the real-valued membrane potentials to 0/1 spikes to transmit information, so the multiplications of activations and weights can be replaced by additions when implemented on hardware. However, this quantization mechanism inevitably introduces quantization error, causing catastrophic information loss. To address the quantization error problem, we propose a regularizing membrane potential loss (RMP-Loss) that adjusts the membrane potential distribution, which is directly related to quantization error, to a range close to the spikes. Our method is extremely simple to implement and makes training an SNN straightforward. Furthermore, it is shown to consistently outperform previous state-of-the-art methods over different network architectures and datasets.
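
The idea of pulling membrane potentials toward the spike values can be illustrated with a small sketch. It assumes a firing threshold of 0.5, spike values 0 and 1, and a squared-distance penalty. The function names and the exact form of the penalty are illustrative assumptions, not the definition used in the paper.

```python
def fire(u, threshold=0.5):
    """Quantize a real-valued membrane potential to a 0/1 spike."""
    return 1.0 if u >= threshold else 0.0


def rmp_style_penalty(potentials, threshold=0.5):
    """Mean squared gap between each potential and the spike it quantizes to.

    Hypothetical regularizer: it is small when potentials cluster near 0 or 1,
    which is when quantization discards little information.
    """
    gaps = [(u - fire(u, threshold)) ** 2 for u in potentials]
    return sum(gaps) / len(gaps)


spread_out = [0.45, 0.55, 0.40, 0.60]   # far from both spike values
near_spikes = [0.05, 0.95, 0.02, 0.98]  # close to 0 or 1

print(rmp_style_penalty(spread_out), rmp_style_penalty(near_spikes))
```

In a training loop, a term like this would presumably be added to the task loss with a weighting coefficient, which is also an assumption here.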

Shape-guided Conditional Latent Diffusion Models for Synthesising Brain Vasculature

paper_url: http://arxiv.org/abs/2308.06781
repo_url: None
paper_authors: Yash Deo, Haoran Dou, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
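for: The study aims to improve the understanding of cerebral vasculature for cerebrovascular disease research and clinical intervention by generating realistic 3D cerebral vasculature segmentations, including less common variations.
methods: The study uses a new generative model based on a conditional latent diffusion model with shape and anatomical guidance to generate realistic 3D cerebral vasculature segmentations covering different variations.
results: The generated segmentations are more realistic and show higher visual fidelity than those of other generative models such as conditional GAN and conditional VAE, with an FID score 53% better than the best-performing GAN-based model.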

Abstract
The Circle of Willis (CoW) is the part of cerebral vasculature responsible for delivering blood to the brain. Understanding the diverse anatomical variations and configurations of the CoW is paramount to advance research on cerebrovascular diseases and refine clinical interventions. However, comprehensive investigation of less prevalent CoW variations remains challenging because of the dominance of a few commonly occurring configurations. We propose a novel generative approach utilising a conditional latent diffusion model with shape and anatomical guidance to generate realistic 3D CoW segmentations, including different phenotypical variations. Our conditional latent diffusion model incorporates shape guidance to better preserve vessel continuity and demonstrates superior performance when compared to alternative generative models, including conditional variants of 3D GAN and 3D VAE. We observed that our model generated CoW variants that are more realistic and demonstrate higher visual fidelity than competing approaches with an FID score 53% better than the best-performing GAN-based model.

Neural Networks at a Fraction with Pruned Quaternions

paper_url: http://arxiv.org/abs/2308.06780
repo_url: https://github.com/smlab-niser/quartLT22
paper_authors: Sahel Mohammad Iqbal, Subhankar Mishra
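for: The study targets the deployment of deep learning models on devices with limited computational power.
methods: Pruning removes unnecessary parameters to reduce the resources needed for training and inference. In addition, higher-dimensional data embeddings such as complex numbers or quaternions can reduce the parameter count while maintaining accuracy. The study prunes real and quaternion-valued implementations of different architectures.
results: For some architectures and tasks, quaternion models are more accurate than their real counterparts at very high sparsity. For example, for image classification on CIFAR-10 with the Conv-4 architecture, the pruned quaternion model outperforms the pruned real model of the same architecture by more than 10%. The experiments suggest that in extremely resource-constrained environments, a sparse quaternion network may be a better deployment candidate than a sparse real model of similar architecture.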

Abstract
Contemporary state-of-the-art neural networks have increasingly large numbers of parameters, which prevents their deployment on devices with limited computational power. Pruning is one technique to remove unnecessary weights and reduce resource requirements for training and inference. In addition, for ML tasks where the input data is multi-dimensional, using higher-dimensional data embeddings such as complex numbers or quaternions has been shown to reduce the parameter count while maintaining accuracy. In this work, we conduct pruning on real and quaternion-valued implementations of different architectures on classification tasks. We find that for some architectures, at very high sparsity levels, quaternion models provide higher accuracies than their real counterparts. For example, at the task of image classification on CIFAR-10 using Conv-4, at $3\%$ of the number of parameters as the original model, the pruned quaternion version outperforms the pruned real by more than $10\%$. Experiments on various network architectures and datasets show that for deployment in extremely resource-constrained environments, a sparse quaternion network might be a better candidate than a real sparse model of similar architecture.
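
To make "keeping a small fraction of the parameters" concrete, here is a minimal sketch of a pruning mask. It assumes global magnitude pruning as the criterion, which the abstract does not specify, and uses a toy weight list with a 30% keep fraction instead of the 3% mentioned above so the result is easy to read. The helper name `magnitude_mask` is hypothetical.

```python
def magnitude_mask(weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude weights."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Indices ordered from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2, -0.4, 0.02, 0.6, -0.1]
mask = magnitude_mask(weights, keep_fraction=0.3)
pruned = [w * m for w, m in zip(weights, mask)]

print(mask)
```

The same mask logic would apply to a real or a quaternion-valued layer; only the way the surviving parameters are shared across the layer differs.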

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

paper_url: http://arxiv.org/abs/2308.06777
repo_url: https://github.com/LiheYoung/ShrinkMatch
paper_authors: Lihe Yang, Zhen Zhao, Lei Qi, Yu Qiao, Yinghuan Shi, Hengshuang Zhao
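for: The paper proposes ShrinkMatch, a new semi-supervised learning method that addresses potentially inaccurate pseudo labels by learning from uncertain samples instead of discarding them.
methods: ShrinkMatch turns uncertain samples into certain ones by removing confusion classes, and applies consistency regularization in the shrunk class space to learn better representations.
results: Experiments show that ShrinkMatch performs strongly on widely adopted benchmarks and compares favorably with other state-of-the-art methods.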

Abstract
Semi-supervised learning is attracting blooming attention, due to its success in combining unlabeled data. To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples. This practice ensures high-quality pseudo labels, but incurs a relatively low utilization of the whole unlabeled set. In this work, our key insight is that these uncertain samples can be turned into certain ones, as long as the confusion classes for the top-1 class are detected and removed. Invoked by this, we propose a novel method dubbed ShrinkMatch to learn uncertain samples. For each uncertain sample, it adaptively seeks a shrunk class space, which merely contains the original top-1 class, as well as remaining less likely classes. Since the confusion ones are removed in this space, the re-calculated top-1 confidence can satisfy the pre-defined threshold. We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations. Furthermore, considering the varied reliability among uncertain samples and the gradually improved model during training, we correspondingly design two reweighting principles for our uncertain loss. Our method exhibits impressive performance on widely adopted benchmarks. Code is available at https://github.com/LiheYoung/ShrinkMatch.
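
The class-space shrinking step described above can be sketched as follows. This is a simplified reading of the abstract, not the released implementation: it assumes that the confusion classes are the strongest rivals of the top-1 class, removes them one at a time, and renormalizes over the remaining classes until the top-1 confidence reaches the threshold. The function name and the toy probabilities are hypothetical.

```python
def shrink_class_space(probs, threshold=0.95):
    """Drop the strongest rivals of the top-1 class until it is confident.

    Returns (kept_class_indices, renormalized_top1_confidence).
    """
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    top1, rivals = order[0], order[1:]
    for n_removed in range(len(rivals) + 1):
        # Keep the top-1 class plus the less likely classes after the removed rivals.
        kept = [top1] + rivals[n_removed:]
        confidence = probs[top1] / sum(probs[c] for c in kept)
        if confidence >= threshold:
            return kept, confidence
    raise ValueError("threshold must be at most 1")


probs = [0.50, 0.40, 0.06, 0.02, 0.02]
kept, confidence = shrink_class_space(probs, threshold=0.8)

print(kept, round(confidence, 3))
```

In the toy example, class 1 is the only confusion class: once it is removed, class 0 holds about 83% of the remaining probability mass and passes the 0.8 threshold.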

Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches

paper_url: http://arxiv.org/abs/2308.06776
repo_url: https://github.com/linxin0/scpgabnet
paper_authors: Xin Lin, Chao Ren, Xiao Liu, Jie Huang, Yinjie Lei
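for: Improve image denoising in real-world scenarios without requiring large paired training datasets.
methods: An unsupervised method based on generative adversarial networks that improves the denoiser by iteratively replacing the denoiser in the filter-guided noise extraction module with the current, stronger one.
results: The method outperforms state-of-the-art unsupervised methods without increasing the inference complexity of the denoising framework.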

Abstract
Deep learning methods have shown remarkable performance in image denoising, particularly when trained on large-scale paired datasets. However, acquiring such paired datasets for real-world scenarios poses a significant challenge. Although unsupervised approaches based on generative adversarial networks offer a promising solution for denoising without paired datasets, they have difficulty surpassing the performance limitations of conventional GAN-based unsupervised frameworks without significantly modifying existing structures or increasing the computational complexity of denoisers. To address this problem, we propose a self-collaboration (SC) strategy for multiple denoisers. This strategy can achieve significant performance improvement without increasing the inference complexity of the GAN-based denoising framework. Its basic idea is to iteratively replace the previous less powerful denoiser in the filter-guided noise extraction module with the current powerful denoiser. This process generates better synthetic clean-noisy image pairs, leading to a more powerful denoiser for the next iteration. This baseline ensures the stability and effectiveness of the training network. The experimental results demonstrate the superiority of our method over state-of-the-art unsupervised methods.

A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations

paper_url: http://arxiv.org/abs/2308.06767
repo_url: https://github.com/hrcheng1066/awesome-pruning
paper_authors: Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
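for: The survey addresses the large compute and storage requirements of modern deep neural networks, so that models can be deployed and inference accelerated in resource-constrained environments.
methods: The paper gives a comprehensive review of existing deep neural network pruning techniques, organized by 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning with other compression techniques.
results: The paper compares seven pairs of contrast settings for pruning and explores emerging topics such as post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness), to clarify the commonalities and differences of existing methods and lay a foundation for future research.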

Abstract
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured) and explore emerging topics, including post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness) to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. To facilitate future research, we build a curated collection of datasets, networks, and evaluations on different applications. Finally, we provide some valuable recommendations on selecting pruning methods and prospect promising research directions. We build a repository at https://github.com/hrcheng1066/awesome-pruning.

Tissue Segmentation of Thick-Slice Fetal Brain MR Scans with Guidance from High-Quality Isotropic Volumes

paper_url: http://arxiv.org/abs/2308.06762
repo_url: None
paper_authors: Shijie Huang, Xukun Zhang, Zhiming Cui, He Zhang, Geng Chen, Dinggang Shen
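for: The paper aims to improve the accuracy of tissue segmentation in thick-slice fetal brain magnetic resonance (MR) scans.
methods: The paper uses a domain adaptation technique, with high-quality isotropic fetal brain MR volumes (and their corresponding annotations) as guidance for segmenting thick-slice scans.
results: Experiments show that C2DA-Net improves tissue segmentation on thick-slice fetal brain MR scans and outperforms cutting-edge methods.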

Abstract
Accurate tissue segmentation of thick-slice fetal brain magnetic resonance (MR) scans is crucial for both reconstruction of isotropic brain MR volumes and the quantification of fetal brain development. However, this task is challenging due to the use of thick-slice scans in clinically-acquired fetal brain data. To address this issue, we propose to leverage high-quality isotropic fetal brain MR volumes (and also their corresponding annotations) as guidance for segmentation of thick-slice scans. Due to existence of significant domain gap between high-quality isotropic volume (i.e., source data) and thick-slice scans (i.e., target data), we employ a domain adaptation technique to achieve the associated knowledge transfer (from high-quality volumes to thick-slice scans). Specifically, we first register the available high-quality isotropic fetal brain MR volumes across different gestational weeks to construct longitudinally-complete source data. To capture domain-invariant information, we then perform Fourier decomposition to extract image content and style codes. Finally, we propose a novel Cycle-Consistent Domain Adaptation Network (C2DA-Net) to efficiently transfer the knowledge learned from high-quality isotropic volumes for accurate tissue segmentation of thick-slice scans. Our C2DA-Net can fully utilize a small set of annotated isotropic volumes to guide tissue segmentation on unannotated thick-slice scans. Extensive experiments on a large-scale dataset of 372 clinically acquired thick-slice MR scans demonstrate that our C2DA-Net achieves much better performance than cutting-edge methods quantitatively and qualitatively.

Influence Function Based Second-Order Channel Pruning: Evaluating True Loss Changes For Pruning Is Possible Without Retraining

paper_url: http://arxiv.org/abs/2308.06755
repo_url: https://github.com/hrcheng1066/ifso
paper_authors: Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
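for: The paper proposes a new channel pruning method that selects the channels to prune more effectively.
methods: The method uses influence functions to estimate the true loss change caused by pruning a channel, without retraining the weights.
results: Experiments show that the method selects the channels to prune more reliably than existing methods. It also shows that true loss changes can be evaluated without retraining, which opens up possibilities for new pruning paradigms.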

Abstract
A challenge of channel pruning is designing efficient and effective criteria to select channels to prune. A widely used criterion is minimal performance degeneration. Accurately evaluating the true performance degeneration requires retraining the surviving weights to convergence, which is prohibitively slow. Hence existing pruning methods use previous weights (without retraining) to evaluate the performance degeneration. However, we observe that the loss changes differ significantly with and without retraining. This motivates us to develop a technique to evaluate true loss changes without retraining, with which channels to prune can be selected more reliably and confidently. We first derive a closed-form estimator of the true loss change per pruning mask change, using influence functions without retraining. The influence function, which comes from robust statistics, reveals the impact of a training sample on the model's prediction and is repurposed by us to assess impacts on true loss changes. We then show how to assess the importance of all channels simultaneously and develop a novel global channel pruning algorithm accordingly. We conduct extensive experiments to verify the effectiveness of the proposed algorithm. To the best of our knowledge, we are the first to show that evaluating true loss changes for pruning without retraining is possible. This finding will open up opportunities for a series of new paradigms to emerge that differ from existing pruning methods. The code is available at https://github.com/hrcheng1066/IFSO.

FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table

paper_url: http://arxiv.org/abs/2308.06749
repo_url: https://github.com/wenhao-li-777/fastllve
paper_authors: Wenhao Li, Guangyang Wu, Wenyi Wang, Peiran Ren, Xiaohong Liu
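for: Improve the quality of low-light video.
methods: Uses the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency, and designs a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement.
results: Experiments show that the method achieves state-of-the-art image quality and inter-frame brightness consistency, and processes 1080p video at more than 50 frames per second, 2× faster than SOTA CNN-based methods.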

Abstract
Low-Light Video Enhancement (LLVE) has received considerable attention in recent years. One of the critical requirements of LLVE is inter-frame brightness consistency, which is essential for maintaining the temporal coherence of the enhanced video. However, most existing single-image-based methods fail to address this issue, resulting in flickering effect that degrades the overall quality after enhancement. Moreover, 3D Convolution Neural Network (CNN)-based methods, which are designed for video to maintain inter-frame consistency, are computationally expensive, making them impractical for real-time applications. To address these issues, we propose an efficient pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency effectively. Specifically, we design a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement, which addresses the low-dynamic problem in low-light scenarios. This enables FastLLVE to perform low-latency and low-complexity enhancement operations while maintaining high-quality results. Experimental results on benchmark datasets demonstrate that our method achieves the State-Of-The-Art (SOTA) performance in terms of both image quality and inter-frame brightness consistency. More importantly, our FastLLVE can process 1,080p videos at 50+ Frames Per Second (FPS), which is 2× faster than SOTA CNN-based methods in inference time, making it a promising solution for real-time applications. The code is available at https://github.com/Wenhao-Li-777/FastLLVE.
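
Why a lookup table helps with both speed and inter-frame consistency can be seen in a toy sketch. It assumes a fixed one-dimensional gamma curve as a stand-in for the learned IA-LUT, which in the paper is learnable and intensity-aware; the function names and pixel values are hypothetical.

```python
def build_gamma_lut(gamma=0.5, levels=256):
    """Precompute a brightening curve once; a stand-in for a learned LUT."""
    return [round(255 * (v / (levels - 1)) ** gamma) for v in range(levels)]


def enhance_frame(frame, lut):
    """Enhance one frame by table lookup only: one read per pixel."""
    return [[lut[p] for p in row] for row in frame]


lut = build_gamma_lut()
frame_a = [[4, 16], [64, 100]]
frame_b = [[16, 4], [100, 64]]  # same pixel values, moved between frames
out_a = enhance_frame(frame_a, lut)
out_b = enhance_frame(frame_b, lut)

print(out_a, out_b)
```

Because the mapping depends only on the table, a pixel value is enhanced identically in every frame, so the enhancement step itself introduces no frame-to-frame flicker.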

Target before Shooting: Accurate Anomaly Detection and Localization under One Millisecond via Cascade Patch Retrieval

paper_url: http://arxiv.org/abs/2308.06748
repo_url: https://github.com/flyinghu123/cpr
paper_authors: Hanxi Li, Jianfei Hu, Bo Li, Hao Chen, Yongbin Zheng, Chunhua Shen
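for: Proposes a new anomaly detection framework that achieves both high detection accuracy and high running speed.
methods: The framework works coarse to fine. It first selects the training images most similar to the test image, then retrieves the nearest neighbor of each test patch at similar locations in those images using a trained local metric. Finally, the anomaly score of each test patch is computed from the distance to its local nearest neighbor and the non-background probability.
results: On the MVTec AD, BTAD, and MVTec-3D AD datasets, the method outperforms all compared methods by clear margins, combining high accuracy with low running time across anomaly detection tasks.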

Abstract
In this work, by re-examining the "matching" nature of Anomaly Detection (AD), we propose a new AD framework that simultaneously enjoys new records of AD accuracy and dramatically high running speed. In this framework, the anomaly detection problem is solved via a cascade patch retrieval procedure that retrieves the nearest neighbors for each test image patch in a coarse-to-fine fashion. Given a test sample, the top-K most similar training images are first selected based on a robust histogram matching process. Secondly, the nearest neighbor of each test patch is retrieved over the similar geometrical locations on those "global nearest neighbors", by using a carefully trained local metric. Finally, the anomaly score of each test image patch is calculated based on the distance to its "local nearest neighbor" and the "non-background" probability. The proposed method is termed "Cascade Patch Retrieval" (CPR) in this work. Different from the conventional patch-matching-based AD algorithms, CPR selects proper "targets" (reference images and locations) before "shooting" (patch-matching). On the well-acknowledged MVTec AD, BTAD and MVTec-3D AD datasets, the proposed algorithm consistently outperforms all the comparing SOTA methods by remarkable margins, measured by various AD metrics. Furthermore, CPR is extremely efficient. It runs at the speed of 113 FPS with the standard setting while its simplified version only requires less than 1 ms to process an image at the cost of a trivial accuracy drop. The code of CPR is available at https://github.com/flyinghu123/CPR.
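
The two retrieval stages can be sketched on toy data. This is a heavily simplified illustration under stated assumptions: each "image" is a flat list of scalar patch features, the global stage compares plain histograms with an L1 distance, the local stage compares patches at the same index with an absolute difference instead of a trained metric, and the non-background probability is omitted. All names are hypothetical.

```python
def histogram(image, bins=4, max_value=1.0):
    """Coarse global descriptor: normalized histogram of patch values."""
    counts = [0] * bins
    for v in image:
        counts[min(int(v / max_value * bins), bins - 1)] += 1
    return [c / len(image) for c in counts]


def top_k_references(test_image, train_images, k):
    """Stage 1: pick the k training images with the closest histograms."""
    h_test = histogram(test_image)

    def dist(img):
        return sum(abs(a - b) for a, b in zip(h_test, histogram(img)))

    return sorted(train_images, key=dist)[:k]


def anomaly_scores(test_image, references):
    """Stage 2: per-location distance to the nearest reference patch."""
    return [min(abs(v - ref[i]) for ref in references)
            for i, v in enumerate(test_image)]


train = [
    [0.1, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.2, 0.1],
    [0.9, 0.8, 0.9, 0.8],
]
test = [0.1, 0.2, 0.9, 0.2]  # the third patch deviates from similar references
refs = top_k_references(test, train, k=2)
scores = anomaly_scores(test, refs)

print(scores)
```

Selecting references first means the per-patch search only visits k images at matching locations, which is where the speed of a cascade comes from.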

Self-supervised Noise2noise Method Utilizing Corrupted Images with a Modular Network for LDCT Denoising

paper_url: http://arxiv.org/abs/2308.06746
repo_url: https://github.com/xyuan01/self-supervised-noise2noise-for-ldct
paper_authors: Yuting Zhu, Qiang He, Yudong Yao, Yueyang Teng
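for: The paper proposes a denoising method for low-dose computed tomography (LDCT) images that uses only LDCT data and needs no paired normal-dose data.
methods: The paper combines the self-supervised noise2noise model with the noisy-as-clean strategy. First, a second, similar type of noise is added to the LDCT images multiple times. Then only these secondary corrupted images are used for training. A modular U-Net structure with shared parameters performs the task, which increases the receptive field without increasing the parameter count.
results: Experiments on the Mayo LDCT dataset show that the proposed method is effective compared with state-of-the-art deep learning methods.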

Abstract
Deep learning is a very promising technique for low-dose computed tomography (LDCT) image denoising. However, traditional deep learning methods require paired noisy and clean datasets, which are often difficult to obtain. This paper proposes a new method for performing LDCT image denoising with only LDCT data, which means that normal-dose CT (NDCT) is not needed. We adopt a combination including the self-supervised noise2noise model and the noisy-as-clean strategy. First, we add a second yet similar type of noise to LDCT images multiple times. Note that we use LDCT images based on the noisy-as-clean strategy for corruption instead of NDCT images. Then, the noise2noise model is executed with only the secondary corrupted images for training. We select a modular U-Net structure from several candidates with shared parameters to perform the task, which increases the receptive field without increasing the parameter size. The experimental results obtained on the Mayo LDCT dataset show the effectiveness of the proposed method compared with that of state-of-the-art deep learning methods. The developed code is available at https://github.com/XYuan01/Self-supervised-Noise2Noise-for-LDCT.
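
The data preparation step described above can be sketched as follows. The sketch assumes zero-mean Gaussian noise as the "second yet similar type of noise" and forms every ordered pair of corrupted copies as a training example; the noise model, the pairing scheme, and all names are assumptions for illustration, not the released implementation.

```python
import random
from itertools import permutations


def corrupt(image, sigma, rng):
    """Add zero-mean Gaussian noise (an assumed stand-in for CT-like noise)."""
    return [p + rng.gauss(0.0, sigma) for p in image]


def make_training_pairs(ldct_image, n_copies=3, sigma=0.05, seed=0):
    """Build (input, target) pairs from re-corrupted copies of one LDCT image.

    Neither element of a pair is the LDCT image itself or a clean image.
    """
    rng = random.Random(seed)
    copies = [corrupt(ldct_image, sigma, rng) for _ in range(n_copies)]
    return list(permutations(copies, 2))


ldct_image = [0.2, 0.4, 0.6, 0.8]  # toy 1-D "image"
pairs = make_training_pairs(ldct_image)

print(len(pairs))
```

A denoising network would then be trained to map the first element of each pair to the second, so no normal-dose image is needed at any point.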

TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution

paper_url: http://arxiv.org/abs/2308.06743
repo_url: https://github.com/lenubolim/textdiff
paper_authors: Baolin Liu, Zongyuan Yang, Pengfei Wang, Junjie Zhou, Ziqi Liu, Ziyi Song, Yan Liu, Yongping Xiong
for: The paper aims to improve the readability and recognizability of scene text images by proposing a diffusion-based framework for scene text image super-resolution.
methods: The proposed method, TextDiff, consists of two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text, while the MRD sharpens the text edges by modeling the residuals between the ground-truth images and the initial deblurred images.
results: TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets and improves the readability of scene text images. The MRD module is also plug-and-play: it sharpens the text edges produced by SOTA methods without any additional joint training.

Abstract
The goal of scene text image super-resolution is to reconstruct high-resolution text-line images from unrecognizable low-resolution inputs. The existing methods relying on the optimization of pixel-level loss tend to yield text edges that exhibit a notable degree of blurring, thereby exerting a substantial impact on both the readability and recognizability of the text. To address these issues, we propose TextDiff, the first diffusion-based framework tailored for scene text image super-resolution. It contains two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text. The MRD is responsible for effectively sharpening the text edge by modeling the residuals between the ground-truth images and the initial deblurred images. Extensive experiments demonstrate that our TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets and can improve the readability of scene text images. Moreover, our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods. This enhancement not only improves the readability and recognizability of the results generated by SOTA methods but also does not require any additional joint training. Code is available at https://github.com/Lenubolim/TextDiff.
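
The residual formulation can be illustrated with simple arithmetic on a toy one-dimensional "image". The sketch only shows what the residual is and how adding it back restores the target; the diffusion model that predicts the residual is not included, and the way the mask gates the residual here is an assumption, as are the function names.

```python
def residual_target(ground_truth, initial):
    """What a residual model would learn to predict: ground truth minus initial estimate."""
    return [g - i for g, i in zip(ground_truth, initial)]


def apply_residual(initial, residual, mask):
    """Add the residual only where the mask marks text pixels (assumed gating)."""
    return [i + m * r for i, r, m in zip(initial, residual, mask)]


ground_truth = [0.0, 1.0, 1.0, 0.0]   # sharp text edge
initial = [0.25, 0.75, 0.5, 0.0]      # blurry initial estimate
mask = [1, 1, 1, 0]                   # text region

residual = residual_target(ground_truth, initial)
restored = apply_residual(initial, residual, mask)

print(residual, restored)
```

Modeling the residual keeps the learning target small and concentrated around the edges, which is where the initial estimate is least accurate.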

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

paper_url: http://arxiv.org/abs/2308.06739
repo_url: None
paper_authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou
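for: This work addresses a limitation of unsupervised visual representation learning: despite rapid progress, it requires training on large-scale datasets, which makes data collection costly and raises data-privacy concerns.
methods: We start by uncovering the annotation-free attention masks built into the cross-attention layers of diffusion models. We then examine three common unsupervised learning techniques (contrastive learning, masked modeling, and vision-language pretraining) and propose solutions tailored to exploit these free attention masks.
results: Extensive experiments show that the method improves baseline models across downstream tasks, including image classification, detection, segmentation, and image-text retrieval. It helps close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.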

Abstract
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to data-privacy concerns. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, unsupervised learning on diffusion-generated images has not been adequately explored. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.

Summary
Although unsupervised learning for visual representation has advanced rapidly, it still requires expensive data collection and raises additional data-privacy concerns. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for image recognition. Although promising, unsupervised learning on diffusion-generated images remains underexplored. To address this, we start by showing that the cross-attention layers of diffusion models inherently provide annotation-free attention masks aligned with the corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions that fully exploit these free attention masks. Extensive experiments show consistent improvements over baseline models across downstream tasks, including image classification, detection, segmentation, and image-text retrieval. With our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.

3D Scene Graph Prediction on Point Clouds Using Knowledge Graphs

paper_url: http://arxiv.org/abs/2308.06719
repo_url: None
paper_authors: Yiding Qiu, Henrik I. Christensen
for: scene graph prediction in 3D environments
methods: message-passing method with commonsense knowledge graphs
results: 15.0% improvement in scene graph prediction accuracy with external knowledge and 7.96% improvement with internal knowledge compared to state-of-the-art algorithms, plus real-world testing at 10 frames per second for scene graph generation.

Abstract
3D scene graph prediction is a task that aims to concurrently predict object classes and their relationships within a 3D environment. As these environments are primarily designed by and for humans, incorporating commonsense knowledge regarding objects and their relationships can significantly constrain and enhance the prediction of the scene graph. In this paper, we investigate the application of commonsense knowledge graphs for 3D scene graph prediction on point clouds of indoor scenes. Through experiments conducted on a real-world indoor dataset, we demonstrate that integrating commonsense knowledge via the message-passing method improves scene graph prediction accuracy by 15.0% with external knowledge and by 7.96% with internal knowledge when compared to state-of-the-art algorithms. We also tested the model in the real world, generating scene graphs at 10 frames per second, to show its use in a more realistic robotics setting.

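Summary
3D scene graph prediction is a task that aims to predict object classes and the relationships between them at the same time. Because these environments are mostly designed and used by humans, commonsense knowledge can significantly constrain and improve scene graph prediction. In this paper, we investigate the use of commonsense knowledge graphs for 3D scene graph prediction on point clouds of indoor scenes. Through experiments on a real-world indoor dataset, we show that integrating external commonsense knowledge via message passing improves scene graph prediction accuracy by 15.0% over state-of-the-art algorithms. We also tested scene graph generation in a real-world robotics setting to demonstrate the model in use.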

StairNetV3: Depth-aware Stair Modeling using Deep Learning

paper_url: http://arxiv.org/abs/2308.06715
repo_url: None
paper_authors: Chen Wang, Zhongcai Pei, Shuang Qiu, Yachun Wang, Zhiyong Tang
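for: The paper proposes a vision-based technique to help autonomous mobile robots climb stairs, especially in unfamiliar environments.
methods: The paper uses a depth-aware stair modeling method based on a convolutional neural network (CNN). It treats stair geometric feature extraction and depth image prediction as joint tasks and uses a purpose-built information propagation architecture so that depth information effectively supervises feature learning.
results: Experiments show a clear gain over the previous best monocular vision method (IoU improved by 3.4%), and the lightweight version is fast enough for most real-time applications.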

Abstract
Vision-based stair perception can help autonomous mobile robots deal with the challenge of climbing stairs, especially in unfamiliar environments. Current monocular vision methods struggle to model stairs accurately without depth information. To address this problem, this paper proposes a depth-aware stair modeling method for monocular vision. Specifically, we take the extraction of stair geometric features and the prediction of depth images as joint tasks in a convolutional neural network (CNN). With the designed information propagation architecture, depth information provides effective supervision for stair geometric feature learning. In addition, to complete the stair modeling, we take the convex lines, concave lines, tread surfaces and riser surfaces as stair geometric features and apply Gaussian kernels to enable the network to predict contextual information within the stair lines. Combined with the depth information obtained by depth sensors, we propose a stair point cloud reconstruction method that can quickly get point clouds belonging to the stair step surfaces. Experiments on our dataset show that our method has a significant improvement over the previous best monocular vision method, with an intersection over union (IoU) increase of 3.4%, and the lightweight version has a fast detection speed and can meet the requirements of most real-time applications. Our dataset is available at https://data.mendeley.com/datasets/6kffmjt7g2/1.
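The abstract reports its gain as an intersection over union (IoU) score. As a reminder of what that metric measures, here is a minimal sketch for binary masks stored as nested lists. The function name `iou` and the convention of returning 1.0 for two empty masks are assumptions made for this illustration, not details taken from the paper.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel set in both masks
            if p or t:
                union += 1         # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```

A score of 1.0 means the predicted and reference masks coincide exactly; the reported 3.4% gain is a difference in this kind of overlap score.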

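Summary
Vision-based perception helps autonomous mobile robots handle stairs, especially in unfamiliar environments. Current monocular vision methods struggle to model stairs accurately without depth information, so this paper proposes a depth-aware stair modeling method. Specifically, we treat the extraction of stair geometric features and the prediction of depth images as joint tasks in a convolutional neural network (CNN); the designed information propagation architecture lets depth information supervise the learning of stair geometric features. To complete the stair model, we take convex lines, concave lines, tread surfaces, and riser surfaces as the stair geometric features and apply Gaussian kernels so that the network can predict contextual information within the stair lines. Combined with depth information from depth sensors, we propose a point cloud reconstruction method that quickly obtains the point clouds of the stair step surfaces. Experiments show that our method improves IoU by 3.4% over the previous best monocular vision method, and that the lightweight version is fast enough for most real-time applications. Our dataset is available for download at the link given in the abstract.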

LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

paper_url: http://arxiv.org/abs/2308.06713
repo_url: None
paper_authors: Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin
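for: This study aims at high-quality complex scene generation, improving on existing diffusion models.
methods: The study proposes a semantically controllable layout-aware diffusion model (LAW-Diffusion). A built-in spatial dependency parser and a location-aware cross-object attention module yield scenes with regional semantic alignment and spatial coherence.
results: Compared with previous Layout-to-Image (L2I) methods, LAW-Diffusion generates scenes with more coherent object relations and spatial dependencies, and it supports instance reconfiguration.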

Abstract
Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to the sub-optimal results of complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from the previous Layout-to-Image generation (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. To be specific, we delicately instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that our LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.

Summary
Due to the rapid development of diffusion models, there have been unprecedented advances in image synthesis. Previous works mainly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, such as the layout configuration of a scene, leading to sub-optimal results of complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Unlike previous Layout-to-Image (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. Specifically, we instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We also propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Furthermore, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.

Compositional Feature Augmentation for Unbiased Scene Graph Generation

paper_url: http://arxiv.org/abs/2308.06712
repo_url: None
paper_authors: Lin Li, Guikun Chen, Jun Xiao, Yi Yang, Chunping Wang, Long Chen
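for: This work studies how to better detect the visual relation triplets <sub, pred, obj> in an image in order to improve Scene Graph Generation (SGG).
methods: The paper proposes Compositional Feature Augmentation (CFA), a strategy that increases the diversity of relation triplet features for each predicate and thereby makes SGG more robust. CFA decomposes each relation triplet feature into an intrinsic and an extrinsic component, then replaces or mixes these components with those of other samples.
results: Unlike existing re-balancing strategies, CFA increases the diversity of relation triplet features per predicate, which improves SGG performance. Extensive ablations show that CFA sets a new state of the art on the trade-off between different metrics.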

Abstract
Scene Graph Generation (SGG) aims to detect all the visual relation triplets in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
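The replace-or-mix step described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary layout, the names `mix` and `augment`, and the single mixing weight `lam` are hypothetical, and the abstract does not say how donor samples or mixing weights are chosen.

```python
def mix(a, b, lam):
    """Convex combination of two feature vectors; lam = 0.0 returns b unchanged."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment(triplet, donor, part="intrinsic", lam=0.0):
    """Return a copy of `triplet` whose `part` feature is replaced by (lam = 0.0)
    or mixed with (0 < lam < 1) the same part of `donor`.

    The predicate label and the untouched feature part are carried over as-is.
    """
    out = dict(triplet)  # shallow copy: only the chosen part gets a new list
    out[part] = mix(triplet[part], donor[part], lam)
    return out


# Hypothetical toy features; real triplet features would come from an SGG model.
sample = {"predicate": "riding", "intrinsic": [1.0, 3.0], "extrinsic": [0.5, 0.5]}
donor = {"predicate": "riding", "intrinsic": [3.0, 1.0], "extrinsic": [2.0, 4.0]}

replaced = augment(sample, donor, part="extrinsic")          # swap the context
mixed = augment(sample, donor, part="intrinsic", lam=0.5)    # blend the intrinsic part
```

Because the augmentation acts on features instead of on the sampling distribution, it is independent of the underlying SGG model, which matches the model-agnostic claim in the abstract.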


Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

paper_url: http://arxiv.org/abs/2308.06707
repo_url: https://github.com/oliverhxh/cag
paper_authors: Xiaohu Huang, Xinggang Wang, Zhidianqiu Jin, Bo Yang, Botao He, Bin Feng, Wenyu Liu
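for: This work aims to improve identification accuracy in skeleton-based gait recognition, using graph convolutional networks (GCNs) to extract features that distinguish individual walking styles across views.
methods: We propose a condition-adaptive graph (CAG) convolution network that adapts to the features and the view angle of each sequence. It comprises a joint-specific filter learning (JSFL) module and a view-adaptive topology learning (VATL) module. JSFL produces filters unique to each joint to capture fine-grained patterns; VATL generates view-adaptive graph topologies that are used to correlate the joints.
results: Experiments show that CAG outperforms all previous skeleton-based methods on CASIA-B and OU-MVLP, the two most widely used datasets. Combined with appearance-based methods, CAG supplies useful complementary information and further improves recognition.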

Abstract
Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information. The source code will be available at https://github.com/OliverHxh/CAG.

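Summary
Graph convolutional networks are widely used in skeleton-based gait recognition. A key challenge in this task is distinguishing the walking styles of different subjects across views. Existing state-of-the-art methods use fixed weights to extract features from different sequences and ignore the effect of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that dynamically adapts to each skeleton sequence and its view angle. Instead of fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module that produces sequence-adaptive filters. These filters capture fine-grained patterns unique to each joint and extract diverse spatial-temporal information. We also design a view-adaptive topology learning (VATL) module that generates view-adaptive graph topologies, which are used to correlate the joints according to the specific view conditions. CAG can therefore adapt to different walking styles and viewpoints at the same time. Experiments show that CAG outperforms all previous skeleton-based methods and that simply combining CAG with appearance-based methods further improves recognition performance. The code will be released on GitHub.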

Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

paper_url: http://arxiv.org/abs/2308.06693
repo_url: https://github.com/dlut-yyc/isomer
paper_authors: Yichen Yuan, Yifan Wang, Lijun Wang, Xiaoqi Zhao, Huchuan Lu, Yu Wang, Weibo Su, Lei Zhang
for: This paper targets zero-shot video object segmentation (ZVOS), i.e., segmenting the objects in a video without any manual annotation at test time.
methods: The paper proposes two Transformer variants, the Context-Sharing Transformer (CST) and the Semantic Gathering-Scattering Transformer (SGST), to improve both the accuracy and the computational efficiency of ZVOS.
results: The method achieves new state-of-the-art ZVOS performance and runs 13 times faster than the baseline that uses vanilla Transformers. Code is available at https://github.com/DLUT-yyc/Isomer.

Abstract
Recent leading zero-shot video object segmentation (ZVOS) works are devoted to integrating appearance and motion information by elaborately designing feature fusion modules and applying them identically in multiple feature stages. Our preliminary experiments show that, given the strong long-range dependency modeling capacity of the Transformer, simply concatenating the two modality features and feeding them to vanilla Transformers for feature fusion distinctly benefits performance, but at the cost of heavy computation. Through further empirical analysis, we find that the attention dependencies learned by the Transformer in different stages exhibit completely different properties: global query-independent dependency in the low-level stages and semantic-specific dependency in the high-level stages. Motivated by these observations, we propose two Transformer variants: i) the Context-Sharing Transformer (CST), which learns the globally shared contextual information within image frames with lightweight computation; and ii) the Semantic Gathering-Scattering Transformer (SGST), which models the semantic correlation separately for the foreground and background and reduces the computation cost with a soft token merging mechanism. We apply CST and SGST for low-level and high-level feature fusion, respectively, formulating a level-isomerous Transformer framework for the ZVOS task. Compared with the baseline that uses vanilla Transformers for multi-stage fusion, our method increases speed by 13 times and achieves new state-of-the-art ZVOS performance. Code is available at https://github.com/DLUT-yyc/Isomer.

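Summary
Recent leading zero-shot video object segmentation (ZVOS) methods focus on integrating appearance and motion information through carefully designed feature fusion modules. Our preliminary experiments show that Transformers, with their strong long-range dependency modeling, clearly improve performance, but at a high computational cost. Further empirical analysis shows that the attention dependencies learned at different Transformer stages have different properties: global, query-independent dependencies at the low-level stages and semantic-specific dependencies at the high-level stages. Motivated by these findings, we propose two Transformer variants: i) the Context-Sharing Transformer (CST), which learns globally shared contextual information within image frames at a lightweight computational cost; and ii) the Semantic Gathering-Scattering Transformer (SGST), which models semantic correlation for the foreground and background and reduces computation with a soft token merging mechanism. We apply CST and SGST to feature fusion at different stages, forming a level-isomerous Transformer framework. Compared with the baseline, our method is 13 times faster and sets a new state of the art for ZVOS. Code is available at https://github.com/DLUT-yyc/Isomer.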

SimMatchV2: Semi-Supervised Learning with Graph Consistency

paper_url: http://arxiv.org/abs/2308.06692
repo_url: https://github.com/mingkai-zheng/simmatchv2
paper_authors: Mingkai Zheng, Shan You, Lang Huang, Chen Luo, Fei Wang, Chen Qian, Chang Xu
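for: The paper proposes a new semi-supervised learning algorithm for semi-supervised image classification in computer vision.
methods: The algorithm draws on message passing and node classification from graph theory and proposes four types of consistency: node-node, node-edge, edge-edge, and edge-node consistency.
results: The algorithm is validated on multiple semi-supervised learning benchmarks. With a ResNet-50 backbone and 300 training epochs, SimMatchV2 reaches 71.9% and 76.2% Top-1 accuracy on ImageNet with 1% and 10% labeled examples, respectively, significantly outperforming previous methods and achieving state-of-the-art performance.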

Abstract
Semi-supervised image classification is one of the most fundamental problems in computer vision, and it significantly reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm, SimMatchV2, which formulates various consistency regularizations between labeled and unlabeled data from the graph perspective. In SimMatchV2, we regard the augmented view of a sample as a node, which consists of a label and its corresponding representation. Different nodes are connected with edges, which are measured by the similarity of the node representations. Inspired by message passing and node classification in graph theory, we propose four types of consistencies, namely 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also uncover that a simple feature normalization can reduce the gaps of the feature norm between different augmented views, significantly improving the performance of SimMatchV2. SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. Notably, with ResNet-50 as our backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 accuracy with 1% and 10% labeled examples on ImageNet, which significantly outperforms the previous methods and achieves state-of-the-art performance. Code and pre-trained models are available at https://github.com/mingkai-zheng/SimMatchV2.
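The two ideas in the abstract that lend themselves to a small example are the feature normalization and the similarity-based edges. The sketch below assumes L2 normalization, cosine similarity, and a softmax with a temperature; those specific choices, the function names, and the toy vectors are assumptions for illustration and are not confirmed by the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_edges(nodes):
    """Edge matrix: dot products between L2-normalized node representations."""
    normed = [l2_normalize(v) for v in nodes]
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def softmax(row, temperature=0.1):
    """Turn one row of similarities into a distribution over neighbouring nodes."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Two augmented views with the same direction but very different norms,
# plus an unrelated (orthogonal) sample.
weak_view = [3.0, 4.0]
strong_view = [0.3, 0.4]
other = [4.0, -3.0]

edges = similarity_edges([weak_view, strong_view, other])
weights = softmax(edges[0])  # note: this row includes the self-edge
```

After normalization the two views of the same sample have identical representations, so the edge between them is (up to rounding) 1, while the edge to the orthogonal sample is 0. This is the norm gap that the abstract says normalization removes.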

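Summary
Semi-supervised image classification is one of the most fundamental problems in computer vision and reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm, SimMatchV2, which formulates various consistency regularizations from a graph perspective. In SimMatchV2, we regard each augmented view of a sample as a node; nodes are connected by edges, which are measured by the similarity of the node representations. We propose four types of consistency: 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also find that a simple feature normalization reduces the gap in feature norm between different augmented views and thereby improves the performance of SimMatchV2. SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. With ResNet-50 as the backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 accuracy on ImageNet with 1% and 10% labeled examples, significantly outperforming previous methods and achieving state-of-the-art performance. Code and pre-trained models are available at https://github.com/mingkai-zheng/SimMatchV2.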

Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training

paper_url: http://arxiv.org/abs/2308.06689
repo_url: https://github.com/dravenalg/reste
paper_authors: Xiao-Ming Wu, Dian Zheng, Zuhao Liu, Wei-Shi Zheng
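for: The paper proposes a training method for binary neural networks (BNNs) that fully accounts for gradient stability.
methods: The paper introduces the Rectified Straight Through Estimator (ReSTE), a new estimator for BNN training that balances the estimating error against the gradient stability.
results: Experiments show that ReSTE performs excellently on the CIFAR-10 and ImageNet datasets and surpasses state-of-the-art methods without any auxiliary modules or losses.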

Abstract
Binarization of neural networks is a dominant paradigm in neural network compression. The pioneering work BinaryConnect uses the Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes a crucial inconsistency problem. Most previous methods design different estimators to replace STE and mitigate it. However, they ignore the fact that when the estimating error is reduced, the gradient stability decreases concomitantly. These highly divergent gradients harm model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective on BNN training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we first design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power-function-based estimator, the Rectified Straight Through Estimator (ReSTE for short). Compared to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on the CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.
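The trade-off described above can be made concrete with a scalar sketch. The abstract only says the estimator is based on a power function, so the specific form f(x) = sign(x)·|x|^(1/o) with a tunable o ≥ 1, the clipping threshold `clip`, and the cap `max_grad` are assumptions chosen for illustration; the names below are hypothetical and do not come from the authors' code.

```python
def sign(x):
    """Forward binarization used by BNNs: map a real value to -1 or +1."""
    return 1.0 if x >= 0.0 else -1.0


def ste_grad(x, clip=1.0):
    """Classic STE backward pass: identity gradient inside [-clip, clip], zero outside."""
    return 1.0 if abs(x) <= clip else 0.0


def power_surrogate(x, o):
    """Assumed surrogate f(x) = sign(x) * |x|**(1/o); o = 1 is the identity used by STE."""
    return sign(x) * abs(x) ** (1.0 / o)


def power_grad(x, o, clip=1.0, max_grad=10.0, eps=1e-8):
    """Gradient of the assumed surrogate, (1/o) * |x|**((1-o)/o).

    The gradient is zeroed outside [-clip, clip] and capped at max_grad,
    because for o > 1 it grows without bound as x approaches 0.
    """
    if abs(x) > clip:
        return 0.0
    grad = (1.0 / o) * max(abs(x), eps) ** ((1.0 - o) / o)
    return min(grad, max_grad)


x = 0.25
error_o1 = abs(sign(x) - power_surrogate(x, 1.0))  # gap between sign and surrogate, o = 1
error_o2 = abs(sign(x) - power_surrogate(x, 2.0))  # same gap with a larger o
```

In this sketch, raising `o` moves the surrogate closer to the sign function (smaller estimating error) while making the gradient spike near zero (lower gradient stability). With `o = 1` the gradient reduces to the plain STE gradient.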

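Summary
Binarization is a dominant approach to neural network compression. The pioneering work BinaryConnect uses the Straight Through Estimator (STE) to mimic the gradient of the sign function, but this also causes a significant inconsistency problem. Earlier methods design different estimators to mitigate it, but they overlook the fact that gradient stability decreases as the estimating error is reduced. These highly divergent gradients harm training and increase the risk of vanishing and exploding gradients. To fully account for gradient stability, we present a new perspective that views BNN training as an equilibrium between the estimating error and the gradient stability. From this perspective, we first design two indicators to quantify the equilibrium. Then, to balance the estimating error and the gradient stability, we revise the original straight through estimator and propose a power-function-based estimator, the Rectified Straight Through Estimator (ReSTE). Compared with other estimators, ReSTE is rational and balances the estimating error and the gradient stability well. Extensive experiments on CIFAR-10 and ImageNet show that ReSTE performs excellently and surpasses state-of-the-art methods without any auxiliary modules or losses.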

Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

paper_url: http://arxiv.org/abs/2308.06668
repo_url: https://github.com/jiajiali04/agriculture-foundation-models
paper_authors: Jiajia Li, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, Zhaojian Li
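for: This study explores the potential of foundation models (FMs) in smart agriculture.
methods: The study first reviews recent FMs and groups them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. It then outlines the process of developing agriculture FMs (AFMs) and their potential applications in smart agriculture.
results: The study presents FM-based smart agriculture systems as a new direction for AI in agriculture. Such systems can reduce the reliance on large labeled datasets and improve efficiency and generalization. The study also describes the unique challenges of developing AFMs, including model training, validation, and deployment.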

Abstract
The past decade has witnessed the rapid development of ML and DL methodologies in agricultural systems, showcased by great successes in a variety of agricultural applications. However, these conventional ML/DL models have certain limitations: they heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, foundation models (FMs) have demonstrated remarkable successes in language and vision tasks across various domains. These models are trained on a vast amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate the understanding of the problem space and uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agriculture FMs (AFMs) and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.

Summary
Over the past decade, machine learning (ML) and deep learning (DL) methods have developed rapidly in agricultural systems, with great success across a variety of agricultural applications. However, conventional ML/DL models have limitations: they require large, costly labeled datasets for training, require specialized expertise to develop and maintain, and are mostly tailored to specific tasks, so they lack generalizability. In recent years, foundation models (FMs) have achieved great success in language and vision tasks. These models are trained on vast amounts of data from multiple domains and modalities. Once trained, they can accomplish a variety of tasks with minor fine-tuning and minimal labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate the understanding of the problem space and uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agriculture FMs (AFMs) and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.

Polar Collision Grids: Effective Interaction Modelling for Pedestrian Trajectory Prediction in Shared Space Using Collision Checks

paper_url: http://arxiv.org/abs/2308.06654
repo_url: None
paper_authors: Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
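for: Predicting pedestrian trajectories is a key capability for the safe navigation of autonomous vehicles, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by vehicles and by other pedestrians, so better modelling of pedestrian-vehicle and pedestrian-pedestrian interactions can improve the accuracy of trajectory prediction models.
methods: We propose a heuristic-based method that selects interacting agents by computing collision risk. Focusing on agents that may collide with the target pedestrian, we encode the interaction effect using the time-to-collision and the approach direction angle of the two agents. We also introduce a new polar collision grid map to represent these interactions.
results: On the HBS dataset, our method predicts trajectories closer to the ground truth than the existing methods used as baselines.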

Abstract
Predicting pedestrians' trajectories is a crucial capability for autonomous vehicles' safe navigation, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by both the presence of vehicles and other pedestrians. Therefore, effectively modelling both pedestrian-pedestrian and pedestrian-vehicle interactions can increase the accuracy of the pedestrian trajectory prediction models. Despite the huge literature on ways to encode the effect of interacting agents on a pedestrian's predicted trajectory using deep-learning models, limited effort has been put into the effective selection of interacting agents. In the majority of cases, the interaction features used are mainly based on relative distances while paying less attention to the effect of the velocity and approaching direction in the interaction formulation. In this paper, we propose a heuristic-based process of selecting the interacting agents based on collision risk calculation. Focusing on interactions of potentially colliding agents with a target pedestrian, we propose the use of time-to-collision and the approach direction angle of two agents for encoding the interaction effect. This is done by introducing a novel polar collision grid map. Our results have shown predicted trajectories closer to the ground truth compared to existing methods (used as a baseline) on the HBS dataset.
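To make the two interaction features concrete, the sketch below computes a time-to-collision proxy and an approach angle for two agents moving at constant velocity, then bins them into a polar cell. It is an illustration under stated assumptions and not the authors' formulation: the constant-velocity model, the use of closest-approach time as the time-to-collision, the collision `radius`, the time `horizon`, and the grid resolution are all hypothetical choices.

```python
import math


def closest_approach(p, vp, q, vq):
    """Return (time, distance) of closest approach for two constant-velocity agents.

    Returns (None, current_distance) when the agents are not closing in on each other.
    """
    rx, ry = q[0] - p[0], q[1] - p[1]        # relative position
    vx, vy = vq[0] - vp[0], vq[1] - vp[1]    # relative velocity
    speed_sq = vx * vx + vy * vy
    closing = -(rx * vx + ry * vy)           # positive when the gap is shrinking
    if speed_sq == 0.0 or closing <= 0.0:
        return None, math.hypot(rx, ry)
    t = closing / speed_sq
    return t, math.hypot(rx + vx * t, ry + vy * t)


def approach_angle(p, vp, q):
    """Bearing of agent q relative to the pedestrian's heading, wrapped to [-pi, pi)."""
    bearing = math.atan2(q[1] - p[1], q[0] - p[0])
    heading = math.atan2(vp[1], vp[0])
    return (bearing - heading + math.pi) % (2.0 * math.pi) - math.pi


def polar_cell(p, vp, q, vq, radius=1.0, horizon=8.0, n_angle=8, n_time=4):
    """Map an agent to an (angle_bin, time_bin) cell, or None if it poses no collision risk."""
    t, miss_distance = closest_approach(p, vp, q, vq)
    if t is None or t >= horizon or miss_distance > radius:
        return None  # agent is filtered out of the interaction set
    angle = approach_angle(p, vp, q)
    angle_bin = int((angle + math.pi) / (2.0 * math.pi) * n_angle) % n_angle
    time_bin = int(t / horizon * n_time)
    return angle_bin, time_bin


# Pedestrian at the origin walking along +x; a vehicle approaches head-on from x = 10.
cell = polar_cell((0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (-1.0, 0.0))
```

Agents that are moving away, or that would pass farther than `radius` from the pedestrian, map to `None`. This mirrors the idea in the abstract of selecting interacting agents by collision risk, not by relative distance alone.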

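Summary
Predicting pedestrian trajectories is a key capability for the safe navigation of autonomous vehicles, especially in spaces shared with pedestrians, where pedestrian motion is influenced by both vehicles and other pedestrians. Accurately modelling pedestrian-pedestrian and pedestrian-vehicle interactions can therefore improve trajectory prediction. Although a large body of work uses deep-learning models to encode the effect of interacting agents, the selection of those agents has received little attention, and most interaction features rely on relative distance while neglecting velocity and approach direction. This paper proposes a heuristic rule that selects interacting agents based on collision risk, and encodes the interaction through the time-to-collision and approach direction angle of potentially colliding agents using a novel polar collision grid map. On the HBS dataset, the predicted trajectories are closer to the ground truth than those of the baseline methods.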
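
To make the time-to-collision and approach-angle encoding in this entry more concrete, here is a minimal sketch. It is not the authors' implementation: the constant-velocity motion model, the collision radius, the definition of the approach angle as the neighbour's bearing in the target pedestrian's heading frame, and the grid sizes (`ttc_max`, `n_rings`, `n_sectors`) are all illustrative assumptions, and the function names are hypothetical.

```python
import math


def time_to_collision(p_rel, v_rel, radius):
    """Earliest t >= 0 at which two constant-velocity agents come within
    `radius` of each other, or None if they never do (assumed motion model).

    p_rel: neighbour position minus target position, (x, y)
    v_rel: neighbour velocity minus target velocity, (vx, vy)
    """
    px, py = p_rel
    vx, vy = v_rel
    c = px * px + py * py - radius * radius
    if c <= 0.0:
        return 0.0  # already closer than the collision radius
    a = vx * vx + vy * vy
    if a == 0.0:
        return None  # no relative motion, distance never changes
    b = 2.0 * (px * vx + py * vy)
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # paths never come within `radius`
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    return t if t >= 0.0 else None  # negative root: agents are moving apart


def approach_angle(p_rel, target_heading):
    """Bearing of the neighbour in the target's heading frame, in [0, 2*pi)."""
    return (math.atan2(p_rel[1], p_rel[0]) - target_heading) % (2.0 * math.pi)


def polar_cell(ttc, angle, ttc_max=4.0, n_rings=4, n_sectors=8):
    """Map (time-to-collision, angle) to a (ring, sector) cell of a polar grid.

    Returns None for agents with no predicted collision within `ttc_max`,
    which is how this sketch filters out non-interacting agents.
    """
    if ttc is None or ttc >= ttc_max:
        return None
    ring = int(ttc / ttc_max * n_rings)
    sector = int(angle / (2.0 * math.pi) * n_sectors) % n_sectors
    return ring, sector
```

In this sketch the radial axis of the grid is time rather than distance, so a fast agent that is far away can land in the same ring as a slow agent that is close, which is the kind of velocity-aware interaction feature the abstract argues for.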

Advances in Self-Supervised Learning for Synthetic Aperture Sonar Data Processing, Classification, and Pattern Recognition

paper_url: http://arxiv.org/abs/2308.11633
repo_url: None
paper_authors: Brandon Sheffield, Frank E. Bobe III, Bradley Marchand, Matthew S. Emigh
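for: Improving the accuracy and efficiency of underwater exploration technology.
methods: Self-supervised learning (SSL) is applied to SAS data processing, classification, and pattern recognition.
results: Experiments show that MoCo-SAS significantly outperforms traditional supervised learning methods in terms of F1-score, indicating that SSL has promising applications in SAS data processing.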

Abstract
Synthetic Aperture Sonar (SAS) imaging has become a crucial technology for underwater exploration because of its unique ability to maintain resolution at increasing ranges, a characteristic absent in conventional sonar techniques. However, the effective application of deep learning to SAS data processing is often limited due to the scarcity of labeled data. To address this challenge, this paper proposes MoCo-SAS that leverages self-supervised learning (SSL) for SAS data processing, classification, and pattern recognition. The experimental results demonstrate that MoCo-SAS significantly outperforms traditional supervised learning methods, as evidenced by significant improvements observed in terms of the F1-score. These findings highlight the potential of SSL in advancing the state-of-the-art in SAS data processing, offering promising avenues for enhanced underwater object detection and classification.

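Summary
Synthetic Aperture Sonar (SAS) imaging has become very important for underwater exploration because it maintains resolution as range increases, a property that conventional sonar techniques lack. However, applying deep learning to SAS data processing is often limited by the scarcity of labeled data. To address this challenge, the paper proposes MoCo-SAS, which uses self-supervised learning (SSL) for SAS data processing, classification, and pattern recognition. Experiments show that MoCo-SAS improves significantly on traditional supervised learning methods in terms of F1-score. These findings indicate the potential of SSL for SAS data processing and for better underwater object detection and classification.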
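
The abstract does not describe the training objective, but the name MoCo-SAS suggests a momentum-contrast (MoCo) style setup. Under that assumption, the sketch below shows the two generic ingredients of MoCo: the InfoNCE contrastive loss for one query embedding, and the momentum update of the key encoder's parameters. The function names, the temperature, and the momentum value are illustrative and are not taken from the paper.

```python
import math


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: low when the query is closer to its
    positive key than to any negative key (embeddings as plain lists)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 holds the positive logit, the rest are negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder's parameters slowly towards the query encoder's."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

In a MoCo-style pipeline the query and positive key would be embeddings of two augmented views of the same unlabeled sonar image, which is what removes the need for labels during pre-training.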

3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

paper_url: http://arxiv.org/abs/2308.06635
repo_url: https://github.com/dsx0511/3dmotformer
paper_authors: Shuxiao Ding, Eike Rehder, Lukas Schneider, Marius Cordts, Juergen Gall
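for: The paper proposes a learned, transformer-based method for 3D multi-object tracking (3D MOT) to make tracking for autonomous vehicles more accurate and reliable.
methods: An Edge-Augmented Graph Transformer reasons frame by frame on the track-detection bipartite graph and performs data association via edge classification. For training, the authors propose a novel online training strategy with an autoregressive and recurrent forward pass and sequential batch optimization.
results: Using CenterPoint detections, the method achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively, and a trained 3DMOTFormer model generalizes well across different object detectors.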

Abstract
Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.

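Summary
Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, since it enables more reliable downstream tasks such as trajectory prediction and motion planning. Following the substantial recent progress in object detection, the tracking-by-detection paradigm has become popular for its simplicity and efficiency. Current 3D multi-object tracking (MOT) methods typically rely on non-learned, model-based algorithms such as the Kalman filter, which require many manually tuned parameters. Learning-based approaches, on the other hand, struggle to adapt training to the online setting, which leads to a distribution mismatch between training and inference and to suboptimal performance. This work proposes 3DMOTFormer, a learned geometry-based 3D MOT framework built on the transformer architecture. An Edge-Augmented Graph Transformer reasons on the track-detection bipartite graph frame by frame and performs data association via edge classification. To reduce the mismatch between training and inference, the authors propose a novel online training strategy with an autoregressive and recurrent forward pass and sequential batch optimization. Using CenterPoint detections, the method achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively, and a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at https://github.com/dsx0511/3DMOTFormer.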
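
As a rough illustration of data association via edge classification, the sketch below takes per-edge scores on a track-detection bipartite graph and turns them into one-to-one matches. The greedy matching rule and the 0.5 threshold are assumptions made for this example; the abstract does not state how 3DMOTFormer converts edge predictions into assignments, and `associate` is a hypothetical helper name.

```python
def associate(edge_scores, threshold=0.5):
    """Greedy one-to-one matching from classified edges.

    edge_scores: dict mapping (track_id, detection_id) to the predicted
    probability that the detection belongs to the track.
    Returns a dict track_id -> detection_id.
    """
    matches = {}
    used_detections = set()
    # Visit edges from most to least confident.
    ranked = sorted(edge_scores.items(), key=lambda item: item[1], reverse=True)
    for (track, detection), score in ranked:
        if score < threshold:
            break  # every remaining edge is classified as "no match"
        if track in matches or detection in used_detections:
            continue  # keep the assignment one-to-one
        matches[track] = detection
        used_detections.add(detection)
    return matches


# Example: two tracks compete for detection d1; t3/d3 falls below the threshold.
scores = {
    ("t1", "d1"): 0.9, ("t1", "d2"): 0.6,
    ("t2", "d1"): 0.8, ("t2", "d2"): 0.7,
    ("t3", "d3"): 0.2,
}
result = associate(scores)
```

Detections left unmatched after this step would typically start new tracks, and tracks left unmatched would be kept alive for a few frames or terminated, although those policies are outside what the abstract specifies.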

Fusion-GRU: A Deep Learning Model for Future Bounding Box Prediction of Traffic Agents in Risky Driving Videos

paper_url: http://arxiv.org/abs/2308.06628
repo_url: None
paper_authors: Muhammad Monjurul Karim, Ruwen Qin, Yinhai Wang
for: Predicting the future bounding boxes of surrounding traffic agents so that autonomous vehicles and advanced driving assistance systems can navigate complex traffic scenarios safely and efficiently.
methods: The paper proposes Fusion-GRU, a novel encoder-decoder architecture that accounts for the mutual and complex interactions among input features and uses an intermediary estimator coupled with a self-attention aggregation layer to learn sequential dependencies for long-range prediction.
results: Experiments show that Fusion-GRU effectively predicts the future bounding boxes of traffic agents and performs well on two public datasets, ROL and HEV-I.

Abstract
To ensure the safe and efficient navigation of autonomous vehicles and advanced driving assistance systems in complex traffic scenarios, predicting the future bounding boxes of surrounding traffic agents is crucial. However, simultaneously predicting the future location and scale of target traffic agents from the egocentric view poses challenges due to the vehicle's egomotion causing considerable field-of-view changes. Moreover, in anomalous or risky situations, tracking loss or abrupt motion changes limit the available observation time, requiring learning of cues within a short time window. Existing methods typically use a simple concatenation operation to combine different cues, overlooking their dynamics over time. To address this, this paper introduces the Fusion-Gated Recurrent Unit (Fusion-GRU) network, a novel encoder-decoder architecture for future bounding box localization. Unlike traditional GRUs, Fusion-GRU accounts for mutual and complex interactions among input features. Moreover, an intermediary estimator coupled with a self-attention aggregation layer is also introduced to learn sequential dependencies for long range prediction. Finally, a GRU decoder is employed to predict the future bounding boxes. The proposed method is evaluated on two publicly available datasets, ROL and HEV-I. The experimental results showcase the promising performance of the Fusion-GRU, demonstrating its effectiveness in predicting future bounding boxes of traffic agents.

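Summary
Predicting the future bounding boxes of surrounding traffic agents is key to the safe and efficient navigation of autonomous vehicles and advanced driving assistance systems in complex traffic scenarios. However, simultaneously predicting the future location and scale of target agents from the egocentric view is difficult, because the vehicle's egomotion causes considerable field-of-view changes. Moreover, in anomalous or risky situations, tracking loss or abrupt motion changes limit the available observation time, so cues must be learned within a short time window. Existing methods typically combine different cues with a simple concatenation operation and ignore their dynamics over time. To address this, the paper introduces the Fusion-Gated Recurrent Unit (Fusion-GRU) network, a novel encoder-decoder architecture for future bounding box localization. Unlike traditional GRUs, Fusion-GRU accounts for the mutual and complex interactions among input features. An intermediary estimator coupled with a self-attention aggregation layer is also introduced to learn sequential dependencies for long-range prediction, and a GRU decoder predicts the future bounding boxes. The method is evaluated on two publicly available datasets, ROL and HEV-I, where the results show promising performance and demonstrate its effectiveness in predicting the future bounding boxes of traffic agents.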

ADRMX: Additive Disentanglement of Domain Features with Remix Loss

paper_url: http://arxiv.org/abs/2308.06624
repo_url: https://github.com/berkerdemirel/ADRMX
paper_authors: Berker Demirel, Erchan Aptoula, Huseyin Ozkan
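for: The study addresses domain generalization: given multiple source domains, learn features that generalize to unseen domains and reduce the impact of inter-domain distribution shifts.
methods: It uses a novel additive disentanglement architecture that combines domain-variant features with domain-invariant ones, so that domain-specific characteristics are also captured. It also introduces a new data augmentation technique that mixes samples from different domains in the latent space to further support generalization.
results: In extensive DomainBed experiments under fair conditions, ADRMX achieves state-of-the-art performance. The code is to be released on GitHub.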

Abstract
The common assumption that train and test sets follow similar distributions is often violated in deployment settings. Given multiple source domains, domain generalization aims to create robust models capable of generalizing to new unseen domains. To this end, most of existing studies focus on extracting domain invariant features across the available source domains in order to mitigate the effects of inter-domain distributional changes. However, this approach may limit the model's generalization capacity by relying solely on finding common features among the source domains. It overlooks the potential presence of domain-specific characteristics that could be prevalent in a subset of domains, potentially containing valuable information. In this work, a novel architecture named Additive Disentanglement of Domain Features with Remix Loss (ADRMX) is presented, which addresses this limitation by incorporating domain variant features together with the domain invariant ones using an original additive disentanglement strategy. Moreover, a new data augmentation technique is introduced to further support the generalization capacity of ADRMX, where samples from different domains are mixed within the latent space. Through extensive experiments conducted on DomainBed under fair conditions, ADRMX is shown to achieve state-of-the-art performance. Code will be made available at GitHub after the revision process.

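Summary
The common assumption that training and test sets follow similar distributions is often violated in deployment. Given multiple source domains, domain generalization aims to build robust models that generalize to new, unseen domains. To this end, most existing studies extract domain-invariant features across the available source domains to mitigate the effects of inter-domain distribution shifts. However, relying solely on features common to the source domains may limit the model's generalization capacity, because it overlooks domain-specific characteristics that may carry valuable information in a subset of domains. This work presents a novel architecture, Additive Disentanglement of Domain Features with Remix Loss (ADRMX), which addresses this limitation by adding domain-variant features to the domain-invariant ones. A new data augmentation technique is also introduced, in which samples from different domains are mixed in the latent space. In extensive experiments on DomainBed under fair conditions, ADRMX achieves state-of-the-art performance. The code will be made available on GitHub after the revision process.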
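
The abstract gives only a high-level picture of the additive disentanglement and latent-space mixing, so the following is a loose sketch rather than the ADRMX implementation. It assumes that each sample's latent code is the element-wise sum of a domain-invariant part and a domain-variant part, and that a "remixed" latent pairs the invariant part of one sample with the variant part of a sample from another domain. The dictionary layout and the names `combine` and `remix` are hypothetical.

```python
def combine(invariant, variant):
    """Additive composition: latent = invariant features + variant features."""
    return [i + v for i, v in zip(invariant, variant)]


def remix(sample_a, sample_b):
    """Build a new latent from sample_a's invariant part and sample_b's
    variant part (sample_b is assumed to come from a different domain)."""
    return combine(sample_a["invariant"], sample_b["variant"])


# Toy features for two samples from different domains.
photo = {"invariant": [1.0, 2.0], "variant": [0.5, -0.5]}
sketch = {"invariant": [3.0, 0.0], "variant": [-1.0, 1.0]}

original_latent = combine(photo["invariant"], photo["variant"])
remixed_latent = remix(photo, sketch)
```

Because the composition is additive, swapping the variant part changes the latent code without touching the invariant part, which is one plausible way a classifier could be encouraged to stay correct under a domain change.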

Polyp-SAM++: Can A Text Guided SAM Perform Better for Polyp Segmentation?

paper_url: http://arxiv.org/abs/2308.06623
repo_url: https://github.com/RisabBiswas/Polyp-SAM-PlusPlus
paper_authors: Risab Biswas
for: The paper is written for the task of polyp segmentation in medical images, with the goal of improving the accuracy and robustness of the segmentation process.
methods: The paper uses the Segment Anything Model (SAM) as the base model for polyp segmentation, and incorporates text prompting to guide the segmentation process.
results: The paper evaluates the performance of the text-guided SAM on benchmark datasets and compares the results with unprompted SAM. The results show that the text-guided SAM achieves better segmentation accuracy and robustness than unprompted SAM.

Abstract
Meta recently released SAM (Segment Anything Model) which is a general-purpose segmentation model. SAM has shown promising results in a wide variety of segmentation tasks including medical image segmentation. In the field of medical image segmentation, polyp segmentation holds a position of high importance, thus creating a model which is robust and precise is quite challenging. Polyp segmentation is a fundamental task to ensure better diagnosis and cure of colorectal cancer. As such in this study, we will see how Polyp-SAM++, a text prompt-aided SAM, can better utilize a SAM using text prompting for robust and more precise polyp segmentation. We will evaluate the performance of a text-guided SAM on the polyp segmentation task on benchmark datasets. We will also compare the results of text-guided SAM vs unprompted SAM. With this study, we hope to advance the field of polyp segmentation and inspire more, intriguing research. The code and other details will be made publically available soon at https://github.com/RisabBiswas/Polyp-SAM++.

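Summary
Meta recently released SAM (Segment Anything Model), a general-purpose segmentation model. SAM has shown promising results in a wide variety of segmentation tasks, including medical image segmentation. Within medical image segmentation, polyp segmentation is highly important, and building a robust and precise model for it is challenging. Polyp segmentation is a fundamental task for the diagnosis and treatment of colorectal cancer. This study examines how text prompting can help SAM segment polyps more robustly and precisely. The text-guided SAM is evaluated on benchmark datasets and compared with unprompted SAM. The authors hope the study will advance the field of polyp segmentation and inspire further research. The code and other details will be made publicly available at https://github.com/RisabBiswas/Polyp-SAM++.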

DFM-X: Augmentation by Leveraging Prior Knowledge of Shortcut Learning

paper_url: http://arxiv.org/abs/2308.06622
repo_url: https://github.com/nis-research/dfmx-augmentation
paper_authors: Shunxin Wang, Christoph Brune, Raymond Veldhuis, Nicola Strisciuglio
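for: Improving the generalization and robustness of models by preventing neural networks from learning superficial statistical shortcuts in the data.
methods: A data augmentation strategy, DFM-X, uses the Dominant Frequencies Maps (DFMs) computed for image classification models to keep networks from learning frequency shortcuts.
results: Experiments show that DFM-X improves robustness against common corruptions and adversarial attacks, and that it can be combined easily with other augmentation techniques for further gains.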

Abstract
Neural networks are prone to learn easy solutions from superficial statistics in the data, namely shortcut learning, which impairs generalization and robustness of models. We propose a data augmentation strategy, named DFM-X, that leverages knowledge about frequency shortcuts, encoded in Dominant Frequencies Maps computed for image classification models. We randomly select X% training images of certain classes for augmentation, and process them by retaining the frequencies included in the DFMs of other classes. This strategy compels the models to leverage a broader range of frequencies for classification, rather than relying on specific frequency sets. Thus, the models learn more deep and task-related semantics compared to their counterpart trained with standard setups. Unlike other commonly used augmentation techniques which focus on increasing the visual variations of training data, our method targets exploiting the original data efficiently, by distilling prior knowledge about destructive learning behavior of models from data. Our experimental results demonstrate that DFM-X improves robustness against common corruptions and adversarial attacks. It can be seamlessly integrated with other augmentation techniques to further enhance the robustness of models.

Summary
We use Dominant Frequencies Maps (DFMs) to identify the frequency shortcuts that image classification models are prone to learning. We then select a percentage of training images from certain classes and process them by retaining the frequencies included in the DFMs of other classes. This forces the models to use a broader range of frequencies for classification, rather than relying on specific frequency sets. Unlike other augmentation techniques that focus on increasing visual variations in the training data, DFM-X targets the efficient use of the original data by leveraging prior knowledge about the destructive learning behavior of models. Our experimental results show that DFM-X improves the robustness of models against common corruptions and adversarial attacks. It can be easily integrated with other augmentation techniques to further enhance the robustness of models.
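
The frequency-retention step described in this entry can be sketched with a 2D FFT. This is an illustration under stated assumptions, not the released DFM-X code: images are taken to be single-channel arrays, each DFM is assumed to be a binary mask in centred (`fftshift`) layout with the same shape as the image, and the function names, the `fraction` parameter (standing in for X%), and the random choice of the other class are hypothetical.

```python
import numpy as np


def apply_dfm(image, dfm_mask):
    """Keep only the frequencies where `dfm_mask` is 1 (centred layout)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = spectrum * dfm_mask
    # A mask that is not point-symmetric gives a complex result,
    # so only the real part is kept.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


def dfmx_augment(images, labels, dfms, fraction, rng):
    """Filter a random `fraction` of images with the DFM of another class.

    images: array of shape (N, H, W); labels: length-N sequence of class ids;
    dfms: dict mapping class id -> binary mask of shape (H, W).
    """
    augmented = np.array(images, dtype=float)  # work on a copy
    n_pick = int(fraction * len(augmented))
    for i in rng.choice(len(augmented), size=n_pick, replace=False):
        other_classes = [c for c in dfms if c != labels[i]]
        other = other_classes[rng.integers(len(other_classes))]
        augmented[i] = apply_dfm(augmented[i], dfms[other])
    return augmented
```

A typical call would pass `np.random.default_rng(seed)` as `rng` so that the selection of images and classes is reproducible across training runs.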

LadleNet: Translating Thermal Infrared Images to Visible Light Images Using A Scalable Two-stage U-Net

paper_url: http://arxiv.org/abs/2308.06603
repo_url: https://github.com/ach-1914/ladlenet
paper_authors: Tonghui Zou
for: The paper aims to translate thermal infrared (TIR) images into visible light (VI) images, with applications in areas such as TIR-VI image registration and fusion.
methods: The paper uses LadleNet, an algorithm based on the U-Net architecture that comprises a 'Handle' module and a 'Bowl' module. The Handle module constructs an abstract semantic space, and the Bowl module decodes that semantic space to generate the mapped VI image. The Handle module can be replaced with a semantic segmentation network to improve model performance.
results: Comparative experiments show that the approach achieves state-of-the-art performance in image clarity and perceptual quality compared with existing methods.

Abstract
The translation of thermal infrared (TIR) images to visible light (VI) images presents a challenging task with potential applications spanning various domains such as TIR-VI image registration and fusion. Leveraging supplementary information derived from TIR image conversions can significantly enhance model performance and generalization across these applications. However, prevailing issues within this field include suboptimal image fidelity and limited model scalability. In this paper, we introduce an algorithm, LadleNet, based on the U-Net architecture. LadleNet employs a two-stage U-Net concatenation structure, augmented with skip connections and refined feature aggregation techniques, resulting in a substantial enhancement in model performance. Comprising 'Handle' and 'Bowl' modules, LadleNet's Handle module facilitates the construction of an abstract semantic space, while the Bowl module decodes this semantic space to yield mapped VI images. The Handle module exhibits extensibility by allowing the substitution of its network architecture with semantic segmentation networks, thereby establishing more abstract semantic spaces to bolster model performance. Consequently, we propose LadleNet+, which replaces LadleNet's Handle module with the pre-trained DeepLabv3+ network, thereby endowing the model with enhanced semantic space construction capabilities. The proposed method is evaluated and tested on the KAIST dataset, accompanied by quantitative and qualitative analyses. Compared to existing methodologies, our approach achieves state-of-the-art performance in terms of image clarity and perceptual quality. The source code will be made available at https://github.com/Ach-1914/LadleNet/tree/main/.

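Summary
Translating thermal infrared (TIR) images into visible light (VI) images is a challenging task with applications such as TIR-VI image registration and fusion. Supplementary information derived from TIR image conversion can significantly enhance model performance and generalization in these applications. However, the field still suffers from suboptimal image fidelity and limited model scalability. This paper introduces LadleNet, an algorithm based on the U-Net architecture. LadleNet uses a two-stage U-Net concatenation structure with skip connections and refined feature aggregation, which substantially improves model performance. It consists of a 'Handle' module, which constructs an abstract semantic space, and a 'Bowl' module, which decodes that semantic space into the mapped VI image. The Handle module is extensible: its network can be replaced with a semantic segmentation network to build more abstract semantic spaces and improve performance. Accordingly, the authors propose LadleNet+, which replaces the Handle module with the pre-trained DeepLabv3+ network. The method is evaluated on the KAIST dataset with quantitative and qualitative analyses and achieves state-of-the-art performance in image clarity and perceptual quality compared with existing methods. The source code will be made available at https://github.com/Ach-1914/LadleNet/tree/main/.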