cs.CV - 2023-08-04

A Bi-variant Variational Model for Diffeomorphic Image Registration with Relaxed Jacobian Determinant Constraints

  • paper_url: http://arxiv.org/abs/2308.02393
  • repo_url: None
  • paper_authors: Yanyan Li, Ke Chen, Chong Chen, Jianping Zhang
  • for: This paper proposes a novel bi-variant diffeomorphic image registration model for handling registration problems with relatively large local deformations.
  • methods: A soft constraint on the Jacobian equation controls local shrinkage and growth of the deformation, and a regularizer together with a positivity constraint on the relaxation function prevents folding and keeps the deformation smooth.
  • results: Numerical experiments show that the proposed algorithm is convergent and that the positivity constraint controls the range of relative volume without compromising registration accuracy; the model produces diffeomorphic maps for large deformations and outperforms several existing registration models.
    Abstract Diffeomorphic registration has become a powerful approach for seeking a smooth and invertible spatial transformation between two coordinate systems which have been measured via the template and reference images. While the pointwise volume-preserving constraint is effective for some problems, it is too stringent for many others, especially when the local deformations are relatively large, because it may lead to a poor large-deformation result when enforcing local matching. In this paper, we propose a novel bi-variant diffeomorphic image registration model with a soft constraint on the Jacobian equation, which allows local deformations to shrink and grow within a flexible range. The Jacobian determinant of the transformation is explicitly controlled by optimizing the relaxation function. To prevent deformation folding and enhance the smoothness of the deformation, we not only impose a positivity constraint when optimizing the relaxation function, but also employ a regularizer to ensure its smoothness. Furthermore, the positivity constraint keeps the relaxation function as close to one as possible, which helps to obtain a volume-preserving transformation on average. We further analyze the existence of the minimizer for the variational model and propose a penalty splitting method with a multilevel strategy to solve it. Numerical experiments show that the proposed algorithm is convergent, and that the positivity constraint can control the range of relative volume without compromising registration accuracy. Moreover, the proposed model produces diffeomorphic maps for large deformations, and achieves better performance than several existing registration models.
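
To make the relaxation idea concrete, here is a minimal sketch (not taken verbatim from the paper) of how a soft Jacobian-equation constraint with a positive relaxation function $f$ could be written; the exact energy terms, weights, and regularizers the authors use may differ:

$$
\min_{\boldsymbol{\varphi},\, f}\;
  \mathcal{D}\big(T\circ\boldsymbol{\varphi},\, R\big)
  + \alpha\,\mathcal{R}(\boldsymbol{\varphi})
  + \beta \int_{\Omega} \big(\det\nabla\boldsymbol{\varphi}(x) - f(x)\big)^{2}\,\mathrm{d}x
  + \gamma\,\mathcal{S}(f)
  \quad \text{s.t.}\; f(x) > 0 ,
$$

where $\mathcal{D}$ measures dissimilarity between the warped template $T\circ\boldsymbol{\varphi}$ and the reference $R$, $\mathcal{R}$ enforces smoothness of the transformation, and $\mathcal{S}$ keeps the relaxation function smooth; the positivity constraint on $f$ discourages folding and, together with the regularizer, keeps $\det\nabla\boldsymbol{\varphi}$ close to one on average.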

Color Image Recovery Using Generalized Matrix Completion over Higher-Order Finite Dimensional Algebra

  • paper_url: http://arxiv.org/abs/2308.02621
  • repo_url: None
  • paper_authors: Liang Liao, Zhuang Guo, Qi Gao, Yan Wang, Fajun Yu, Qifeng Zhao, Stephen John Maybank
  • for: High-accuracy recovery of the missing entries of color images.
  • methods: A generalized higher-order "t-matrix" model that extends the traditional second-order matrix model and incorporates a pixel neighborhood expansion strategy to characterize local pixel constraints.
  • results: Extensive experiments on simulated data and publicly available images show that the generalized matrix completion model and its algorithms compare favorably with their lower-order tensor and conventional matrix counterparts.
    Abstract To improve the accuracy of color image completion with missing entries, we present a recovery method based on generalized higher-order scalars. We extend the traditional second-order matrix model to a more comprehensive higher-order matrix equivalent, called the "t-matrix" model, which incorporates a pixel neighborhood expansion strategy to characterize the local pixel constraints. This "t-matrix" model is then used to extend some commonly used matrix and tensor completion algorithms to their higher-order versions. We perform extensive experiments on various algorithms using simulated data and publicly available images and compare their performance. The results show that our generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.
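
The paper lifts matrix completion to a higher-order "t-matrix" algebra; as a point of reference, the sketch below shows the conventional second-order baseline it generalizes, namely low-rank matrix completion by iterative singular value thresholding. The function name, threshold, and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np

def svt_complete(X, mask, tau=5.0, n_iters=200):
    """Fill the missing entries of X (mask is True where observed) by
    iteratively soft-thresholding the singular values (low-rank prior)."""
    M = np.where(mask, X, 0.0)            # observed entries, zeros elsewhere
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - tau, 0.0)      # shrink the spectrum -> low rank
        L = (U * s) @ Vt                  # current low-rank estimate
        M = np.where(mask, X, L)          # keep observed entries, fill the rest
    return M

# toy usage: recover a rank-2 matrix with ~40% of its entries missing
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
mask = rng.random(A.shape) > 0.4
rec = svt_complete(A, mask)
print("relative error:", np.linalg.norm(rec - A) / np.linalg.norm(A))
```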

Frequency Disentangled Features in Neural Image Compression

  • paper_url: http://arxiv.org/abs/2308.02620
  • repo_url: None
  • paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Piyush Mehta, Mohammad Saeed Ebrahimi Saadabadi, Mohammad Akyash, Nasser M. Nasrabadi
  • for: Improving the design of neural image compression networks so that the entropy model better matches the true distribution of the latent code.
  • methods: Feature-level frequency disentanglement, relaxed scalar quantization, an augmented self-attention score calculation based on the Hadamard product, and a channel-wise autoregressive entropy model.
  • results: The proposed frequency-disentangled network achieves lower bit rates and outperforms both hand-engineered codecs and neural codecs built on computation-heavy spatially autoregressive entropy models.
    Abstract The design of a neural image compression network is governed by how well the entropy model matches the true distribution of the latent code. Apart from the model capacity, this ability is indirectly under the effect of how close the relaxed quantization is to the actual hard quantization. Optimizing the parameters of a rate-distortion variational autoencoder (R-D VAE) is ruled by this approximated quantization scheme. In this paper, we propose a feature-level frequency disentanglement to help the relaxed scalar quantization achieve lower bit rates by guiding the high entropy latent features to include most of the low-frequency texture of the image. In addition, to strengthen the de-correlating power of the transformer-based analysis/synthesis transform, an augmented self-attention score calculation based on the Hadamard product is utilized during both encoding and decoding. Channel-wise autoregressive entropy modeling takes advantage of the proposed frequency separation as it inherently directs high-informational low-frequency channels to the first chunks and conditions the future chunks on it. The proposed network not only outperforms hand-engineered codecs, but also neural network-based codecs built on computation-heavy spatially autoregressive entropy models.
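
The remark about "how close the relaxed quantization is to the actual hard quantization" refers to the standard practice in learned compression of replacing non-differentiable rounding with additive uniform noise during training. The sketch below illustrates that generic substitution only, not the paper's specific entropy model; the function name and arguments are illustrative.

```python
import numpy as np

def quantize(latent, training, rng=None):
    """Relaxed quantization (uniform noise) at training time vs. hard rounding at test time."""
    if training:
        rng = rng or np.random.default_rng()
        return latent + rng.uniform(-0.5, 0.5, size=latent.shape)  # differentiable proxy
    return np.round(latent)  # actual hard quantization fed to the entropy coder

y = np.array([0.2, 1.7, -2.4])
print(quantize(y, training=True))   # noisy surrogate used when estimating the rate
print(quantize(y, training=False))  # quantized symbols used at encoding time
```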

Brain MRI Segmentation using Template-Based Training and Visual Perception Augmentation

  • paper_url: http://arxiv.org/abs/2308.02363
  • repo_url: None
  • paper_authors: Fang-Cheng Yeh
  • for: Training brain MRI segmentation models for multiple species from a single population-averaged MRI template, without requiring large amounts of labeled training data.
  • methods: A template-based training approach that trains a 3D U-Net model from scratch using one brain MRI template and its associated segmentation label, combined with visual perception augmentation to improve robustness to diverse image inputs and mitigate overfitting.
  • results: 3D U-Net models were trained for mouse, rat, marmoset, rhesus, and human brain MRI and achieved high accuracy on segmentation tasks such as skull-stripping, brain segmentation, and tissue probability mapping; the tool effectively addresses the shortage of training data and has significant potential for expanding deep learning applications in image analysis.
    Abstract Deep learning models usually require sufficient training data to achieve high accuracy, but obtaining labeled data can be time-consuming and labor-intensive. Here we introduce a template-based training method to train a 3D U-Net model from scratch using only one population-averaged brain MRI template and its associated segmentation label. The process incorporated visual perception augmentation to enhance the model's robustness in handling diverse image inputs and mitigating overfitting. Leveraging this approach, we trained 3D U-Net models for mouse, rat, marmoset, rhesus, and human brain MRI to achieve segmentation tasks such as skull-stripping, brain segmentation, and tissue probability mapping. This tool effectively addresses the limited availability of training data and holds significant potential for expanding deep learning applications in image analysis, providing researchers with a unified solution to train deep neural networks with only one image sample.

T-UNet: Triplet UNet for Change Detection in High-Resolution Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2308.02356
  • repo_url: https://github.com/pl-2000/t-unet
  • paper_authors: Huan Zhong, Chen Wu
  • for: This paper aims to improve remote sensing image change detection by proposing a novel network called Triplet UNet (T-UNet), which can simultaneously extract object features and change features between pre- and post-time-phase images.
  • methods: The proposed T-UNet uses a three-branch encoder and a multi-branch spatial-spectral cross-attention module (MBSSCA) to interact and fuse the features extracted from the three branches. The network also uses a channel attention mechanism (CAM) and a spatial attention mechanism (SAM) in the decoder stage to fully mine and integrate detailed texture information and semantic localization information.
  • results: The proposed T-UNet accurately detects changes between remote sensing images acquired at different times and effectively discerns the edges of changed objects. Its triplet encoder and MBSSCA module allow it to extract object features and change features simultaneously, leading to more accurate results.
    Abstract Remote sensing image change detection aims to identify the differences between images acquired at different times in the same area. It is widely used in land management, environmental monitoring, disaster assessment and other fields. Currently, most change detection methods are based on Siamese network structure or early fusion structure. Siamese structure focuses on extracting object features at different times but lacks attention to change information, which leads to false alarms and missed detections. Early fusion (EF) structure focuses on extracting features after the fusion of images of different phases but ignores the significance of object features at different times for detecting change details, making it difficult to accurately discern the edges of changed objects. To address these issues and obtain more accurate results, we propose a novel network, Triplet UNet (T-UNet), based on a three-branch encoder, which is capable of simultaneously extracting the object features and the change features between the pre- and post-time-phase images through a triplet encoder. To effectively interact and fuse the features extracted from the three branches of the triplet encoder, we propose a multi-branch spatial-spectral cross-attention module (MBSSCA). In the decoder stage, we introduce the channel attention mechanism (CAM) and spatial attention mechanism (SAM) to fully mine and integrate detailed texture information at the shallow layer and semantic localization information at the deep layer.

Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels

  • paper_url: http://arxiv.org/abs/2308.03792
  • repo_url: https://github.com/stanislavfort/multi-attacks
  • paper_authors: Stanislav Fort
  • for: This paper shows how a single adversarial perturbation $P$ can change the classes of $n$ images $X_1,X_2,\dots,X_n$ from their original classes $c_1, c_2,\dots,c_n$ to desired target classes $c^*_1,c^*_2,\dots,c^*_n$ for up to hundreds of images and target classes at once; these attacks are called "multi-attacks".
  • methods: The paper studies attacks and defenses by characterizing the maximum achievable $n$ under different conditions (such as image resolution) and by estimating the number of regions of high class confidence around a particular image in pixel space to be around $10^{\mathcal{O}(100)}$.
  • results: Immediate consequences include adversarial attacks whose resulting class depends on their intensity, and scale-independent adversarial examples. The paper also illustrates the redundancy and richness of class decision boundaries in pixel space and shows that ensembling classifiers reduces susceptibility to multi-attacks.
    Abstract We show that we can easily design a single adversarial perturbation $P$ that changes the class of $n$ images $X_1,X_2,\dots,X_n$ from their original, unperturbed classes $c_1, c_2,\dots,c_n$ to desired (not necessarily all the same) classes $c^*_1,c^*_2,\dots,c^*_n$ for up to hundreds of images and target classes at once. We call these \textit{multi-attacks}. Characterizing the maximum $n$ we can achieve under different conditions such as image resolution, we estimate the number of regions of high class confidence around a particular image in the space of pixels to be around $10^{\mathcal{O}(100)}$, posing a significant problem for exhaustive defense strategies. We show several immediate consequences of this: adversarial attacks that change the resulting class based on their intensity, and scale-independent adversarial examples. To demonstrate the redundancy and richness of class decision boundaries in the pixel space, we look for its two-dimensional sections that trace images and spell words using particular classes. We also show that ensembling reduces susceptibility to multi-attacks, and that classifiers trained on random labels are more susceptible. Our code is available on GitHub.
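
A minimal sketch of the multi-attack idea, assuming a differentiable classifier wrapped in PyTorch: one shared perturbation $P$ is optimized so that each image $X_i$ is pushed toward its own target class $c^*_i$. The optimizer, step count, and clipping are illustrative choices rather than the authors' exact recipe (their code is linked above).

```python
import torch
import torch.nn.functional as F

def multi_attack(model, images, target_classes, eps=8 / 255, steps=500, lr=1e-2):
    """Optimize a single perturbation P so that model(X_i + P) predicts c*_i for every i."""
    P = torch.zeros_like(images[0], requires_grad=True)   # one perturbation shared by all images
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        logits = model((images + P).clamp(0, 1))           # same P added to every image
        loss = F.cross_entropy(logits, target_classes)     # pull each image to its own target class
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            P.clamp_(-eps, eps)                             # keep the shared attack small
    return P.detach()
```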

A Parameter-efficient Multi-subject Model for Predicting fMRI Activity

  • paper_url: http://arxiv.org/abs/2308.02351
  • repo_url: https://github.com/cmi-dair/algonauts23
  • paper_authors: Connor Lane, Gregory Kiar
  • for: This paper is the Algonauts 2023 submission report for team "BlobGPT", describing the team's model and its components.
  • methods: The model uses a multi-subject linear encoding head attached to a pretrained trunk model; the head consists of three components: a shared multi-layer feature projection, shared plus subject-specific low-dimensional linear transformations, and a shared PCA fMRI embedding.
  • results: The paper presents experimental results for the team's model, demonstrating its effectiveness.
    Abstract This is the Algonauts 2023 submission report for team "BlobGPT". Our model consists of a multi-subject linear encoding head attached to a pretrained trunk model. The multi-subject head consists of three components: (1) a shared multi-layer feature projection, (2) shared plus subject-specific low-dimension linear transformations, and (3) a shared PCA fMRI embedding. In this report, we explain these components in more detail and present some experimental results. Our code is available at https://github.com/cmi-dair/algonauts23.
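
A rough sketch of how a multi-subject linear encoding head with shared and subject-specific components could be structured in PyTorch; dimensions, module names, and the placement of the fixed PCA basis are assumptions based on the report's description, not the team's actual code (which is linked above).

```python
import torch
import torch.nn as nn

class MultiSubjectHead(nn.Module):
    """Shared feature projection -> shared + per-subject linear map -> fixed PCA fMRI basis."""
    def __init__(self, feat_dim, hidden_dim, low_dim, n_components, n_subjects, pca_basis):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.GELU(),
                                  nn.Linear(hidden_dim, low_dim))           # shared projection
        self.shared = nn.Linear(low_dim, n_components)                      # shared linear map
        self.subject = nn.ModuleList(nn.Linear(low_dim, n_components)       # subject-specific maps
                                     for _ in range(n_subjects))
        self.register_buffer("pca_basis", pca_basis)  # (n_components, n_voxels), shared PCA embedding

    def forward(self, features, subject_id):
        z = self.proj(features)
        coeff = self.shared(z) + self.subject[subject_id](z)  # shared plus subject-specific part
        return coeff @ self.pca_basis                          # project back to voxel space
```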

RobustMQ: Benchmarking Robustness of Quantized Models

  • paper_url: http://arxiv.org/abs/2308.02350
  • repo_url: None
  • paper_authors: Yisong Xiao, Aishan Liu, Tianyuan Zhang, Haotong Qin, Jinyang Guo, Xianglong Liu
  • for: Evaluating the robustness of quantized neural network models under various types of noise.
  • methods: An extensive evaluation of quantized models on ImageNet, covering adversarial attacks, natural corruptions, and systematic noises.
  • results: Quantized models exhibit higher adversarial robustness than their floating-point counterparts but are more vulnerable to natural corruptions and systematic noises; increasing the quantization bit-width decreases adversarial robustness while increasing natural and systematic robustness. Among corruptions, impulse noise and glass blur are the most harmful to quantized models while brightness has the least impact; among systematic noises, nearest neighbor interpolation has the highest impact while bilinear, cubic, and area interpolation are the three least harmful.
    Abstract Quantization has emerged as an essential technique for deploying deep neural networks (DNNs) on devices with limited resources. However, quantized models exhibit vulnerabilities when exposed to various noises in real-world applications. Despite the importance of evaluating the impact of quantization on robustness, existing research on this topic is limited and often disregards established principles of robustness evaluation, resulting in incomplete and inconclusive findings. To address this gap, we thoroughly evaluated the robustness of quantized models against various noises (adversarial attacks, natural corruptions, and systematic noises) on ImageNet. The comprehensive evaluation results empirically provide valuable insights into the robustness of quantized models in various scenarios, for example: (1) quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises; (2) in general, increasing the quantization bit-width results in a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness; (3) among corruption methods, \textit{impulse noise} and \textit{glass blur} are the most harmful to quantized models, while \textit{brightness} has the least impact; (4) among systematic noises, the \textit{nearest neighbor interpolation} has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.

Class Incremental Learning with Self-Supervised Pre-Training and Prototype Learning

  • paper_url: http://arxiv.org/abs/2308.02346
  • repo_url: None
  • paper_authors: Wenzhuo Liu, Xinjian Wu, Fei Zhu, Mingming Yu, Chuang Wang, Cheng-Lin Liu
  • for: Addressing catastrophic forgetting in class incremental learning (CIL), where new classes are continuously added over time.
  • methods: A two-stage learning framework with a fixed encoder and an incrementally updated prototype classifier. The encoder is trained with self-supervised learning to produce a feature space with high intrinsic dimensionality, improving its transferability and generality; the prototype classifier incrementally learns prototypes for new classes while retaining the prototypes of previously learned classes, which is crucial for preserving the decision boundary.
  • results: Under the 10-phase incremental setting, the method significantly outperforms state-of-the-art exemplar-based methods that reserve 5 exemplars per class, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.
    Abstract Deep Neural Networks (DNNs) have achieved great success on datasets with a closed class set. However, new classes, like new categories of social media topics, are continuously added to the real world, making it necessary to learn incrementally. This is hard for DNNs because they tend to focus on fitting to new classes while ignoring old classes, a phenomenon known as catastrophic forgetting. State-of-the-art methods rely on knowledge distillation and data replay techniques but still have limitations. In this work, we analyze the causes of catastrophic forgetting in class incremental learning, which owes to three factors: representation drift, representation confusion, and classifier distortion. Based on this view, we propose a two-stage learning framework with a fixed encoder and an incrementally updated prototype classifier. The encoder is trained with self-supervised learning to generate a feature space with high intrinsic dimensionality, thus improving its transferability and generality. The classifier incrementally learns new prototypes while retaining the prototypes of previously learned data, which is crucial in preserving the decision boundary. Our method does not rely on preserved samples of old classes and is thus a non-exemplar based CIL method. Experiments on public datasets show that our method can significantly outperform state-of-the-art exemplar-based methods that reserve 5 exemplars per class, under the incremental setting of 10 phases, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.
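
A minimal sketch of the non-exemplar prototype classifier described above: each class prototype is the mean embedding of that class's training data under the frozen encoder, prototypes of old classes stay untouched across phases, and inference is nearest-prototype. The distance metric and any prototype refinement used by the authors are not reproduced here.

```python
import numpy as np

class PrototypeClassifier:
    def __init__(self):
        self.prototypes = {}  # class_id -> mean embedding from the frozen encoder

    def add_classes(self, embeddings, labels):
        """Learn prototypes for the classes of the current phase; old prototypes stay as-is."""
        for c in np.unique(labels):
            self.prototypes[int(c)] = embeddings[labels == c].mean(axis=0)

    def predict(self, embeddings):
        classes = sorted(self.prototypes)
        protos = np.stack([self.prototypes[c] for c in classes])           # (num_classes, dim)
        dists = np.linalg.norm(embeddings[:, None, :] - protos[None], axis=-1)
        return np.array(classes)[dists.argmin(axis=1)]                     # nearest prototype wins
```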

Generative Image Priors for MRI Reconstruction Trained from Magnitude-Only Images

  • paper_url: http://arxiv.org/abs/2308.02340
  • repo_url: https://github.com/mrirecon/image-priors
  • paper_authors: Guanxiong Luo, Xiaoqing Wang, Moritz Blumenthal, Martin Schilling, Erik Hans Ulrich Rauf, Raviteja Kotikalapudi, Niels Focke, Martin Uecker
  • for: Constructing generic and robust generative image priors from large magnitude-only MRI datasets augmented with phase information, to improve the quality of MRI reconstruction.
  • methods: The workflow prepares training datasets from magnitude-only MR images, augments them with phase information, and trains generative priors of complex-valued images; the trained priors are then evaluated with both linear and nonlinear reconstruction for compressed sensing parallel imaging.
  • results: Priors trained on complex-valued images outperform priors trained only on magnitude images, and a prior trained on a larger dataset is more robust. The generative priors are also superior to L1-wavelet regularization for compressed sensing parallel imaging with high undersampling.
    Abstract Purpose: In this work, we present a workflow to construct generic and robust generative image priors from magnitude-only images. The priors can then be used for regularization in reconstruction to improve image quality. Methods: The workflow begins with the preparation of training datasets from magnitude-only MR images. This dataset is then augmented with phase information and used to train generative priors of complex images. Finally, trained priors are evaluated using both linear and nonlinear reconstruction for compressed sensing parallel imaging with various undersampling schemes. Results: The results of our experiments demonstrate that priors trained on complex images outperform priors trained only on magnitude images. Additionally, a prior trained on a larger dataset exhibits higher robustness. Finally, we show that the generative priors are superior to L1 -wavelet regularization for compressed sensing parallel imaging with high undersampling. Conclusion: These findings stress the importance of incorporating phase information and leveraging large datasets to raise the performance and reliability of the generative priors for MRI reconstruction. Phase augmentation makes it possible to use existing image databases for training.

Improving Scene Graph Generation with Superpixel-Based Interaction Learning

  • paper_url: http://arxiv.org/abs/2308.02339
  • repo_url: None
  • paper_authors: Jingyi Wang, Can Zhang, Jinfa Huang, Botao Ren, Zhidong Deng
  • for: Improving the accuracy of Scene Graph Generation (SGG) by better capturing the relationships and semantics among entities in a scene.
  • methods: A novel Superpixel-based Interaction Learning (SIL) method that treats a scene as a set of points, clusters them into superpixels, and models intra-entity and cross-entity interactions among superpixels to capture fine-grained interactions between entities.
  • results: Extensive experiments show that SIL achieves state-of-the-art performance on the Visual Genome and Open Image V6 benchmarks, and it can be combined with existing box-level methods in a plug-and-play fashion to boost their performance.
    Abstract Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities utilizing box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interactions between boxes, which inadequately capture contextual semantics for relationship modeling, practically limiting the development of the field. In this paper, we take the initiative to explore and propose a generic paradigm termed Superpixel-based Interaction Learning (SIL) to remedy coarse-grained interactions at the box level. It allows us to model fine-grained interactions at the superpixel level in SGG. Specifically, (i) we treat a scene as a set of points and cluster them into superpixels representing sub-regions of the scene. (ii) We explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage. Extensive experiments on two challenging benchmarks (Visual Genome and Open Image V6) prove that our SIL enables fine-grained interaction at the superpixel level above previous box-level methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing box-level approaches in a plug-and-play fashion. In particular, SIL brings an average improvement of 2.0% mR (even up to 3.4%) of baselines for the PredCls task on Visual Genome, which facilitates its integration into any existing box-level method.

Diffusion-Augmented Depth Prediction with Sparse Annotations

  • paper_url: http://arxiv.org/abs/2308.02283
  • repo_url: None
  • paper_authors: Jiaqi Li, Yiran Wang, Zihao Huang, Jinghong Zheng, Ke Xian, Zhiguo Cao, Jianming Zhang
  • for: Proposing a supervised framework that handles sparse depth annotations in autonomous driving scenes and improves the structure and robustness of depth estimation.
  • methods: A Diffusion-Augmented Depth Prediction (DADP) framework that leverages the structural characteristics of diffusion models to enforce depth structures in depth models, together with an object-guided integrality loss that further enhances regional structure integrality.
  • results: DADP is evaluated on three driving benchmarks and achieves significant improvements in depth structures and robustness, offering a new perspective on depth estimation with sparse annotations in autonomous driving scenes.
    Abstract Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods are proposed for the problem. Their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.

SURE-Val: Safe Urban Relevance Extension and Validation

  • paper_url: http://arxiv.org/abs/2308.02266
  • repo_url: None
  • paper_authors: Kai Storms, Ken Mori, Steven Peters
  • for: Evaluating the perception components of an automated driving system requires defining which objects are relevant; the urban domain is popular among perception datasets but relevance is insufficiently specified for it, so this work adopts an existing relevance definition from the highway domain and extends it to the urban domain.
  • methods: The existing relevance definition is extended to the urban domain, and a novel relevance validation method leveraging a motion prediction component is proposed to validate the definition.
  • results: The validation procedure is verified using criteria specifically designed to exclude relevant objects, and the validation method is successfully applied to the proposed relevance criteria, supporting their validity.
    Abstract To evaluate perception components of an automated driving system, it is necessary to define the relevant objects. While the urban domain is popular among perception datasets, relevance is insufficiently specified for this domain. Therefore, this work adopts an existing method to define relevance in the highway domain and expands it to the urban domain. While different conceptualizations and definitions of relevance are present in literature, there is a lack of methods to validate these definitions. Therefore, this work presents a novel relevance validation method leveraging a motion prediction component. The validation leverages the idea that removing irrelevant objects should not influence a prediction component which reflects human driving behavior. The influence on the prediction is quantified by considering the statistical distribution of prediction performance across a large-scale dataset. The validation procedure is verified using criteria specifically designed to exclude relevant objects. The validation method is successfully applied to the relevance criteria from this work, thus supporting their validity.

On the Calibration of Uncertainty Estimation in LiDAR-based Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.02248
  • repo_url: None
  • paper_authors: Mariella Dreissig, Florian Piewak, Joschka Boedecker
  • for: Improving the reliability of deep learning-based perception models, particularly for downstream tasks such as autonomous driving that depend on accurate confidence estimates.
  • methods: A novel metric that measures the per-class confidence calibration quality of a semantic segmentation model, computed from class-wise sparsification curves based on the uncertainty estimates.
  • results: The metric can be used to evaluate uncertainty estimation methods with respect to underrepresented classes and, as a second use, to automatically find label problems and thereby improve the quality of hand- or auto-annotated datasets.
    Abstract The confidence calibration of deep learning-based perception models plays a crucial role in their reliability. Especially in the context of autonomous driving, downstream tasks like prediction and planning depend on accurate confidence estimates. In point-wise multiclass classification tasks like semantic segmentation the model has to deal with heavy class imbalances. Due to their underrepresentation, the confidence calibration of classes with smaller instances is challenging but essential, not only for safety reasons. We propose a metric to measure the confidence calibration quality of a semantic segmentation model with respect to individual classes. It is calculated by computing sparsification curves for each class based on the uncertainty estimates. We use the classification calibration metric to evaluate uncertainty estimation methods with respect to their confidence calibration of underrepresented classes. We furthermore suggest a double use for the method to automatically find label problems to improve the quality of hand- or auto-annotated datasets.
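
A rough sketch of a per-class sparsification curve, the building block behind the proposed calibration metric: the pixels of one class are sorted by predicted uncertainty, the most uncertain fraction is removed, and the error remaining on the rest is tracked. How the curves are aggregated into a single calibration score, and the exact error measure, are assumptions not reproduced here.

```python
import numpy as np

def sparsification_curve(errors, uncertainties, fractions=np.linspace(0.0, 0.9, 10)):
    """errors, uncertainties: per-pixel values for a single class.
    Returns the mean remaining error after removing the most uncertain fraction of pixels."""
    order = np.argsort(-uncertainties)               # most uncertain pixels first
    errors = errors[order]
    curve = []
    for f in fractions:
        kept = errors[int(f * len(errors)):]          # drop the top-f most uncertain pixels
        curve.append(kept.mean() if len(kept) else 0.0)
    return np.array(curve)

# For a well-calibrated class the curve decreases monotonically: removing the pixels
# the model is most uncertain about should remove most of the remaining error.
```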

Improving Human-Object Interaction Detection via Virtual Image Learning

  • paper_url: http://arxiv.org/abs/2308.02606
  • repo_url: None
  • paper_authors: Shuman Fang, Shuai Liu, Jie Li, Guannan Jiang, Xianming Lin, Rongrong Ji
  • for: Improving the accuracy of Human-Object Interaction (HOI) detection, especially for long-tailed interaction-object pair categories.
  • methods: A Virtual Image Learning (VIL) approach comprising Multiple Steps Image Creation (MUSIC) and a teacher-student framework; an Adaptive Matching-and-Filtering (AMF) module constructs pseudo-labels for virtual images whose initial labels are inaccurate or inadequate.
  • results: The method brings significant improvements to multiple detectors and achieves new state-of-the-art results on two benchmarks.
    Abstract Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects, which plays a crucial role in high-level semantic understanding tasks. However, most works pursue designing better architectures to learn overall features more efficiently, while ignoring the long-tail nature of interaction-object pair categories. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL). Firstly, a novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images. In this stage, virtual images are generated based on prompts with specific characterizations and selected by multi-filtering processes. Secondly, we use both virtual and real images to train the model with the teacher-student framework. Considering the initial labels of some virtual images are inaccurate and inadequate, we devise an Adaptive Matching-and-Filtering (AMF) module to construct pseudo-labels. Our method is independent of the internal structure of HOI detectors, so it can be combined with off-the-shelf methods by training merely 10 additional epochs. With the assistance of our method, multiple methods obtain significant improvements, and new state-of-the-art results are achieved on two benchmarks.

MSECNet: Accurate and Robust Normal Estimation for 3D Point Clouds by Multi-Scale Edge Conditioning

  • paper_url: http://arxiv.org/abs/2308.02237
  • repo_url: https://github.com/martianxiu/MSECNet
  • paper_authors: Haoyi Xiu, Xin Liu, Weimin Wang, Kyoung-Sook Kim, Masashi Matsuoka
  • for: surface normals estimation from 3D point clouds, especially in regions with rapidly changing normals
  • methods: MSECNet, a novel approach that treats normal variation modeling as an edge detection problem, consisting of a backbone network and a multi-scale edge conditioning (MSEC) stream
  • results: outperforms existing methods on both synthetic and real-world datasets while running significantly faster, and demonstrates effectiveness in surface reconstruction
    Abstract Estimating surface normals from 3D point clouds is critical for various applications, including surface reconstruction and rendering. While existing methods for normal estimation perform well in regions where normals change slowly, they tend to fail where normals vary rapidly. To address this issue, we propose a novel approach called MSECNet, which improves estimation in normal varying regions by treating normal variation modeling as an edge detection problem. MSECNet consists of a backbone network and a multi-scale edge conditioning (MSEC) stream. The MSEC stream achieves robust edge detection through multi-scale feature fusion and adaptive edge detection. The detected edges are then combined with the output of the backbone network using the edge conditioning module to produce edge-aware representations. Extensive experiments show that MSECNet outperforms existing methods on both synthetic (PCPNet) and real-world (SceneNN) datasets while running significantly faster. We also conduct various analyses to investigate the contribution of each component in the MSEC stream. Finally, we demonstrate the effectiveness of our approach in surface reconstruction.

FB-BEV: BEV Representation from Forward-Backward View Transformations

  • paper_url: http://arxiv.org/abs/2308.02236
  • repo_url: https://github.com/nvlabs/fb-bev
  • paper_authors: Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez
  • for: Proposing a novel view transformation module that improves on existing forward-projection and backward-projection methods, boosting the performance of camera-based Bird-Eye-View (BEV) perception systems.
  • methods: The two dominant view transformation paradigms are forward projection (e.g., Lift-Splat-Shoot), which yields sparsely projected BEV features, and backward projection (e.g., BEVFormer), which can produce false-positive BEV features due to inaccurate depth. The proposed forward-backward view transformation module compensates for the deficiencies of both to obtain higher-quality BEV representations.
  • results: The resulting FB-BEV model, which combines the forward-backward view transformation module with BEVFormer, achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV.
    Abstract View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV.

Painterly Image Harmonization using Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.02228
  • repo_url: https://github.com/bcmi/phdiffusion-painterly-image-harmonization
  • paper_authors: Lingxiao Lu, Jiangtong Li, Junyan Cao, Li Niu, Liqing Zhang
  • for: Inserting photographic objects into paintings to obtain artistically coherent composite images.
  • methods: A novel Painterly Harmonization stable Diffusion model (PHDiffusion) with a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module; the adaptive encoder and the DEF module first stylize foreground features within each encoder, and the stylized foreground features from both encoders are then combined to guide the harmonization process.
  • results: Compared with state-of-the-art models from related fields, PHDiffusion stylizes the foreground more sufficiently while retaining finer content.
    Abstract Painterly image harmonization aims to insert photographic objects into paintings and obtain artistically coherent composite images. Previous methods for this task mainly rely on inference optimization or generative adversarial network, but they are either very time-consuming or struggling at fine control of the foreground objects (e.g., texture and content details). To address these issues, we propose a novel Painterly Harmonization stable Diffusion model (PHDiffusion), which includes a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module. Specifically, the adaptive encoder and the DEF module first stylize foreground features within each encoder. Then, the stylized foreground features from both encoders are combined to guide the harmonization process. During training, besides the noise loss in diffusion model, we additionally employ content loss and two style losses, i.e., AdaIN style loss and contrastive style loss, aiming to balance the trade-off between style migration and content preservation. Compared with the state-of-the-art models from related fields, our PHDiffusion can stylize the foreground more sufficiently and simultaneously retain finer content. Our code and model are available at https://github.com/bcmi/PHDiffusion-Painterly-Image-Harmonization.

Deep Semantic Model Fusion for Ancient Agricultural Terrace Detection

  • paper_url: http://arxiv.org/abs/2308.02225
  • repo_url: https://github.com/wangyi111/international-archaeology-ai-challenge
  • paper_authors: Yi Wang, Chenying Liu, Arti Tiwari, Micha Silver, Arnon Karnieli, Xiao Xiang Zhu, Conrad M Albrecht
  • for: Improving the efficiency of detecting ancient agricultural terraces by using machine learning for automatic detection and recognition over archaeological landscapes.
  • methods: A deep semantic model fusion method whose inputs are aerial images and LiDAR-generated terrain features of the Negev desert; two deep semantic segmentation models (DeepLabv3+ and UNet with an EfficientNet backbone) are trained and fused to produce segmentation maps of ancient terraces and walls.
  • results: The proposed method won first prize in the International AI Archaeology Challenge, demonstrating its effectiveness.
    Abstract Discovering ancient agricultural terraces in desert regions is important for the monitoring of long-term climate changes on the Earth's surface. However, traditional ground surveys are both costly and limited in scale. With the increasing accessibility of aerial and satellite data, machine learning techniques bear large potential for the automatic detection and recognition of archaeological landscapes. In this paper, we propose a deep semantic model fusion method for ancient agricultural terrace detection. The input data includes aerial images and LiDAR generated terrain features in the Negev desert. Two deep semantic segmentation models, namely DeepLabv3+ and UNet, with EfficientNet backbone, are trained and fused to provide segmentation maps of ancient terraces and walls. The proposed method won the first prize in the International AI Archaeology Challenge. Codes are available at https://github.com/wangyi111/international-archaeology-ai-challenge.

Balanced Classification: A Unified Framework for Long-Tailed Object Detection

  • paper_url: http://arxiv.org/abs/2308.02213
  • repo_url: https://github.com/tianhao-qi/bacl
  • paper_authors: Tianhao Qi, Hongtao Xie, Pandeng Li, Jiannan Ge, Yongdong Zhang
  • for: Addressing the performance degradation of detectors on long-tailed data and improving recognition of underrepresented tail categories.
  • methods: A unified framework, BAlanced CLassification (BACL), that adaptively rectifies inequalities caused by disparities in category distribution and dynamically intensifies sample diversity. A foreground classification balance loss (FCBL) with pairwise class-aware margins and auto-adjusted weight terms alleviates the domination of head categories without over-suppressing tail categories, and a dynamic feature hallucination module (FHM) synthesizes hallucinated samples to enrich feature-space data variance.
  • results: On the LVIS benchmark, BACL surpasses vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP overall and 16.1% AP on tail categories, and extensive experiments show consistent improvements across datasets, backbones, and architectures.
    Abstract Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories. In this paper, we contend that the learning bias originates from two factors: 1) the unequal competition arising from the imbalanced distribution of foreground categories, and 2) the lack of sample diversity in tail categories. To tackle these issues, we introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution and dynamic intensification of sample diversities in a synchronized manner. Specifically, a novel foreground classification balance loss (FCBL) is developed to ameliorate the domination of head categories and shift attention to difficult-to-differentiate categories by introducing pairwise class-aware margins and auto-adjusted weight terms, respectively. This loss prevents the over-suppression of tail categories in the context of unequal competition. Moreover, we propose a dynamic feature hallucination module (FHM), which enhances the representation of tail categories in the feature space by synthesizing hallucinated samples to introduce additional data variances. In this divide-and-conquer approach, BACL sets a new state-of-the-art on the challenging LVIS benchmark with a decoupled training pipeline, surpassing vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP and 16.1% AP for overall and tail categories. Extensive experiments demonstrate that BACL consistently achieves performance improvements across various datasets with different backbones and architectures. Code and models are available at https://github.com/Tianhao-Qi/BACL.

Paired Competing Neurons Improving STDP Supervised Local Learning In Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.02194
  • repo_url: None
  • paper_authors: Gaspard Goupy, Pierre Tirilly, Ioan Marius Bilasco
  • for: Proposing an SNN-based image recognition approach that reduces the high energy consumption associated with training ANNs.
  • methods: Spike Timing-Dependent Plasticity (STDP) is used as the local learning rule; a supervised rule named Stabilized Supervised STDP (S2-STDP) trains the classification layer, and a Paired Competing Neurons (PCN) training architecture further enhances the learning capability of that layer.
  • results: For comparable architectures and numbers of neurons, the method outperforms the current supervised STDP-based state of the art on MNIST, Fashion-MNIST, and CIFAR-10. PCN further improves the performance of S2-STDP regardless of configuration and without introducing hyperparameters, and the approach exhibits improved hyperparameter robustness, reducing the need for tuning.
    Abstract Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware has the potential to significantly reduce the high energy consumption of Artificial Neural Networks (ANNs) training on modern computers. The biological plausibility of SNNs allows them to benefit from bio-inspired plasticity rules, such as Spike Timing-Dependent Plasticity (STDP). STDP offers gradient-free and unsupervised local learning, which can be easily implemented on neuromorphic hardware. However, relying solely on unsupervised STDP to perform classification tasks is not enough. In this paper, we propose Stabilized Supervised STDP (S2-STDP), a supervised STDP learning rule to train the classification layer of an SNN equipped with unsupervised STDP. S2-STDP integrates error-modulated weight updates that align neuron spikes with desired timestamps derived from the average firing time within the layer. Then, we introduce a training architecture called Paired Competing Neurons (PCN) to further enhance the learning capabilities of our classification layer trained with S2-STDP. PCN associates each class with paired neurons and encourages neuron specialization through intra-class competition. We evaluated our proposed methods on image recognition datasets, including MNIST, Fashion-MNIST, and CIFAR-10. Results showed that our methods outperform current supervised STDP-based state of the art, for comparable architectures and numbers of neurons. Also, the use of PCN enhances the performance of S2-STDP, regardless of the configuration, and without introducing any hyperparameters.Further analysis demonstrated that our methods exhibited improved hyperparameter robustness, which reduces the need for tuning.
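
For reference, the standard pair-based STDP weight update that S2-STDP builds on can be written as below; the error-modulated form used in S2-STDP, with desired timestamps derived from the average firing time within the layer, is a variation on this rule and is not reproduced here.

$$
\Delta w =
\begin{cases}
  A_{+}\, e^{-\Delta t/\tau_{+}}, & \Delta t = t_{\text{post}} - t_{\text{pre}} > 0 \quad \text{(pre fires before post: potentiation)} \\
  -A_{-}\, e^{\,\Delta t/\tau_{-}}, & \Delta t < 0 \quad \text{(post fires before pre: depression)}
\end{cases}
$$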

ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2308.02191
  • repo_url: None
  • paper_authors: Qiang Zhou, Chaohui Yu, Jingliang Li, Yuang Liu, Jing Wang, Zhibin Wang
  • for: Proposing an efficient end-to-end self-supervised multi-view stereo (MVS) framework that addresses the high memory consumption of existing end-to-end self-supervised MVS methods.
  • methods: A memory-efficient architecture that reduces memory usage by 43% without compromising model performance, together with a novel asymmetric view selection policy and region-aware depth consistency to improve performance.
  • results: Extensive experiments on the DTU and Tanks & Temples benchmarks show that the proposed ES-MVSNet achieves state-of-the-art performance among end-to-end self-supervised MVS methods and is competitive with many supervised and multi-stage self-supervised methods.
    Abstract Compared to multi-stage self-supervised multi-view stereo (MVS) methods, the end-to-end (E2E) approach has received more attention due to its concise and efficient training pipeline. Recent E2E self-supervised MVS approaches have integrated third-party models (such as optical flow models, semantic segmentation models, NeRF models, etc.) to provide additional consistency constraints, which increases GPU memory consumption and complicates the model's structure and training pipeline. In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet. To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance. Furthermore, with the novel design of asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that the proposed ES-MVSNet approach achieves state-of-the-art performance among E2E self-supervised MVS methods and competitive performance with many supervised and multi-stage self-supervised methods.

Synthetic outlier generation for anomaly detection in autonomous driving

  • paper_url: http://arxiv.org/abs/2308.02184
  • repo_url: None
  • paper_authors: Martin Bikandi, Gorka Velez, Naiara Aginako, Itziar Irigoien
  • for: Improving anomaly detection performance in autonomous driving to prevent safety-critical incidents.
  • methods: Modifying the training stage of the state-of-the-art DenseHybrid model and proposing a simplified detector.
  • results: Significant performance improvements in anomaly detection; the simplified detector achieves results comparable to the modified DenseHybrid approach while also surpassing the original DenseHybrid model.
    Abstract Anomaly detection, or outlier detection, is a crucial task in various domains to identify instances that significantly deviate from established patterns or the majority of data. In the context of autonomous driving, the identification of anomalies is particularly important to prevent safety-critical incidents, as deep learning models often exhibit overconfidence in anomalous or outlier samples. In this study, we explore different strategies for training an image semantic segmentation model with an anomaly detection module. By introducing modifications to the training stage of the state-of-the-art DenseHybrid model, we achieve significant performance improvements in anomaly detection. Moreover, we propose a simplified detector that achieves comparable results to our modified DenseHybrid approach, while also surpassing the performance of the original DenseHybrid model. These findings demonstrate the efficacy of our proposed strategies for enhancing anomaly detection in the context of autonomous driving.

Scene-aware Human Pose Generation using Transformer

  • paper_url: http://arxiv.org/abs/2308.02177
  • repo_url: None
  • paper_authors: Jieteng Yao, Junjie Chen, Li Niu, Bin Sheng
  • for: scene understanding and intelligent robotics
  • methods: template-based human pose generation, interaction between query embeddings and scene feature map, knowledge distillation
  • results: effective prediction of scale and offsets for each pose template, demonstrated effectiveness on Sitcom dataset
    Abstract Affordance learning considers the interaction opportunities for an actor in the scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods could be divided into two categories depending on whether using pose templates. Our proposed method belongs to the template-based category, which benefits from the representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on Sitcom dataset demonstrate the effectiveness of our method.

Efficient Labelling of Affective Video Datasets via Few-Shot & Multi-Task Contrastive Learning

  • paper_url: http://arxiv.org/abs/2308.02173
  • repo_url: https://github.com/ravikiranrao/mtclar-fsl
  • paper_authors: Ravikiran Parameshwara, Ibrahim Radwan, Akshay Asthana, Iman Abbasnejad, Ramanathan Subramanian, Roland Goecke
  • for: Proposing a multi-task contrastive learning method that reduces the amount of labelled training data required by deep learning models for accurate affect prediction.
  • methods: Multi-Task Contrastive Learning for Affect Representation (MT-CLAR) combines multi-task learning with a Siamese network trained via contrastive learning to infer, from a pair of expressive facial images, the (dis)similarity between the facial expressions and the difference in valence and arousal levels of the two faces; the framework is further extended to automated video labelling from a small labelled support set.
  • results: Experiments on the AFEW-VA dataset show that valence and arousal predictions via MT-CLAR are comparable to the state of the art, and the method significantly outperforms the state of the art with a support set roughly 6% the size of the video dataset.
    Abstract Whilst deep learning techniques have achieved excellent emotion prediction, they still require large amounts of labelled training data, which are (a) onerous and tedious to compile, and (b) prone to errors and biases. We propose Multi-Task Contrastive Learning for Affect Representation (\textbf{MT-CLAR}) for few-shot affect inference. MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer from a pair of expressive facial images (a) the (dis)similarity between the facial expressions, and (b) the difference in valence and arousal levels of the two faces. We further extend the image-based MT-CLAR framework for automated video labelling where, given one or a few labelled video frames (termed \textit{support-set}), MT-CLAR labels the remainder of the video for valence and arousal. Experiments are performed on the AFEW-VA dataset with multiple support-set configurations; moreover, supervised learning on representations learnt via MT-CLAR are used for valence, arousal and categorical emotion prediction on the AffectNet and AFEW-VA datasets. The results show that valence and arousal predictions via MT-CLAR are very comparable to the state-of-the-art (SOTA), and we significantly outperform SOTA with a support-set $\approx$6\% the size of the video dataset.
    摘要 而深度学习技术已经实现了出色的情感预测,但它们仍需要大量标注训练数据,这些数据是(a)困难和繁琐准备,以及(b)容易出现错误和偏见。我们提出了多任务对照学习 для情感表示(\textbf{MT-CLAR}),用于几个shot情感预测。MT-CLAR将多任务学习与对照学习相结合,通过对两个表达性脸部图像进行比较,推断两个脸部图像之间的(不)相似性,以及两个脸部图像的情感强度水平之间的差异。我们还扩展了基于图像的MT-CLAR框架,用于自动化视频标注,给定一个或几个标注过的视频帧(称为\textit{支持集}),MT-CLAR将视频中剩下的所有帧标注为情感强度和激动度水平。我们在AFEW-VA数据集上进行了多种支持集配置的实验,并使用MT-CLAR学习的表示进行情感预测,包括情感强度、激动度和 categorical emotion预测。结果显示,MT-CLAR在情感预测中的值和激动度预测与状态之前(SOTA)几乎相同,而且我们在支持集大小为6%的视频数据集上显著超越SOTA。
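
The pair-based objective above is straightforward to prototype. Below is a minimal PyTorch sketch, assuming a toy backbone and illustrative loss weights, of a Siamese model trained with a contrastive (dis)similarity term plus regression of the valence/arousal difference between the two faces; it illustrates the general idea, not the authors' released MT-CLAR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseAffectModel(nn.Module):
    """Toy Siamese encoder with a shared backbone and two heads.

    - `embed` produces an embedding used for the contrastive (dis)similarity term.
    - `va_head` predicts the (valence, arousal) difference between the two inputs.
    Backbone and head sizes are placeholders, not the paper's architecture.
    """
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.embed = nn.Linear(32, feat_dim)
        self.va_head = nn.Linear(2 * 32, 2)  # predicts (d_valence, d_arousal)

    def forward(self, img_a, img_b):
        fa, fb = self.backbone(img_a), self.backbone(img_b)
        za, zb = self.embed(fa), self.embed(fb)
        d_va = self.va_head(torch.cat([fa, fb], dim=1))
        return za, zb, d_va


def multitask_loss(za, zb, d_va_pred, same_label, d_va_true, margin=1.0, lam=1.0):
    """Contrastive pair loss + regression of the valence/arousal difference."""
    dist = F.pairwise_distance(za, zb)
    contrastive = same_label * dist.pow(2) + \
                  (1 - same_label) * F.relu(margin - dist).pow(2)
    regression = F.mse_loss(d_va_pred, d_va_true)
    return contrastive.mean() + lam * regression


if __name__ == "__main__":
    model = SiameseAffectModel()
    a, b = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
    same = torch.randint(0, 2, (4,)).float()   # 1 if same expression class
    d_va = torch.randn(4, 2)                   # ground-truth VA difference
    za, zb, d_pred = model(a, b)
    print(multitask_loss(za, zb, d_pred, same, d_va))
```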

Learning Referring Video Object Segmentation from Weak Annotation

  • paper_url: http://arxiv.org/abs/2308.02162
  • repo_url: None
  • paper_authors: Wangbo Zhao, Kepan Nan, Songyang Zhang, Kai Chen, Dahua Lin, Yang You
  • for: developing a referring video object segmentation method that segments the target object in every frame without relying on densely annotated training data
  • methods: a new weak-annotation scheme labels only the frame where the object first appears with a mask and uses bounding boxes for subsequent frames; a cross-frame segmentation method with language-guided dynamic filters exploits the valuable mask and box annotations, and a bi-level contrastive learning scheme encourages discriminative pixel-level representations (a toy sketch of the dynamic filters follows this entry)
  • results: the method achieves competitive segmentation performance without dense mask annotation, as shown by extensive experiments and ablative analyses
    Abstract Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object. Previous RVOS methods have achieved significant performance with densely-annotated datasets, whose construction is expensive and time-consuming. To relieve the burden of data annotation while maintaining sufficient supervision for segmentation, we propose a new annotation scheme, in which we label the frame where the object first appears with a mask and use bounding boxes for the subsequent frames. Based on this scheme, we propose a method to learn from this weak annotation. Specifically, we design a cross frame segmentation method, which uses the language-guided dynamic filters to thoroughly leverage the valuable mask annotation and bounding boxes. We further develop a bi-level contrastive learning method to encourage the model to learn discriminative representation at the pixel level. Extensive experiments and ablative analyses show that our method is able to achieve competitive performance without the demand of dense mask annotation. The code will be available at https://github.com/wangbo-zhao/WRVOS/.
    摘要 <>使用 Referring video object segmentation (RVOS) 任务,目标是在所有视频帧中基于一句描述对象进行对象分割。先前的 RVOS 方法已经在 densely-annotated 数据集上达到了显著性能,但是构建这些数据集是贵重的和时间consuming。为了减轻数据标注的负担而不失去分割的足够supervision,我们提议一种新的标注方案,在该方案中,我们将第一帧中的对象批注为Mask,并使用 bounding boxes 来标注后续帧。基于这种方案,我们提议一种从weak annotation学习的方法。我们设计了一种 across frame segmentation 方法,使用语言导向的动态滤波器,以完全利用valuable的批注和 bounding boxes。我们还开发了一种 bi-level 对比学习方法,以促进模型学习精细的像素表示。广泛的实验和ablative 分析表明,我们的方法可以在无需厚度的批注下实现竞争性的性能。代码将在 https://github.com/wangbo-zhao/WRVOS/ 上发布。
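
Language-guided dynamic filters, mentioned in the methods above, can be prototyped as per-sample convolution kernels generated from a sentence embedding and applied to each frame's features. The sketch below is a hedged illustration with assumed dimensions and a 1x1 kernel; the paper's actual filter design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedDynamicFilter(nn.Module):
    """Toy language-guided dynamic filter.

    A sentence embedding is mapped to a set of 1x1 convolution kernels that are
    applied to the per-frame feature map, so the same language cue can be
    propagated from the mask-annotated frame to box-annotated frames.
    All dimensions here are illustrative assumptions.
    """
    def __init__(self, lang_dim=256, feat_dim=64, out_dim=1):
        super().__init__()
        self.feat_dim, self.out_dim = feat_dim, out_dim
        self.kernel_gen = nn.Linear(lang_dim, out_dim * feat_dim)  # 1x1 kernels

    def forward(self, frame_feat, lang_emb):
        # frame_feat: (B, C, H, W), lang_emb: (B, lang_dim)
        b, c, h, w = frame_feat.shape
        kernels = self.kernel_gen(lang_emb).view(b, self.out_dim, self.feat_dim, 1, 1)
        out = []
        for i in range(b):  # per-sample conv with its own language-conditioned kernel
            out.append(F.conv2d(frame_feat[i:i + 1], kernels[i]))
        return torch.cat(out, dim=0)  # (B, out_dim, H, W) coarse mask logits


if __name__ == "__main__":
    m = LanguageGuidedDynamicFilter()
    resp = m(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(resp.shape)  # torch.Size([2, 1, 32, 32])
```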

M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition

  • paper_url: http://arxiv.org/abs/2308.02161
  • repo_url: None
  • paper_authors: Jiyong Moon, Junseok Lee, Yunju Lee, Seongsik Park
  • for: improving the multi-scale capability of ViT-based fine-grained visual recognition (FGVR) models
  • methods: multi-scale patch selection (MSPS) picks salient patches at different stages of a multi-scale ViT, while class token transfer (CTT) and multi-scale cross-attention (MSCA) model cross-scale interactions among the selected patches and reflect them in model decisions (a toy patch-selection sketch follows this entry)
  • results: compared with single-scale patch selection (SSPS), the approach yields richer object representations from small to large objects, and the resulting M2Former outperforms CNN- and ViT-based models on several widely used FGVR benchmarks
    Abstract Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks.
    摘要 最近,视觉变换器(ViT)已经活跃地应用于细腻视识别(FGVR)。ViT可以有效地模型分割后的对象区域之间的相互依赖关系,通过自然的自注意机制。此外,patch选择被用于ViT,以除去重复的patch信息并强调最重要的对象patch。然而,现有的ViT基于的FGVR模型受到一个固定的见识场所限制,这限制了表达的丰富性和对比例变化的抗性。因此,我们提出了多比例 patch选择(MSPS),以改进现有的ViT基于模型的多比例能力。具体来说,MSPS在不同的级别上选择不同的比例的patch,并在不同的阶段使用多比例视Transformer(MS-ViT)来模型交叉的多比例关系。此外,我们引入了类token传递(CTT)和多比例交叉注意(MSCA),以便充分反映交叉的多比例关系,并将其纳入模型决策中。相比之下,单比例patch选择(SSPS)只能在固定的比例下进行选择,从而封锁了表达的层次结构和对比例变化的能力。因此,我们提出了M2Former,它在多个普遍使用的FGVR标准benchmark上表现出色,并且超越了基于CNN和ViT的模型。
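
The core of multi-scale patch selection is keeping the k most salient patch tokens at each stage. The sketch below assumes a per-token saliency score (for example class-token attention) is already available; the number of kept patches per scale is an illustrative choice, not the paper's setting.

```python
import torch

def multi_scale_patch_selection(feats_per_scale, saliency_per_scale, k_per_scale):
    """Toy multi-scale patch selection (MSPS).

    For each stage of a multi-scale ViT, keep the k most salient patch tokens,
    scored here by a given saliency value per token. Shapes and the saliency
    source are illustrative assumptions.
    """
    selected = []
    for feats, sal, k in zip(feats_per_scale, saliency_per_scale, k_per_scale):
        # feats: (B, N_s, C), sal: (B, N_s)
        topk = sal.topk(k, dim=1).indices                        # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, feats.size(-1))  # (B, k, C)
        selected.append(torch.gather(feats, 1, idx))             # (B, k, C)
    # Concatenate selections from all scales for later cross-scale interaction
    return torch.cat(selected, dim=1)


if __name__ == "__main__":
    B, C = 2, 96
    feats = [torch.randn(B, 196, C), torch.randn(B, 49, C)]
    sal = [torch.rand(B, 196), torch.rand(B, 49)]
    print(multi_scale_patch_selection(feats, sal, k_per_scale=[12, 6]).shape)
```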

CTP-Net: Character Texture Perception Network for Document Image Forgery Localization

  • paper_url: http://arxiv.org/abs/2308.02158
  • repo_url: None
  • paper_authors: Xin Liao, Siliang Chen, Jiaxin Chen, Tianyi Wang, Xiehua Li
  • for: improving forgery localization in document images with a Character Texture Perception Network (CTP-Net)
  • methods: a Character Texture Stream (CTS) based on optical character recognition captures features of the text regions, which are the most vulnerable parts of a document image, while an Image Texture Stream (ITS) captures texture features of the whole image; CTP-Net combines the two streams to reveal subtle forgery traces
  • results: experiments on several datasets show that CTP-Net localizes multi-scale forged regions and outperforms state-of-the-art forgery localization methods; to overcome the scarcity of fake document images, a data generation strategy is used to construct a Fake Chinese Trademark dataset (FCTM)
    Abstract Due to the progression of information technology in recent years, document images have been widely disseminated on social networks. With the help of powerful image editing tools, document images are easily forged without leaving visible manipulation traces, which leads to severe issues if significant information is falsified for malicious use. Therefore, the research of document image forensics is worth further exploring. In this paper, we propose a Character Texture Perception Network (CTP-Net) to localize the forged regions in document images. Specifically, considering the characters with semantics in a document image are highly vulnerable, capturing the forgery traces is the key to localize the forged regions. We design a Character Texture Stream (CTS) based on optical character recognition to capture features of text areas that are essential components of a document image. Meanwhile, texture features of the whole document image are exploited by an Image Texture Stream (ITS). Combining the features extracted from the CTS and the ITS, the CTP-Net can reveal more subtle forgery traces from document images. Moreover, to overcome the challenge caused by the lack of fake document images, we design a data generation strategy that is utilized to construct a Fake Chinese Trademark dataset (FCTM). Experimental results on different datasets demonstrate that the proposed CTP-Net is able to localize multi-scale forged areas in document images, and outperform the state-of-the-art forgery localization methods, even though post-processing operations are applied.
    摘要

SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.02154
  • repo_url: None
  • paper_authors: Shikun Sun, Longhui Wei, Junliang Xing, Jia Jia, Qi Tian
  • for: unpaired image-to-image translation (I2I) with explicit optimization of the tangled intermediate generative distributions, via a new score-decomposed diffusion model (SDDM)
  • methods: SDDM derives manifolds that make the distributions of adjacent time steps separable and decomposes the score function or energy guidance into an image "denoising" part and a content "refinement" part, equalizing the refinement parts to enable multi-objective optimization on the manifold
  • results: SDDM outperforms existing SBDM-based methods on several I2I benchmarks while using far fewer diffusion steps
    Abstract Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistically-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decompose the score function or energy guidance into an image ``denoising" part and a content ``refinement" part. To refine the image in the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage the block adaptive instance normalization module to construct manifolds with lower dimensions but still concentrated with the perturbed reference image. SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks.
    摘要 现代分数基 diffusion 模型(SBDM)在无对照图像至图像翻译(I2I)中显示出了有前途的成绩。然而,现有的方法,可能是能量基或统计基的,没有直接提供杂乱的中间生成分布的显式形式。本工作提出了一种新的分数 decomposed diffusion model(SDDM),该模型在抽象上分解了分数函数或能量导航的杂乱部分,并在图像生成过程中显式优化这些分布。为了在同一个噪音水平上细化图像,我们将图像“净化”部分和内容“精度”部分的分数函数和能量导航的平衡化,从而实现多目标优化在抽象上。此外,我们还利用了块适应性的实例normalization模块来构建抽象上的低维度拟合分布,以便更好地翻译图像。 SDDM 在多个 I2I 测试准则上表现出了较好的成绩,并且只需要比较少的扩散步数。

Robust Self-Supervised Extrinsic Self-Calibration

  • paper_url: http://arxiv.org/abs/2308.02153
  • repo_url: None
  • paper_authors: Takayuki Kanai, Igor Vasiljevic, Vitor Guizilini, Adrien Gaidon, Rares Ambrus
  • for: autonomous vehicles and robots must operate across a wide variety of scenarios, which requires accurate and efficient camera extrinsic calibration without additional sensors
  • methods: a self-supervised extrinsic calibration method builds on monocular depth and ego-motion learning: a curriculum strategy uses velocity-supervised monocular depth and pose estimators to estimate extrinsics, then jointly learns calibration, depth and pose for a set of overlapping cameras rigidly mounted on a moving vehicle
  • results: on the multi-camera DDAD benchmark, the method self-calibrates robustly and efficiently across scenes compared with a traditional vision-based pose estimation pipeline, and extrinsic self-calibration further improves depth prediction via joint optimization
    Abstract Autonomous vehicles and robots need to operate over a wide variety of scenarios in order to complete tasks efficiently and safely. Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment, as it generates metrically scaled geometric predictions from visual data without requiring additional sensors. However, most works assume well-calibrated extrinsics to fully leverage this multi-camera setup, even though accurate and efficient calibration is still a challenging problem. In this work, we introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning. Our proposed curriculum learning strategy uses monocular depth and pose estimators with velocity supervision to estimate extrinsics, and then jointly learns extrinsic calibration along with depth and pose for a set of overlapping cameras rigidly attached to a moving vehicle. Experiments on a benchmark multi-camera dataset (DDAD) demonstrate that our method enables self-calibration in various scenes robustly and efficiently compared to a traditional vision-based pose estimation pipeline. Furthermore, we demonstrate the benefits of extrinsics self-calibration as a way to improve depth prediction via joint optimization.

Attention-Driven Lightweight Model for Pigmented Skin Lesion Detection

  • paper_url: http://arxiv.org/abs/2308.02119
  • repo_url: None
  • paper_authors: Mingzhe Hu, Xiaofeng Yang
  • for: This paper presents a lightweight pipeline for skin lesion detection, addressing the challenges of imbalanced class distribution and subtle or atypical appearances of some lesions.
  • methods: The pipeline uses a lightweight model that leverages ghosted features and the DFC attention mechanism to reduce computational complexity while maintaining high performance. The model was trained on the HAM10000 dataset, which includes various types of skin lesions, and incorporates a knowledge-based loss weighting technique, applied at both the class level and the instance level, to address class imbalance (a toy weighting sketch follows this entry).
  • results: The model achieved an accuracy of 92.4%, a precision of 84.2%, a recall of 86.9%, and an F1-score of 85.4%, with particularly strong performance in identifying Benign Keratosis-like lesions (BKL) and Nevus (NV). Despite this performance, its computational cost is considerably lower than that of some less accurate models, making it well suited to real-world applications where both accuracy and efficiency matter.
    Abstract This study presents a lightweight pipeline for skin lesion detection, addressing the challenges posed by imbalanced class distribution and subtle or atypical appearances of some lesions. The pipeline is built around a lightweight model that leverages ghosted features and the DFC attention mechanism to reduce computational complexity while maintaining high performance. The model was trained on the HAM10000 dataset, which includes various types of skin lesions. To address the class imbalance in the dataset, the synthetic minority over-sampling technique and various image augmentation techniques were used. The model also incorporates a knowledge-based loss weighting technique, which assigns different weights to the loss function at the class level and the instance level, helping the model focus on minority classes and challenging samples. This technique involves assigning different weights to the loss function on two levels - the class level and the instance level. By applying appropriate loss weights, the model pays more attention to the minority classes and challenging samples, thus improving its ability to correctly detect and classify different skin lesions. The model achieved an accuracy of 92.4%, a precision of 84.2%, a recall of 86.9%, a f1-score of 85.4% with particularly strong performance in identifying Benign Keratosis-like lesions (BKL) and Nevus (NV). Despite its superior performance, the model's computational cost is considerably lower than some models with less accuracy, making it an optimal solution for real-world applications where both accuracy and efficiency are essential.
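
The knowledge-based loss weighting operates at two levels, the class level and the instance level. The sketch below uses inverse-frequency class weights and a focal-style instance weight as one plausible instantiation; the paper's exact weighting rule is not spelled out in this digest, so treat the details (and the per-class counts) as assumptions.

```python
import torch
import torch.nn.functional as F

def knowledge_weighted_ce(logits, targets, class_weights, gamma=2.0):
    """Toy two-level loss weighting for imbalanced lesion classification.

    - Class level: a fixed per-class weight (e.g. inverse class frequency) so
      minority lesion types contribute more to the loss.
    - Instance level: harder samples (low predicted probability for the true
      class) are up-weighted, here with a focal-style factor.
    """
    log_p = F.log_softmax(logits, dim=1)
    p_true = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()   # P(true class)
    ce = F.nll_loss(log_p, targets, weight=class_weights, reduction="none")
    instance_w = (1.0 - p_true) ** gamma                              # harder -> larger
    return (instance_w * ce).mean()


if __name__ == "__main__":
    logits = torch.randn(8, 7)                   # 7 lesion classes, e.g. HAM10000
    targets = torch.randint(0, 7, (8,))
    freq = torch.tensor([300., 500., 1100., 120., 1100., 6700., 140.])  # illustrative counts
    class_w = freq.sum() / (len(freq) * freq)    # inverse-frequency class weights
    print(knowledge_weighted_ce(logits, targets, class_w))
```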

Rethinking Class Activation Maps for Segmentation: Revealing Semantic Information in Shallow Layers by Reducing Noise

  • paper_url: http://arxiv.org/abs/2308.02118
  • repo_url: None
  • paper_authors: Hang-Cheng Dong, Yuhao Jiang, Yingyan Huang, Jingxiao Liao, Bingguo Liu, Dong Ye, Guodong Liu
  • for: improving the quality of class activation maps from deep networks so that they provide better supervision for weakly supervised learning
  • methods: the paper revisits the semantic information in shallow feature maps, which retain fine-grained but noisy features, and proposes a simple gradient-based denoising method that filters non-target noise by truncating the positive gradient; the scheme can be plugged into other CAM-based methods (a toy sketch follows this entry)
  • results: extensive experiments on a weakly supervised semantic segmentation task demonstrate the effectiveness of the approach
    Abstract Class activation maps are widely used for explaining deep neural networks. Due to its ability to highlight regions of interest, it has evolved in recent years as a key step in weakly supervised learning. A major limitation to the performance of the class activation maps is the small spatial resolution of the feature maps in the last layer of the convolutional neural network. Therefore, we expect to generate high-resolution feature maps that result in high-quality semantic information. In this paper, we rethink the properties of semantic information in shallow feature maps. We find that the shallow feature maps still have fine-grained non-discriminative features while mixing considerable non-target noise. Furthermore, we propose a simple gradient-based denoising method to filter the noise by truncating the positive gradient. Our proposed scheme can be easily deployed in other CAM-related methods, facilitating these methods to obtain higher-quality class activation maps. We evaluate the proposed approach through a weakly-supervised semantic segmentation task, and a large number of experiments demonstrate the effectiveness of our approach.
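
The denoising step can be pictured as a Grad-CAM-style map computed on a shallow layer, with the gradient truncated before it weights the activations. The sketch below reflects one reading of "truncating the positive gradient" (keeping only its positive part); it is an illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def shallow_cam_with_grad_truncation(activations, gradients):
    """Toy class activation map from a shallow layer with gradient truncation.

    activations: (B, C, H, W) feature maps from a shallow layer.
    gradients:   (B, C, H, W) gradients of the class score w.r.t. those maps.
    The positive part of the gradient is kept before weighting the activations;
    this is one plausible interpretation of the abstract, used for illustration.
    """
    pos_grad = gradients.clamp(min=0)                      # truncate to positive part
    weights = pos_grad.mean(dim=(2, 3), keepdim=True)      # channel importance
    cam = F.relu((weights * activations).sum(dim=1))       # (B, H, W)
    # normalise to [0, 1] per sample
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
    return cam


if __name__ == "__main__":
    acts, grads = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
    print(shallow_cam_with_grad_truncation(acts, grads).shape)  # (2, 56, 56)
```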

Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network

  • paper_url: http://arxiv.org/abs/2308.02101
  • repo_url: None
  • paper_authors: Bryar Shareef, Min Xian, Aleksandar Vakanski, Haotian Wang
  • for: 这个研究是为了掌握乳腺超音波图像分类。
  • methods: 这个研究使用了一种混合式多任务深度神经网络,名为Hybrid-MT-ESTAN,它的架构由CNN和Swin Transformer组成。
  • results: 研究结果显示,Hybrid-MT-ESTAN在3,320幅乳腺超音波图像中的分类和分 segmentation任务中取得了最高的准确率、敏感度和F1分数,分别为82.7%, 86.4%和86.0%。
    Abstract Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification. Although convolutional neural networks (CNNs) have demonstrated reliable performance in tumor classification, they have inherent limitations for modeling global and long-range dependencies due to the localized nature of convolution operations. Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations. In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation using a hybrid architecture composed of CNNs and Swin Transformer components. The proposed approach was compared to nine BUS classification methods and evaluated using seven quantitative metrics on a dataset of 3,320 BUS images. The results indicate that Hybrid-MT-ESTAN achieved the highest accuracy, sensitivity, and F1 score of 82.7%, 86.4%, and 86.0%, respectively.

CT Reconstruction from Few Planar X-rays with Application towards Low-resource Radiotherapy

  • paper_url: http://arxiv.org/abs/2308.02100
  • repo_url: https://github.com/wanderinrain/xray2ct
  • paper_authors: Yiran Sun, Tucker Netherton, Laurence Court, Ashok Veeraraghavan, Guha Balakrishnan
  • for: 这种方法用于生成基于少量(<5)平面X射图像的计算机断层Volume,并在临床应用中进行了首次评估:放疗规划。
  • methods: 我们提出了一种深度生成模型,基于神经隐式表示来生成Volumetric CT扫描图像从少量入力平面X射图像的不同角度。
  • results: 我们在使用这种方法生成的 thoracic CT扫描图像上进行了2场对抗的、舒缓放疗规划,并发现了<1%的误差在计算机中计算的辐射剂量与临床获得的CT扫描图像中的辐射剂量之间。此外,我们的方法也比最近的稀疙CT重建基线性能更高(PSNR、SSIM、Dice分数)在公共的LIDC肺CT数据集上。
    Abstract CT scans are the standard-of-care for many clinical ailments, and are needed for treatments like external beam radiotherapy. Unfortunately, CT scanners are rare in low and mid-resource settings due to their costs. Planar X-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work, we propose a method to generate CT volumes from few (<5) planar X-ray observations using a prior data distribution, and perform the first evaluation of such a reconstruction algorithm for a clinical application: radiotherapy planning. We propose a deep generative model, building on advances in neural implicit representations to synthesize volumetric CT scans from few input planar X-ray images at different angles. To focus the generation task on clinically-relevant features, our model can also leverage anatomical guidance during training (via segmentation masks). We generated 2-field opposed, palliative radiotherapy plans on thoracic CTs reconstructed by our method, and found that isocenter radiation dose on reconstructed scans have <1% error with respect to the dose calculated on clinically acquired CTs using <=4 X-ray views. In addition, our method is better than recent sparse CT reconstruction baselines in terms of standard pixel and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset. Code is available at: https://github.com/wanderinrain/Xray2CT.
    摘要 CT扫描是现代医疗标准,用于许多临床疾病的诊断和治疗,如外部β射线治疗。然而,在LOW和中等资源设置中,CT扫描仪仍然 rare due to its high cost.在这些设置中,平面X射线成像机器更加普遍,但它们只能提供限定的2D观察,无法提供3D анатомия的全面观察。在这种情况下,我们提出了一种方法,使用先前的数据分布来生成CT卷积体从少量(<5)平面X射线图像的观察。我们采用了深度生成模型,基于神经隐式表示来生成3D CT扫描图像从平面X射线图像的不同角度。为了将生成任务关注临床有关的特征,我们的模型可以在训练过程中使用解剖指导(via分剖标签)。我们在使用我们的方法生成的肺CT扫描图像上进行了2场对称的肺癌治疗规划,并发现了辐射剂量在重建的SCANS上和临床获得的CT扫描图像上的差异小于1%。此外,我们的方法也比最近的稀疏CT重建基线更好,根据公共的LIDC肺CT数据集的标准像素级和结构级指标(PSNR、SSIM、Dice分数)。代码可以在:https://github.com/wanderinrain/Xray2CT中找到。

Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation

  • paper_url: http://arxiv.org/abs/2308.02097
  • repo_url: https://github.com/jinyuanliu-cv/segmif
  • paper_authors: Jinyuan Liu, Zhu Liu, Guanyao Wu, Long Ma, Risheng Liu, Wei Zhong, Zhongxuan Luo, Xin Fan
  • for: 这篇论文旨在提出一种多Modalities图像合并和 segmentation方法,以提高自动驾驶和机器人操作的性能。
  • methods: 该方法基于一种多InteractiveFeature学习架构,通过将多modalities的图像合并到一起,并利用双任务相互关系来提高图像合并和 segmentation 的性能。
  • results: 实验表明,该方法可以输出可观赏的合并图像,并在实际场景中提高 segmentation mIoU 值,相比之前的方法。
    Abstract Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. Early efforts focus on boosting the performance for only one task, \emph{e.g.,} fusion or segmentation, making it hard to reach~`Best of Both Worlds'. To overcome this issue, in this paper, we propose a \textbf{M}ulti-\textbf{i}nteractive \textbf{F}eature learning architecture for image fusion and \textbf{Seg}mentation, namely SegMiF, and exploit dual-task correlation to promote the performance of both tasks. The SegMiF is of a cascade structure, containing a fusion sub-network and a commonly used segmentation sub-network. By slickly bridging intermediate features between two components, the knowledge learned from the segmentation task can effectively assist the fusion task. Also, the benefited fusion network supports the segmentation one to perform more pretentiously. Besides, a hierarchical interactive attention block is established to ensure fine-grained mapping of all the vital information between two tasks, so that the modality/semantic features can be fully mutual-interactive. In addition, a dynamic weight factor is introduced to automatically adjust the corresponding weights of each task, which can balance the interactive feature correspondence and break through the limitation of laborious tuning. Furthermore, we construct a smart multi-wave binocular imaging system and collect a full-time multi-modality benchmark with 15 annotated pixel-level categories for image fusion and segmentation. Extensive experiments on several public datasets and our benchmark demonstrate that the proposed method outputs visually appealing fused images and perform averagely $7.66\%$ higher segmentation mIoU in the real-world scene than the state-of-the-art approaches. The source code and benchmark are available at \url{https://github.com/JinyuanLiu-CV/SegMiF}.
    摘要 多Modalitate的图像融合和分割在自动驾驶和机器人操作中扮演着重要的角色。初期努力主要是提高单一任务的性能,例如融合或分割,这使得达到“Best of Both Worlds”的目标变得困难。为解决这个问题,在本文中,我们提出了一种多因素互动特征学习架构,即SegMiF,并利用双任务相关性来提高两个任务的性能。SegMiF具有搅合结构,包括融合子网络和通常使用的分割子网络。通过细腻地桥接两个组件之间的中间特征,分割任务学习的知识可以有效地帮助融合任务。同时,融合网络也可以支持分割任务更加准确地进行。此外,我们还设立了一个层次互动注意力块,以确保所有重要信息之间的细腻 mapping,以便全面互动。此外,我们还引入了一个自动调整相应任务的权重因子,以平衡互动特征对应和缓解劳动劳累调整的限制。此外,我们还构建了一个智能多波普普摄影系统,并为多模态图像融合和分割建立了一个全天候多模态标准套件,包括15个注释的像素级分类。经过对多个公共数据集和我们的标准套件进行广泛的实验,我们发现提出的方法可以输出视觉吸引人的融合图像,并在实际场景中的分割mIoU平均提高7.66%。源代码和标准套件可以在获取。
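
The dynamic weight factor that balances the fusion and segmentation losses can be emulated with a learnable weighting. Its exact form is not given in this digest, so the sketch below substitutes the common learnable log-variance (homoscedastic uncertainty) weighting as a stand-in.

```python
import torch
import torch.nn as nn

class DynamicTaskWeighting(nn.Module):
    """Toy dynamic weighting of the fusion and segmentation losses.

    The learnable log-variance weighting used here is a generic stand-in for
    the paper's dynamic weight factor, not its actual formulation.
    """
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: sequence of scalar task losses, e.g. [L_fusion, L_segmentation]
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total


if __name__ == "__main__":
    weighting = DynamicTaskWeighting()
    l_fusion, l_seg = torch.tensor(0.8), torch.tensor(1.4)
    print(weighting([l_fusion, l_seg]))
```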

Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images

  • paper_url: http://arxiv.org/abs/2308.02062
  • repo_url: https://github.com/alessandro-f/dif-fuse
  • paper_authors: Alessandro Fontanella, Grant Mair, Joanna Wardlaw, Emanuele Trucco, Amos Storkey
  • for: generating healthy counterfactuals of diseased brain images, which can enrich radiologists' training files, improve the interpretability of segmentation models, and yield pixel-wise anomaly maps
  • methods: a weakly supervised pipeline first obtains a saliency map that approximately covers the pathological region (via ACAT), then edits only that region with a diffusion model trained on healthy samples, combining DDPM inside the saliency map with DDIM outside it at every sampling step so that the rest of the anatomy is preserved (a toy fusion sketch follows this entry)
  • results: on IST-3 for stroke lesion segmentation and BraTS2021 for brain tumour segmentation, the approach improves the DICE score of the best competing weakly supervised method from 0.6534 to 0.7056
    Abstract Segmentation masks of pathological areas are useful in many medical applications, such as brain tumour and stroke management. Moreover, healthy counterfactuals of diseased images can be used to enhance radiologists' training files and to improve the interpretability of segmentation models. In this work, we present a weakly supervised method to generate a healthy version of a diseased image and then use it to obtain a pixel-wise anomaly map. To do so, we start by considering a saliency map that approximately covers the pathological areas, obtained with ACAT. Then, we propose a technique that allows to perform targeted modifications to these regions, while preserving the rest of the image. In particular, we employ a diffusion model trained on healthy samples and combine Denoising Diffusion Probabilistic Model (DDPM) and Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process. DDPM is used to modify the areas affected by a lesion within the saliency map, while DDIM guarantees reconstruction of the normal anatomy outside of it. The two parts are also fused at each timestep, to guarantee the generation of a sample with a coherent appearance and a seamless transition between edited and unedited parts. We verify that when our method is applied to healthy samples, the input images are reconstructed without significant modifications. We compare our approach with alternative weakly supervised methods on IST-3 for stroke lesion segmentation and on BraTS2021 for brain tumour segmentation, where we improve the DICE score of the best competing method from $0.6534$ to $0.7056$.
    摘要 干将疾病区域分割的分割面貌是医学应用中非常有用的,例如脑肿瘤和中风管理。此外,健康的对比样本可以用于增强放射学家的训练文件,并提高分割模型的可读性。在这种情况下,我们提出了一种弱监督的方法,可以生成一个健康版本的疾病图像,并使用其生成一个像素精度的异常地图。我们开始是通过考虑一个ACAT获得的病理区域精度的报告来,然后我们提出了一种可以在这些区域进行目标 modify 的技术,保持图像的其他部分不变。特别是,我们使用了训练在健康样本上的扩散模型,并将DDPM和DDIM相互融合在每个抽样过程中。DDPM用于在病理区域内的精度报告中修改病理区域,而DDIM确保在病理区域外的正常解剖结构的重建。这两个部分也在每个时间步骤中融合,以保证生成一个具有净合的外观和无缝过渡的编辑和未编辑部分。我们证明,当我们的方法应用于健康样本时,输入图像不会经受显著的修改。我们与其他弱监督方法进行比较,在IST-3上对stroke lesion segmentation和BraTS2021上对脑肿瘤分割进行比较,我们提高了最佳竞争方法的DICE分数从0.6534提高到0.7056。
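
The key step is the per-timestep composition of the DDPM branch (free to edit the salient, pathological region) with the DDIM branch (reconstructing normal anatomy elsewhere). The sketch below shows only that compositing and the resulting anomaly map; the samplers, the ACAT saliency model and all noise schedules are omitted and assumed to exist.

```python
import numpy as np

def fuse_step(x_ddpm, x_ddim, saliency_mask):
    """Toy per-timestep fusion used when generating a healthy counterfactual.

    x_ddpm:        current-timestep sample from the DDPM branch, which may
                   modify the (pathological) salient region.
    x_ddim:        current-timestep sample from the DDIM branch, which
                   reconstructs the normal anatomy outside that region.
    saliency_mask: values in [0, 1], close to 1 inside the suspected lesion.
    """
    return saliency_mask * x_ddpm + (1.0 - saliency_mask) * x_ddim


def anomaly_map(diseased_image, healthy_counterfactual):
    """Pixel-wise anomaly map as the absolute difference to the counterfactual."""
    return np.abs(diseased_image - healthy_counterfactual)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_p, x_i = rng.normal(size=(1, 128, 128)), rng.normal(size=(1, 128, 128))
    mask = np.zeros((1, 128, 128))
    mask[:, 40:80, 40:80] = 1.0
    fused = fuse_step(x_p, x_i, mask)
    print(fused.shape, anomaly_map(x_i, fused).max() >= 0)
```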

UGainS: Uncertainty Guided Anomaly Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.02046
  • repo_url: https://github.com/kumuji/ugains
  • paper_authors: Alexey Nekrasov, Alexander Hermans, Lars Kuhnert, Bastian Leibe
  • for: reliable detection of anomalous objects on the road, a stepping stone towards safe and reliable autonomous driving
  • methods: an out-of-distribution segmentation model identifies uncertain regions, which are then used to guide a strong generalist segmentation model to produce accurate anomaly instance masks; the strong object priors of the generalist model also improve per-pixel anomaly segmentation (a toy prompting sketch follows this entry)
  • results: the approach outperforms current pixel-level anomaly segmentation methods, reaching an AP of 80.08% on the Fishyscapes Lost and Found and 88.98% on the RoadAnomaly validation sets. Project page: https://vision.rwth-aachen.de/ugains
    Abstract A single unexpected object on the road can cause an accident or may lead to injuries. To prevent this, we need a reliable mechanism for finding anomalous objects on the road. This task, called anomaly segmentation, can be a stepping stone to safe and reliable autonomous driving. Current approaches tackle anomaly segmentation by assigning an anomaly score to each pixel and by grouping anomalous regions using simple heuristics. However, pixel grouping is a limiting factor when it comes to evaluating the segmentation performance of individual anomalous objects. To address the issue of grouping multiple anomaly instances into one, we propose an approach that produces accurate anomaly instance masks. Our approach centers on an out-of-distribution segmentation model for identifying uncertain regions and a strong generalist segmentation model for anomaly instances segmentation. We investigate ways to use uncertain regions to guide such a segmentation model to perform segmentation of anomalous instances. By incorporating strong object priors from a generalist model we additionally improve the per-pixel anomaly segmentation performance. Our approach outperforms current pixel-level anomaly segmentation methods, achieving an AP of 80.08% and 88.98% on the Fishyscapes Lost and Found and the RoadAnomaly validation sets respectively. Project page: https://vision.rwth-aachen.de/ugains
    摘要 一个不期望的对象在路上可能会导致事故或伤害。为了防止这种情况,我们需要一种可靠的机制来检测路上异常对象。这项任务被称为异常分割,可以作为自动驾驶安全可靠的一个步骤。现有的方法对异常分割采用分配异常分数到每个像素点和使用简单的规则来组合异常区域。然而,像素点组合是评估异常分割性能的限制因素。为了解决多个异常实例被一起分组的问题,我们提出一种方法,可以生成准确的异常实例面积。我们的方法围绕着非常区分分数模型和一个强大的通用模型来实现异常实例分割。我们调查了如何使用不确定区域来引导这种分割模型进行异常实例分割。通过将强大的物体先验从通用模型integrated,我们还提高了每个像素点的异常分割性能。我们的方法在当前像素级异常分割方法的基础上提高了AP值,达到80.08%和88.98%在Fishyscapes Lost and Found和RoadAnomaly验证集上。项目页面:https://vision.rwth-aachen.de/ugains
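
One way to read "using uncertain regions to guide a generalist segmentation model" is to turn high-uncertainty pixels into point prompts for a promptable segmenter. The sketch below does exactly that with an arbitrary threshold and a dummy segmenter; both are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def sample_point_prompts(uncertainty, threshold=0.7, num_points=5, seed=0):
    """Toy prompt sampling from an out-of-distribution / uncertainty map.

    Pixels whose uncertainty exceeds `threshold` are treated as candidate
    anomaly regions; a few of them are sampled as point prompts for a
    promptable generalist segmenter (e.g. a SAM-like model, not implemented here).
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(uncertainty > threshold)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(len(ys), size=min(num_points, len(ys)), replace=False)
    return np.stack([ys[idx], xs[idx]], axis=1)  # (num_points, 2) in (row, col)


def segment_anomaly_instances(image, uncertainty, promptable_segmenter):
    """Feed uncertainty-guided point prompts to any promptable instance segmenter."""
    prompts = sample_point_prompts(uncertainty)
    return [promptable_segmenter(image, tuple(p)) for p in prompts]


if __name__ == "__main__":
    unc = np.random.rand(64, 64)
    img = np.random.rand(64, 64, 3)
    dummy_segmenter = lambda im, pt: (np.zeros(im.shape[:2], dtype=bool), pt)
    masks = segment_anomaly_instances(img, unc, dummy_segmenter)
    print(len(masks), masks[0][1])
```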

ETran: Energy-Based Transferability Estimation

  • paper_url: http://arxiv.org/abs/2308.02027
  • repo_url: None
  • paper_authors: Mohsen Gholami, Mohammad Akbari, Xinglu Wang, Behnam Kamranian, Yong Zhang
  • for: ranking pre-trained models for object detection and image classification without expensive fine-tuning
  • methods: ETran, an energy-based transferability assessment metric, combines three scores: an energy score, a classification score and a regression score; the energy score uses energy-based models to judge whether the target dataset is in-distribution (IND) or out-of-distribution (OOD) for a pre-trained model (a toy energy-score sketch follows this entry). Unlike prior work, ETran applies to classification, regression and object detection (classification + regression)
  • results: across four benchmarks and two tasks, ETran outperforms previous work by an average of 21% on object detection and 12% on classification benchmarks, achieving state-of-the-art transferability assessment
    Abstract This paper addresses the problem of ranking pre-trained models for object detection and image classification. Selecting the best pre-trained model by fine-tuning is an expensive and time-consuming task. Previous works have proposed transferability estimation based on features extracted by the pre-trained models. We argue that quantifying whether the target dataset is in-distribution (IND) or out-of-distribution (OOD) for the pre-trained model is an important factor in the transferability estimation. To this end, we propose ETran, an energy-based transferability assessment metric, which includes three scores: 1) energy score, 2) classification score, and 3) regression score. We use energy-based models to determine whether the target dataset is OOD or IND for the pre-trained model. In contrast to the prior works, ETran is applicable to a wide range of tasks including classification, regression, and object detection (classification+regression). This is the first work that proposes transferability estimation for object detection task. Our extensive experiments on four benchmarks and two tasks show that ETran outperforms previous works on object detection and classification benchmarks by an average of 21% and 12%, respectively, and achieves SOTA in transferability assessment.
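
The energy score at the heart of ETran can be illustrated with the standard free energy of the logits, -T * logsumexp(logits / T): lower average energy suggests the target data look in-distribution for the pre-trained model. The linear head and temperature below are illustrative assumptions; they do not reproduce ETran's full recipe, which also includes classification and regression scores.

```python
import torch

def energy_score(features, classifier_weight, temperature=1.0):
    """Toy energy score for judging whether target data are in-distribution.

    features:          (N, D) target-set features from the pre-trained model.
    classifier_weight: (C, D) a linear head mapping features to C logits.
    The mean free energy over the target set gives a scalar that can be
    combined with other scores to rank pre-trained checkpoints.
    """
    logits = features @ classifier_weight.t()                      # (N, C)
    energy = -temperature * torch.logsumexp(logits / temperature, dim=1)
    return energy.mean()


if __name__ == "__main__":
    feats = torch.randn(100, 512)   # stand-in target features
    head = torch.randn(10, 512)     # stand-in 10-class linear head
    print(energy_score(feats, head))
```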

Explainable unsupervised multi-modal image registration using deep networks

  • paper_url: http://arxiv.org/abs/2308.01994
  • repo_url: None
  • paper_authors: Chengjia Wang, Giorgos Papanastasiou
  • for: deep-learning-based multi-modal and multi-organ MRI image registration to support clinical decision making
  • methods: the method registers multiple MRI sequences (defined as 'modalities') and incorporates Grad-CAM-based explainability frameworks into each major component of the unsupervised registration pipeline to interpret model-data behaviour against the transformation fields
  • results: building on earlier work that surpassed the standard SyN baseline, this work shows that the DL model becomes fully explainable, setting a framework for generalizing the approach to further medical imaging data
    Abstract Clinical decision making from magnetic resonance imaging (MRI) combines complementary information from multiple MRI sequences (defined as 'modalities'). MRI image registration aims to geometrically 'pair' diagnoses from different modalities, time points and slices. Both intra- and inter-modality MRI registration are essential components in clinical MRI settings. Further, an MRI image processing pipeline that can address both afine and non-rigid registration is critical, as both types of deformations may be occuring in real MRI data scenarios. Unlike image classification, explainability is not commonly addressed in image registration deep learning (DL) methods, as it is challenging to interpet model-data behaviours against transformation fields. To properly address this, we incorporate Grad-CAM-based explainability frameworks in each major component of our unsupervised multi-modal and multi-organ image registration DL methodology. We previously demonstrated that we were able to reach superior performance (against the current standard Syn method). In this work, we show that our DL model becomes fully explainable, setting the framework to generalise our approach on further medical imaging data.
    摘要 临床决策从核磁共振成像(MRI)结合多种MRI序列(定义为“模态”)的信息。MRI图像匹配目标是将不同模态、时间点和切片的诊断“匹配”在一起。在临床MRI Setting中,内部和间部MRI匹配都是重要组成部分。此外,一个能够处理both afine和非RIGID匹配的MRI图像处理管道是关键,因为这两种类型的变形都可能出现在实际MRI数据enario中。与图像分类不同,explainability不是通常在图像匹配深度学习(DL)方法中被考虑,因为很难从变换场景中解释模型-数据行为。为了正确地 Addressing this, we incorporate Grad-CAM-based explainability frameworks in each major component of our unsupervised multi-modal and multi-organ image registration DL methodology. In our previous work, we demonstrated that our DL model could reach superior performance against the current standard Syn method. In this work, we show that our DL model becomes fully explainable, setting the framework for generalizing our approach to further medical imaging data.

Predicting Ki67, ER, PR, and HER2 Statuses from H&E-stained Breast Cancer Images

  • paper_url: http://arxiv.org/abs/2308.01982
  • repo_url: None
  • paper_authors: Amir Akbarnejad, Nilanjan Ray, Penny J. Barnes, Gilbert Bigras
  • for: determining whether machine learning methods can accurately predict molecular (protein) information from histomorphology (H&E) images alone
  • methods: a large-scale dataset (185,538 images) with reliable Ki67, ER, PR and HER2 measurements was built from registered (mirrored) pairs of H&E and immunohistochemistry (IHC) images; pairs with artifacts (tissue folding, bubbles, etc.) were manually inspected and discarded. Ki67, ER and PR labels were obtained by computing the H-Score from image analysis (a toy H-Score sketch follows this entry), while HER2 was treated as a binary classification (0/1+ vs 3+, excluding equivocal 2+ cases)
  • results: a standard ViT-based pipeline reaches around 90% AUC when trained with a proper labelling protocol, and the trained classifiers localize relevant regions, which motivates future work on improving localization. The dataset is publicly available: https://ihc4bc.github.io/
    Abstract Despite the advances in machine learning and digital pathology, it is not yet clear if machine learning methods can accurately predict molecular information merely from histomorphology. In a quest to answer this question, we built a large-scale dataset (185538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset is composed of mirrored images of H\&E and corresponding images of immunohistochemistry (IHC) assays (Ki67, ER, PR, and HER2. These images are mirrored through registration. To increase reliability, individual pairs were inspected and discarded if artifacts were present (tissue folding, bubbles, etc). Measurements for Ki67, ER and PR were determined by calculating H-Score from image analysis. HER2 measurement is based on binary classification: 0 and 1+ (IHC scores representing a negative subset) vs 3+ (IHC score positive subset). Cases with IHC equivocal score (2+) were excluded. We show that a standard ViT-based pipeline can achieve prediction performances around 90% in terms of Area Under the Curve (AUC) when trained with a proper labeling protocol. Finally, we shed light on the ability of the trained classifiers to localize relevant regions, which encourages future work to improve the localizations. Our proposed dataset is publicly available: https://ihc4bc.github.io/
    摘要 尽管机器学习和数字 PATHOLOGY 技术有所进步,但是还没有清楚地表明机器学习方法可以准确地预测分子信息仅通过 histomorphology。为了回答这个问题,我们建立了一个大规模数据集(185538张图像),其中包含可靠的测量数据 для Ki67、ER、PR 和 HER2 状态。该数据集由 H\&E 和相关的免疫抑制试验(IHC)图像组成(Ki67、ER、PR 和 HER2),这些图像通过注册进行镜像。为了增加可靠性,我们对每个对应的图像进行了检查,并且如果存在遗留物(如组织卷绕、气泡等),那么将其排除。我们通过计算 H-Score 来确定 Ki67、ER 和 PR 的测量,而 HER2 的测量则基于二分类:0和1+(IHC 分数表示负集)vs 3+(IHC 分数正集)。我们排除了 IHC 不明确分数(2+)的案例。我们显示,使用标准 ViT-based 管道可以在训练时以 около 90% 的 AUC 性能进行预测。最后,我们探讨了训练的分类器所能够启发的相关区域,这引发了未来工作的改进局部化。我们所建立的数据集现在公开可用:https://ihc4bc.github.io/
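
The H-Score used to label Ki67/ER/PR is a standard weighted sum of staining-intensity fractions. A minimal sketch, assuming the per-cell intensity fractions come from an upstream image-analysis step:

```python
def h_score(frac_weak, frac_moderate, frac_strong):
    """H-Score from IHC image analysis, as used for the Ki67/ER/PR labels.

    frac_*: fractions (0-1) of cells staining at weak (1+), moderate (2+) and
    strong (3+) intensity. The score is 1*%weak + 2*%moderate + 3*%strong and
    ranges from 0 to 300.
    """
    assert 0.0 <= frac_weak + frac_moderate + frac_strong <= 1.0 + 1e-9
    return 100.0 * (1 * frac_weak + 2 * frac_moderate + 3 * frac_strong)


if __name__ == "__main__":
    # e.g. 20% weak, 30% moderate, 10% strong positive cells
    print(h_score(0.20, 0.30, 0.10))  # 110.0
```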

CartiMorph: a framework for automated knee articular cartilage morphometrics

  • paper_url: http://arxiv.org/abs/2308.01981
  • repo_url: https://github.com/yongchengyao/cartimorph
  • paper_authors: Yongcheng Yao, Junru Zhong, Liping Zhang, Sheheryar Khan, Weitian Chen
  • for: an automated framework for knee articular cartilage morphometrics that quantifies cartilage thickness, surface area, volume and full-thickness cartilage loss (FCL) to support knee osteoarthritis assessment
  • methods: deep learning models handle tissue segmentation, template construction and template-to-image registration; on top of these, the framework performs surface-normal-based cartilage thickness mapping, FCL estimation and rule-based cartilage parcellation (a toy thickness/FCL sketch follows this entry)
  • results: the thickness map shows low error in thin and peripheral regions; metrics from model segmentation agree closely with manual segmentation (FCL root-mean-squared deviation below 8%; Pearson's r of 0.82-0.97 for mean thickness, 0.82-0.98 for surface area and 0.89-0.98 for volume), FCL measurements deviate less from the ground truth than a previous study, and the rule-based parcellation outperforms an atlas-based approach
    Abstract We introduce CartiMorph, a framework for automated knee articular cartilage morphometrics. It takes an image as input and generates quantitative metrics for cartilage subregions, including the percentage of full-thickness cartilage loss (FCL), mean thickness, surface area, and volume. CartiMorph leverages the power of deep learning models for hierarchical image feature representation. Deep learning models were trained and validated for tissue segmentation, template construction, and template-to-image registration. We established methods for surface-normal-based cartilage thickness mapping, FCL estimation, and rule-based cartilage parcellation. Our cartilage thickness map showed less error in thin and peripheral regions. We evaluated the effectiveness of the adopted segmentation model by comparing the quantitative metrics obtained from model segmentation and those from manual segmentation. The root-mean-squared deviation of the FCL measurements was less than 8%, and strong correlations were observed for the mean thickness (Pearson's correlation coefficient $\rho \in [0.82,0.97]$), surface area ($\rho \in [0.82,0.98]$) and volume ($\rho \in [0.89,0.98]$) measurements. We compared our FCL measurements with those from a previous study and found that our measurements deviated less from the ground truths. We observed superior performance of the proposed rule-based cartilage parcellation method compared with the atlas-based approach. CartiMorph has the potential to promote imaging biomarkers discovery for knee osteoarthritis.
    摘要 我们介绍CartiMorph,一个数据科学框架,用于自动脚关节软骨形态分析。它可以将脚关节软骨影像作为输入,生成软骨各个子区域的量化指标,包括软骨全厚度损伤率(FCL)、平均厚度、表面积和体积。CartiMorph充分利用深度学习模型来表示图像特征 hierarchy。我们训练了深度学习模型并进行验证,以进行组织分类、模板建立和模板与图像对接。我们开发了基于表面法向的软骨厚度对应、FCL估计和规律基于的软骨分割方法。我们的软骨厚度图示在薄和 périphérique 区域中表现较低的误差。我们评估了运用数据科学模型进行分类的效果,与手动分类结果进行比较。在FCL量化中,根mean-squared deviation的误差小于8%,且在平均厚度(Pearson's correlation coefficient $\rho \in [0.82,0.97]$)、表面积(Pearson's correlation coefficient $\rho \in [0.82,0.98]$)和体积(Pearson's correlation coefficient $\rho \in [0.89,0.98]$)量化中都 observe strong correlation。我们与之前的研究比较FCL量化结果,发现我们的量化结果与真实值更接近。我们发现了规律基于的软骨分割方法比Atlas-based方法表现更好。CartiMorph具有推广镜影像生物 markers的潜力。
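
Surface-normal-based thickness mapping and the FCL percentage can be illustrated with a brute-force point-cloud version: for each inner-surface vertex, search the articular surface roughly along the outward normal and take the nearest hit as the local thickness. The angular tolerance and the zero-thickness threshold below are assumptions for illustration only, not CartiMorph's actual algorithm.

```python
import numpy as np

def thickness_along_normals(inner_pts, inner_normals, outer_pts, max_angle_deg=30.0):
    """Toy surface-normal-based cartilage thickness.

    For every vertex on the inner (bone-cartilage) surface, look for outer
    (articular) surface points lying roughly along the outward normal and take
    the closest one's distance as the local thickness.
    """
    cos_thr = np.cos(np.deg2rad(max_angle_deg))
    thickness = np.zeros(len(inner_pts))
    for i, (p, n) in enumerate(zip(inner_pts, inner_normals)):
        d = outer_pts - p                                   # (M, 3)
        dist = np.linalg.norm(d, axis=1) + 1e-12
        cos = (d @ n) / dist                                # alignment with the normal
        candidates = dist[cos > cos_thr]
        thickness[i] = candidates.min() if len(candidates) else 0.0
    return thickness


def full_thickness_loss_percentage(thickness, min_thickness=1e-3):
    """FCL%: share of the (template-defined) region where no cartilage remains."""
    return 100.0 * np.mean(thickness < min_thickness)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inner = rng.uniform(size=(200, 3))
    normals = np.tile([0.0, 0.0, 1.0], (200, 1))
    outer = inner + np.array([0.0, 0.0, 0.002]) + rng.normal(0, 1e-4, (200, 3))
    t = thickness_along_normals(inner, normals, outer)
    print(t.mean(), full_thickness_loss_percentage(t))
```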

Unmasking Parkinson’s Disease with Smile: An AI-enabled Screening Framework

  • paper_url: http://arxiv.org/abs/2308.02588
  • repo_url: None
  • paper_authors: Tariq Adnan, Md Saiful Islam, Wasifur Rahman, Sangwu Lee, Sutapa Dey Tithi, Kazi Noshin, Imran Sarker, M Saifur Rahman, Ehsan Hoque
  • for: developing a micro-expression-based, AI-enabled screening method for Parkinson's disease (PD) to improve the reliability and accessibility of PD screening
  • methods: 3,871 videos from 1,059 unique participants (including 256 self-reported PD patients), collected from participants' homes across multiple countries, a clinic and a US PD care facility, are analysed; facial landmarks and action units yield features related to hypomimia (reduced facial expression), on which an ensemble of AI models is trained
  • results: the ensemble achieves 89.7% accuracy and an AUROC of 89.3% on held-out data, with no detectable bias across sex and ethnicity; features from the smiling videos alone give comparable performance, even on two external test sets, suggesting PD risk assessment from smiling selfie videos is feasible
    Abstract Parkinson's disease (PD) diagnosis remains challenging due to lacking a reliable biomarker and limited access to clinical care. In this study, we present an analysis of the largest video dataset containing micro-expressions to screen for PD. We collected 3,871 videos from 1,059 unique participants, including 256 self-reported PD patients. The recordings are from diverse sources encompassing participants' homes across multiple countries, a clinic, and a PD care facility in the US. Leveraging facial landmarks and action units, we extracted features relevant to Hypomimia, a prominent symptom of PD characterized by reduced facial expressions. An ensemble of AI models trained on these features achieved an accuracy of 89.7% and an Area Under the Receiver Operating Characteristic (AUROC) of 89.3% while being free from detectable bias across population subgroups based on sex and ethnicity on held-out data. Further analysis reveals that features from the smiling videos alone lead to comparable performance, even on two external test sets the model has never seen during training, suggesting the potential for PD risk assessment from smiling selfie videos.

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

  • paper_url: http://arxiv.org/abs/2308.01979
  • repo_url: https://github.com/cse-ai-lab/RealCQA
  • paper_authors: Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, Venu Govindaraju
  • for: addressing the challenges of comprehending and extracting data from chart visualizations within documents
  • methods: a benchmark and dataset for chart visual question answering on real-world charts from scientific literature, with a novel taxonomy for template-based chart question creation and a new 'list' answer type in both ranked and unranked variants
  • results: experiments on a real-world out-of-distribution dataset provide a robust evaluation of large-scale pre-trained models and establish a test-bed for chart visual QA and first-order logic verification
    Abstract We present a comprehensive study of chart visual question-answering(QA) task, to address the challenges faced in comprehending and extracting data from chart visualizations within documents. Despite efforts to tackle this problem using synthetic charts, solutions are limited by the shortage of annotated real-world data. To fill this gap, we introduce a benchmark and dataset for chart visual QA on real-world charts, offering a systematic analysis of the task and a novel taxonomy for template-based chart question creation. Our contribution includes the introduction of a new answer type, 'list', with both ranked and unranked variations. Our study is conducted on a real-world chart dataset from scientific literature, showcasing higher visual complexity compared to other works. Our focus is on template-based QA and how it can serve as a standard for evaluating the first-order logic capabilities of models. The results of our experiments, conducted on a real-world out-of-distribution dataset, provide a robust evaluation of large-scale pre-trained models and advance the field of chart visual QA and formal logic verification for neural networks in general.
    摘要 我们进行了一项全面的研究,探讨了图表视觉问答(QA)任务,以解决在文档中理解和提取图表视觉数据时所遇到的挑战。尽管有尝试使用合成图表来解决这个问题,但解决方案受到实际数据缺乏答案的限制。为了填补这个差距,我们提出了一个基准和图表视觉QA实际数据集,并提供了一种系统性的分析和一种新的模板基于的图表问题创建分类法。我们的贡献包括引入一种新的答案类型,“列表”,其中包括排序和无排序两种变种。我们的研究在科学文献中采集的实际图表数据集上进行,这个数据集的视觉复杂性较高于其他工作。我们的重点是模板基于的QA和如何使其成为评估模型首觉逻辑能力的标准。实验结果,在一个实际 OUT-OF-distribution 数据集上进行,为大规模预训练模型提供了robust的评估,并为图表视觉QA和形式逻辑验证 для神经网络在通过总的提升了领域。

Synthesising Rare Cataract Surgery Samples with Guided Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.02587
  • repo_url: https://github.com/meclabtuda/catasynth
  • paper_authors: Yannik Frisch, Moritz Fuchs, Antoine Sanner, Felix Anton Ucar, Marius Frenzel, Joana Wasielica-Poslednik, Adrian Gericke, Felix Mathias Wagner, Thomas Dratsch, Anirban Mukhopadhyay
  • for: addressing the difficulty of gathering and annotating data for training automated assistance systems for cataract surgery, by analysing cataract surgery video data and synthesising diverse, high-quality examples of surgical phases and tool usage
  • methods: a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG) synthesises realistic examples conditioned on complex multi-class, multi-label conditions such as surgical phases and tool combinations, easing the data sparsity of the downstream tool classification task (a toy CFG sketch follows this entry)
  • results: the synthesised samples show tools that the classifier recognises and are hard to distinguish from real images even for clinical experts with more than five years of experience; synthetically extended data improve the tool classifier by up to 10% on rare cases, and the synthetic data are made publicly available
    Abstract Cataract surgery is a frequently performed procedure that demands automation and advanced assistance systems. However, gathering and annotating data for training such systems is resource intensive. The publicly available data also comprises severe imbalances inherent to the surgical process. Motivated by this, we analyse cataract surgery video data for the worst-performing phases of a pre-trained downstream tool classifier. The analysis demonstrates that imbalances deteriorate the classifier's performance on underrepresented cases. To address this challenge, we utilise a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG). Our model can synthesise diverse, high-quality examples based on complex multi-class multi-label conditions, such as surgical phases and combinations of surgical tools. We affirm that the synthesised samples display tools that the classifier recognises. These samples are hard to differentiate from real images, even for clinical experts with more than five years of experience. Further, our synthetically extended data can improve the data sparsity problem for the downstream task of tool classification. The evaluations demonstrate that the model can generate valuable unseen examples, allowing the tool classifier to improve by up to 10% for rare cases. Overall, our approach can facilitate the development of automated assistance systems for cataract surgery by providing a reliable source of realistic synthetic data, which we make available for everyone.
    摘要 喷洗手术是一种非常常见的手术程序,需要自动化和高级帮助系统。然而,收集和标注数据 для训练这些系统是资源占用的。公共可用数据也存在严重的不均衡问题,这些问题会影响下游工具分类器的性能。为了解决这个挑战,我们分析了喷洗手术视频数据,并发现异常性对下游工具分类器的性能有负面影响。为了解决这个问题,我们使用了基于零噪扩散模型(DDIM)和无类标注指导(CFG)的冲激生成模型。我们的模型可以生成多样化、高质量的示例,包括手术阶段和手术工具的复杂多类多标签条件。我们证明了这些生成的样例中的工具都是分类器认可的。这些样例与真实图像很难分辨,即使是临床专家超过5年的经验。此外,我们通过生成的数据扩展,可以解决下游工具分类器的数据稀缺问题。评估表明,我们的模型可以生成有价值的未见样例,使工具分类器提高至10%。总之,我们的方法可以促进喷洗手术自动化系统的开发,提供一个可靠的真实synthetic数据源,我们将其公开给大家。
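
Classifier-Free Guidance combines conditional and unconditional noise predictions before each (DDIM) update. A minimal sketch, assuming a model interface model(x_t, t, cond) that returns the predicted noise and treating the guidance scale as a free parameter; the interface and the condition format are assumptions, not the released code.

```python
import torch

def cfg_noise_prediction(model, x_t, t, cond, guidance_scale=3.0):
    """Toy classifier-free guidance (CFG) step for conditional image synthesis.

    Passing cond=None is assumed to give the unconditional prediction.
    The guided estimate is
        eps = eps_uncond + w * (eps_cond - eps_uncond),
    which would then be used inside a DDIM update (not shown).
    """
    eps_uncond = model(x_t, t, None)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    # A stand-in "model" that ignores its inputs, just to exercise the function.
    dummy = lambda x, t, c: torch.zeros_like(x) if c is None else torch.ones_like(x)
    x = torch.randn(1, 3, 64, 64)
    cond = {"phase": 4, "tools": [1, 7]}   # e.g. surgical phase + tool combination
    print(cfg_noise_prediction(dummy, x, torch.tensor([10]), cond).mean())  # ~3.0
```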

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

  • paper_url: http://arxiv.org/abs/2308.01907
  • repo_url: https://github.com/opengvlab/all-seeing
  • paper_authors: Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao
  • for: the All-Seeing (AS) project develops a large-scale dataset and model for recognizing and understanding everything in the open world
  • methods: a scalable data engine that incorporates human feedback and efficient models in the loop creates the AS-1B dataset, with over 1 billion regions annotated with semantic tags, question-answer pairs and detailed captions, covering 3.5 million common and rare concepts described by 132.2 billion tokens
  • results: the All-Seeing model (ASM), a unified framework trained with open-ended language prompts and locations, generalizes to region-text retrieval, region recognition, captioning and question answering with remarkable zero-shot performance
    Abstract We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.
    摘要 我们介绍《全见计划》(AS)项目:一个大规模数据和模型,旨在认知和理解开放世界中的一切。我们使用可扩展的数据引擎,并将人类反馈和高效的模型纳入循环,创建了一个新的数据集(AS-1B),包含超过10亿个区域,每个区域均被标注为Semantic tag、问题对答对和详细描述。这些数据覆盖了350万个常见和罕见的世界现象,并有132.2亿个字符来描述这些概念和其属性。基于这个新的数据集,我们开发了《全见模型》(ASM),一个统一的视觉认知和理解框架。该模型通过使用开放语言提示和位置训练,能够通过Zero-shot学习来解决多种视觉和语言任务,包括区域文本检索、区域识别、描述和问题回答。我们希望这个项目可以成为视觉语言人工智能研究的基础。模型和数据将在GitHub上发布,demo可以在Hugging Face上看到。

DETR Doesn’t Need Multi-Scale or Locality Design

  • paper_url: http://arxiv.org/abs/2308.01904
  • repo_url: https://github.com/impiga/plain-detr
  • paper_authors: Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu
  • for: improving the DETR detector while keeping it "plain": a single-scale feature map and global cross-attention without multi-scale or locality-specific designs
  • methods: two simple techniques compensate for the missing multi-scale feature maps and locality constraints: a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to its object region while keeping encoding flexibility (a toy BoxRPB sketch follows this entry), and masked image modeling (MIM)-based backbone pre-training, which provides representations with fine-grained localization ability
  • results: together with recent advances in training and problem formulation, the improved "plain" DETR achieves 63.9 mAP with a Swin-L backbone when pre-trained on Object365, highly competitive with state-of-the-art detectors that rely heavily on multi-scale feature maps and region-based feature extraction
    Abstract This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for remedying dependencies on the multi-scale feature maps. By incorporating these technologies and recent advancements in training and problem formation, the improved "plain" DETR showed exceptional improvements over the original DETR detector. By leveraging the Object365 dataset for pre-training, it achieved 63.9 mAP accuracy using a Swin-L backbone, which is highly competitive with state-of-the-art detectors which all heavily rely on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR .
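
The box-to-pixel relative position bias adds, to every cross-attention logit, a term computed from the geometric relation between a query's box and each key pixel. The sketch below maps the signed distances to the four box edges through a small MLP to a per-head bias; the MLP size and parameterisation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BoxRPB(nn.Module):
    """Toy box-to-pixel relative position bias (BoxRPB) for DETR cross-attention.

    For each query box (cx, cy, w, h) and each key pixel location, the signed
    distances from the pixel to the box's left/right/top/bottom edges are mapped
    by a small MLP to an additive attention bias, so a query mainly attends to
    its own box region.
    """
    def __init__(self, num_heads=8, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, num_heads))

    def forward(self, boxes, pixel_xy):
        # boxes: (Q, 4) as (cx, cy, w, h) in [0, 1]; pixel_xy: (HW, 2) in [0, 1]
        cx, cy, w, h = boxes.unbind(-1)                 # (Q,)
        px, py = pixel_xy.unbind(-1)                    # (HW,)
        dl = px[None, :] - (cx - w / 2)[:, None]        # distance to left edge
        dr = (cx + w / 2)[:, None] - px[None, :]
        dt = py[None, :] - (cy - h / 2)[:, None]
        db = (cy + h / 2)[:, None] - py[None, :]
        rel = torch.stack([dl, dr, dt, db], dim=-1)     # (Q, HW, 4)
        return self.mlp(rel).permute(2, 0, 1)           # (heads, Q, HW) bias


if __name__ == "__main__":
    bias = BoxRPB()(torch.rand(5, 4), torch.rand(28 * 28, 2))
    print(bias.shape)  # torch.Size([8, 5, 784]); added to attention logits pre-softmax
```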

UniSim: A Neural Closed-Loop Sensor Simulator

  • paper_url: http://arxiv.org/abs/2308.01898
  • repo_url: None
  • paper_authors: Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, Raquel Urtasun
  • for: closed-loop, safety-critical testing of self-driving vehicles (SDVs) with realistic simulated sensor data
  • methods: UniSim, a neural sensor simulator, converts a single recorded driving log from a sensor-equipped vehicle into a realistic closed-loop multi-sensor simulation: neural feature grids reconstruct the static background and dynamic actors and are composited to render LiDAR and camera data at new viewpoints, with actors added, removed or re-placed; learnable priors for dynamic objects and a convolutional completion network handle extrapolated views and unseen regions
  • results: experiments show UniSim simulates realistic sensor data with a small domain gap on downstream tasks, enabling closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world
    Abstract Rigorously testing autonomy systems is essential for making safe self-driving vehicles (SDV) a reality. It requires one to generate safety critical scenarios beyond what can be collected safely in the world, as many scenarios happen rarely on public roads. To accurately evaluate performance, we need to test the SDV on these scenarios in closed-loop, where the SDV and other actors interact with each other at each timestep. Previously recorded driving logs provide a rich resource to build these new scenarios from, but for closed loop evaluation, we need to modify the sensor data based on the new scene configuration and the SDV's decisions, as actors might be added or removed and the trajectories of existing actors and the SDV will differ from the original log. In this paper, we present UniSim, a neural sensor simulator that takes a single recorded log captured by a sensor-equipped vehicle and converts it into a realistic closed-loop multi-sensor simulation. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors in the scene, and composites them together to simulate LiDAR and camera data at new viewpoints, with actors added or removed and at new placements. To better handle extrapolated views, we incorporate learnable priors for dynamic objects, and leverage a convolutional network to complete unseen regions. Our experiments show UniSim can simulate realistic sensor data with small domain gap on downstream tasks. With UniSim, we demonstrate closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world.

DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

  • paper_url: http://arxiv.org/abs/2308.01890
  • repo_url: None
  • paper_authors: Ping Hu, Ximeng Sun, Stan Sclaroff, Kate Saenko
  • for: multi-label image recognition in the low-label regime, where previous methods that align textual and visual spaces to compensate for limited labels can suffer from the scarcity of high-quality multi-label annotations
  • methods: leveraging the strong text-visual alignment of models pre-trained on millions of auxiliary image-text pairs, the paper proposes Evidence-guided Dual Context Optimization (DualCoOp++), an efficient unified framework for partial-label and zero-shot multi-label recognition that encodes evidential, positive and negative contexts of each target class as parametric components of the textual prompt; a Winner-Take-All module promotes inter-class interaction without extra parameters
  • results: experiments on standard multi-label recognition benchmarks under two challenging low-label settings demonstrate superior performance compared with state-of-the-art methods
    Abstract Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
    Summary Multi-label image recognition in the low-label regime is a challenging and practically significant task. Previous work learns the alignment between visual and textual spaces to compensate for limited image labels, yet may suffer from the scarcity of high-quality multi-label annotations. In this work, we leverage the strong alignment between visual and textual features pretrained on millions of auxiliary image-text pairs. We propose an efficient framework, Evidence-guided Dual Context Optimization (DualCoOp++), as a unified approach to partial-label and zero-shot multi-label recognition. In DualCoOp++, the evidential, positive, and negative contexts of target classes are encoded separately as parameters of the textual input (i.e., prompts). The evidential context aims to discover all visual content related to the target class and serves as guidance for aggregating the positive and negative contexts from the spatial domain of the image, enabling better discrimination between similar categories. We further introduce a Winner-Take-All module that promotes inter-class interaction during training without extra parameters or cost. Since DualCoOp++ adds minimal learnable overhead to the pretrained vision-language framework, it adapts quickly to multi-label recognition with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks under two challenging low-label settings demonstrate superior performance over state-of-the-art methods.
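The prompt-based scoring described above can be illustrated with a small sketch: learnable evidential, positive, and negative context vectors score spatial image features, with the evidential context providing the spatial attention used to aggregate the other two. The feature dimensions, cosine scoring, and softmax aggregation below are illustrative assumptions rather than the exact DualCoOp++ formulation.

```python
import numpy as np

# Sketch only: dual-context prompt scoring for multi-label recognition.
# Dimensions and the aggregation scheme are illustrative assumptions.
rng = np.random.default_rng(0)
D, HW, C = 64, 49, 5                      # feature dim, spatial locations, classes
img_feats = rng.normal(size=(HW, D))      # spatial visual features from a frozen encoder
evid_ctx = rng.normal(size=(C, D))        # learnable "evidential" prompt embeddings
pos_ctx = rng.normal(size=(C, D))         # learnable positive prompts
neg_ctx = rng.normal(size=(C, D))         # learnable negative prompts

def cos(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Evidential context -> spatial attention over image locations for each class.
attn = softmax(cos(img_feats, evid_ctx).T, axis=-1)           # (C, HW)

# Aggregate positive / negative similarities with that attention.
pos_logit = (attn * cos(img_feats, pos_ctx).T).sum(axis=-1)   # (C,)
neg_logit = (attn * cos(img_feats, neg_ctx).T).sum(axis=-1)   # (C,)

# A class is predicted present when its positive logit wins over the negative one.
probs = softmax(np.stack([pos_logit, neg_logit]), axis=0)[0]
print((probs > 0.5).astype(int))
```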

FROD: Robust Object Detection for Free

  • paper_url: http://arxiv.org/abs/2308.01888
  • repo_url: None
  • paper_authors: Muhammad Awais, Weiming Zhuang, Lingjuan Lyu, Sung-Ho Bae
  • for: Improving the robustness of object detectors, especially against small adversarial perturbations.
  • methods: Use adversarially trained classification models as the backbone of the object detector, and modify the backbone to instill robustness without additional computational overhead.
  • results: Two lightweight components, an imitation loss and delayed adversarial training, further improve detector robustness; extensive experiments on the MS-COCO and Pascal VOC datasets demonstrate the effectiveness of the proposed approach.
    Abstract Object detection is a vital task in computer vision and has become an integral component of numerous critical systems. However, state-of-the-art object detectors, similar to their classification counterparts, are susceptible to small adversarial perturbations that can significantly alter their normal behavior. Unlike classification, the robustness of object detectors has not been thoroughly explored. In this work, we take the initial step towards bridging the gap between the robustness of classification and object detection by leveraging adversarially trained classification models. Merely utilizing adversarially trained models as backbones for object detection does not result in robustness. We propose effective modifications to the classification-based backbone to instill robustness in object detection without incurring any computational overhead. To further enhance the robustness achieved by the proposed modified backbone, we introduce two lightweight components: imitation loss and delayed adversarial training. Extensive experiments on the MS-COCO and Pascal VOC datasets are conducted to demonstrate the effectiveness of our proposed approach.
    Summary Object detection is a vital task in computer vision and has become an integral component of many critical systems. However, state-of-the-art object detectors, like their classification counterparts, are susceptible to small adversarial perturbations that can significantly alter their behavior. Unlike classification, the robustness of object detectors has not been thoroughly explored. In this work, we take an initial step toward bridging this gap by leveraging adversarially trained classification models. Merely using an adversarially trained model as the detection backbone does not yield robustness, so we propose effective modifications to the classification-based backbone that instill robustness in object detection without incurring any computational overhead. To further enhance robustness, we introduce two lightweight components: an imitation loss and delayed adversarial training. Extensive experiments on the MS-COCO and Pascal VOC datasets demonstrate the effectiveness of the proposed approach.
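A minimal sketch of the two lightweight components mentioned above, under the assumption that the imitation loss is a simple feature-matching term against the frozen robust backbone and that delayed adversarial training is a plain epoch-based schedule; the paper's exact formulations may differ.

```python
import numpy as np

# Sketch only: an imitation-style loss pushes the detection backbone to reproduce
# features of an adversarially trained (robust) classifier backbone. Shapes and the
# plain MSE form are illustrative assumptions.
rng = np.random.default_rng(0)
robust_feats = rng.normal(size=(2, 256, 14, 14))                        # frozen robust backbone features
det_feats = robust_feats + 0.1 * rng.normal(size=robust_feats.shape)    # detector backbone features

def imitation_loss(student, teacher):
    """Mean-squared error between detector features and frozen robust features."""
    return float(np.mean((student - teacher) ** 2))

print(imitation_loss(det_feats, robust_feats))

# Delayed adversarial training (schematic): train on clean images for the first
# `delay` epochs, then switch on adversarial examples for the detector.
def use_adversarial(epoch, delay=5):
    return epoch >= delay

print([use_adversarial(e) for e in range(8)])
```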

ConceptLab: Creative Generation using Diffusion Prior Constraints

  • paper_url: http://arxiv.org/abs/2308.02669
  • repo_url: https://github.com/kfirgoldberg/ConceptLab
  • paper_authors: Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
  • for: This work introduces the task of creative text-to-image generation: generating a new concept that has never been seen before.
  • methods: We leverage Diffusion Prior models and show that the creative generation problem can be formulated as an optimization over the output space of the diffusion prior, yielding a set of "prior constraints".
  • results: Adaptively adding constraints from a question-answering model to the optimization keeps the generated concept from converging to existing category members and encourages increasingly unique creations; the prior constraints also act as a strong mixing mechanism, allowing hybrids between generated concepts and adding further flexibility to the creative process.
    Abstract Recent text-to-image generative models have enabled us to transform our words into vibrant, captivating imagery. The surge of personalization techniques that has followed has also allowed us to imagine unique concepts in new scenes. However, an intriguing question remains: How can we generate a new, imaginary concept that has never been seen before? In this paper, we present the task of creative text-to-image generation, where we seek to generate new members of a broad category (e.g., generating a pet that differs from all existing pets). We leverage the under-studied Diffusion Prior models and show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior, resulting in a set of "prior constraints". To keep our generated concept from converging into existing members, we incorporate a question-answering model that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly more unique creations. Finally, we show that our prior constraints can also serve as a strong mixing mechanism allowing us to create hybrids between generated concepts, introducing even more flexibility into the creative process.
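A minimal sketch of the constraint-based optimization idea: a candidate concept embedding is pulled toward a broad category embedding and pushed away from known members of that category. The embedding dimension, loss weights, projected gradient steps, and the fixed negative set (in the paper, new constraints are added adaptively by a question-answering model) are illustrative assumptions.

```python
import numpy as np

# Sketch only: optimize a concept embedding to stay close to a broad category while
# avoiding existing members of that category. All quantities are illustrative.
rng = np.random.default_rng(0)
D = 32
category = rng.normal(size=D); category /= np.linalg.norm(category)
members = rng.normal(size=(3, D)); members /= np.linalg.norm(members, axis=1, keepdims=True)
concept = category + 0.1 * rng.normal(size=D)

def normalize(v):
    return v / np.linalg.norm(v)

def loss_and_grad(x, positive, negatives, lam=0.5):
    """Reward similarity to the category, penalize similarity to the closest known member."""
    x = normalize(x)
    loss = -x @ positive + lam * np.max(negatives @ x)
    grad = -positive + lam * negatives[np.argmax(negatives @ x)]
    return loss, grad

# Projected gradient descent on the unit sphere.
for step in range(200):
    _, g = loss_and_grad(concept, category, members)
    concept = normalize(concept - 0.05 * g)

print("similarity to category:", round(float(category @ concept), 3))
print("max similarity to known members:", round(float(np.max(members @ concept)), 3))
```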

Reconstructing Three-Dimensional Models of Interacting Humans

  • paper_url: http://arxiv.org/abs/2308.01854
  • repo_url: https://github.com/sminchisescu-research/imar_vision_datasets_tools
  • paper_authors: Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, Cristian Sminchisescu
  • for: This paper focuses on improving the accuracy of 3D human interactions, which is essential for fine-grained scene analysis and behavioral modeling.
  • methods: The authors introduce several new models for interaction signature estimation (ISP), including contact detection, segmentation, and 3D contact signature prediction, and show how these components can be used to ensure contact consistency during 3D reconstruction.
  • results: The authors construct several large datasets for learning and evaluating 3D contact prediction and reconstruction methods, including CHI3D and FlickrCI3D. They also propose a methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup and annotate all 3D interaction motions in CHI3D with textual descriptions.
    Abstract Understanding 3d human interactions is fundamental for fine-grained scene analysis and behavioural modeling. However, most of the existing models predict incorrect, lifeless 3d estimates, that miss the subtle human contact aspects--the essence of the event--and are of little use for detailed behavioral understanding. This paper addresses such issues with several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3d contact signature prediction; (2) we show how such components can be leveraged to ensure contact consistency during 3d reconstruction; (3) we construct several large datasets for learning and evaluating 3d contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3d motion capture dataset with 631 sequences containing $2,525$ contact events, $728,664$ ground truth 3d poses, as well as FlickrCI3D, a dataset of $11,216$ images, with $14,081$ processed pairs of people, and $81,233$ facet-level surface correspondences. Finally, (4) we propose methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup and (5) annotate all 3d interaction motions in CHI3D with textual descriptions. Motion data in multiple formats (GHUM and SMPLX parameters, Human3.6m 3d joints) is made available for research purposes at \url{https://ci3d.imar.ro}, together with an evaluation server and a public benchmark.
    Summary Understanding 3D human interactions is fundamental for fine-grained scene analysis and behavioral modeling, yet most existing models predict incorrect, lifeless 3D estimates that miss the subtle contact aspects that are the essence of the event and are of little use for detailed behavioral understanding. This paper addresses these issues with several contributions: (1) models for interaction signature estimation (ISP), covering contact detection, segmentation, and 3D contact signature prediction; (2) a demonstration of how these components ensure contact consistency during 3D reconstruction; (3) several large datasets for learning and evaluating 3D contact prediction and reconstruction, including CHI3D, a lab-based, accurate 3D motion-capture dataset with 631 sequences containing 2,525 contact events and 728,664 ground-truth 3D poses, and FlickrCI3D, a dataset of 11,216 images with 14,081 processed pairs of people and 81,233 facet-level surface correspondences; (4) a methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup; and (5) textual descriptions for all 3D interaction motions in CHI3D. Motion data in multiple formats (GHUM and SMPLX parameters, Human3.6m 3D joints) is made available for research purposes at \url{https://ci3d.imar.ro}, together with an evaluation server and a public benchmark.
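A minimal sketch of a contact-consistency term of the kind mentioned above: given predicted correspondences between surface points on two people, the fitting objective penalizes the distance between matched points. The mesh sizes, the use of vertex indices as stand-ins for facets, and the plain squared-distance penalty are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

# Sketch only: penalize distances between surface points predicted to be in contact,
# so that joint 3D fitting of two people keeps the contact closed. Illustrative values.
rng = np.random.default_rng(0)
verts_a = rng.normal(size=(100, 3))                      # surface points of person A's fitted mesh
verts_b = verts_a + 0.02 * rng.normal(size=(100, 3))     # surface points of person B's fitted mesh
correspondences = [(3, 17), (41, 5), (77, 90)]           # predicted (point on A, point on B) contact pairs

def contact_consistency(va, vb, pairs):
    """Sum of squared distances between matched contact points on the two meshes."""
    return float(sum(np.sum((va[i] - vb[j]) ** 2) for i, j in pairs))

# This term would be added to the reconstruction objective during optimization.
print(contact_consistency(verts_a, verts_b, correspondences))
```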

Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data

  • paper_url: http://arxiv.org/abs/2308.01839
  • repo_url: https://github.com/rongstat/smai
  • paper_authors: Rong Ma, Eric D. Sun, David Donoho, James Zou
  • for: Single-cell data integration can provide a comprehensive molecular view of cells; many algorithms have been developed to remove unwanted technical or biological variation and integrate heterogeneous single-cell datasets.
  • methods: The spectral manifold alignment and inference (SMAI) framework provides a rigorous statistical test for whether datasets are alignable and performs structure-preserving integration of single-cell data.
  • results: SMAI outperforms commonly used alignment methods on a range of real and simulated benchmark datasets, and improves downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics.
    Abstract Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data. SMAI provides a statistical test to robustly determine the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
    Summary Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variation and integrate heterogeneous single-cell datasets. Despite their widespread use, existing methods have several fundamental limitations. In particular, there is no rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned), and popular methods can substantially distort the data during alignment, making the aligned data and downstream analyses difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework that enables principled, interpretable alignability testing and structure-preserving integration of single-cell data. SMAI provides a statistical test, justified by high-dimensional statistical theory, to robustly determine the alignability between datasets and avoid misleading inference. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, SMAI improves downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. Its interpretability also enables quantification and a deeper understanding of the sources of technical confounding in single-cell data.
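A minimal sketch in the spirit of spectral alignment with an alignability check: both datasets are embedded with a truncated SVD, an orthogonal Procrustes map is fitted between the embeddings, and the residual is compared against a permutation null. The rank, the assumption that rows are already matched one-to-one, and the permutation-based null are illustrative simplifications, not SMAI's actual test.

```python
import numpy as np

# Sketch only: spectral embedding + orthogonal Procrustes + a crude permutation-based
# alignability check. All modeling choices here are illustrative assumptions.
rng = np.random.default_rng(0)
n, p, r = 200, 50, 5
latent = rng.normal(size=(n, r))
X = latent @ rng.normal(size=(r, p)) + 0.1 * rng.normal(size=(n, p))   # dataset 1
Y = latent @ rng.normal(size=(r, p)) + 0.1 * rng.normal(size=(n, p))   # dataset 2 (same cells, other batch)

def spectral_embed(M, rank):
    """Truncated SVD embedding of a centered data matrix."""
    U, S, _ = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
    return U[:, :rank] * S[:rank]

def procrustes_residual(A, B):
    """Fit the best orthogonal map A -> B and return the normalized residual."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    return np.linalg.norm(A @ R - B) / np.linalg.norm(B)

A, B = spectral_embed(X, r), spectral_embed(Y, r)
observed = procrustes_residual(A, B)

# Permutation null: shuffling the rows of B should destroy any genuine alignment.
null = [procrustes_residual(A, B[rng.permutation(n)]) for _ in range(200)]
p_value = float(np.mean([observed >= v for v in null]))
print(f"residual={observed:.3f}  p-value vs permuted null={p_value:.3f}")
```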