cs.CV - 2023-07-02

X-MLP: A Patch Embedding-Free MLP Architecture for Vision

  • paper_url: http://arxiv.org/abs/2307.00592
  • repo_url: None
  • paper_authors: Xinyue Wang, Zhicheng Cai, Chenglei Peng
  • For: This paper proposes X-MLP, a new vision architecture designed to be independent of convolutions and self-attention operations.
  • Methods: X-MLP is built entirely from fully connected layers and is free from patch embedding. It decouples the features and uses MLPs to exchange information across the width, height, and channel dimensions independently and alternately.
  • Results: X-MLP is tested on ten benchmark datasets and obtains better performance than other vision MLP models, even surpassing CNNs by a clear margin on various datasets. By mathematically restoring the spatial weights, the paper also visualizes the information communication between any pair of pixels in the feature map and observes the capture of long-range dependencies.
    Abstract Convolutional neural networks (CNNs) and vision transformers (ViT) have obtained great achievements in computer vision. Recently, research on multi-layer perceptron (MLP) architectures for vision has become popular again. Vision MLPs are designed to be independent from convolutions and self-attention operations. However, existing vision MLP architectures always depend on convolution for patch embedding. Thus we propose X-MLP, an architecture constructed entirely upon fully connected layers and free from patch embedding. It decouples the features extremely and utilizes MLPs to exchange information across the width, height and channel dimensions independently and alternately. X-MLP is tested on ten benchmark datasets, obtaining better performance than other vision MLP models on all of them. It even surpasses CNNs by a clear margin on various datasets. Furthermore, through mathematically restoring the spatial weights, we visualize the information communication between any pair of pixels in the feature map and observe the phenomenon of capturing long-range dependency.
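    As a rough illustration of the axis-wise mixing idea (a minimal sketch, not the authors' implementation; layer sizes, residual connections, and the absence of normalization here are assumptions), each block applies plain fully connected layers along the width, height, and channel dimensions in turn, with no convolutional patch embedding:

```python
import torch
import torch.nn as nn

class AxisMLPBlock(nn.Module):
    """Mix information along width, then height, then channels with plain MLPs."""
    def __init__(self, channels, height, width, hidden=256):
        super().__init__()
        self.width_mlp = nn.Sequential(nn.Linear(width, hidden), nn.GELU(), nn.Linear(hidden, width))
        self.height_mlp = nn.Sequential(nn.Linear(height, hidden), nn.GELU(), nn.Linear(hidden, height))
        self.channel_mlp = nn.Sequential(nn.Linear(channels, hidden), nn.GELU(), nn.Linear(hidden, channels))

    def forward(self, x):  # x: (B, C, H, W), no patch embedding beforehand
        x = x + self.width_mlp(x)                                              # mix along width
        x = x + self.height_mlp(x.transpose(2, 3)).transpose(2, 3)             # mix along height
        x = x + self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # mix along channels
        return x

x = torch.randn(2, 16, 32, 32)
print(AxisMLPBlock(16, 32, 32)(x).shape)  # torch.Size([2, 16, 32, 32])
```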

ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

  • paper_url: http://arxiv.org/abs/2307.00586
  • repo_url: None
  • paper_authors: Debaditya Roy, Dhruv Verma, Basura Fernando
  • for: This paper targets the situation recognition task on images, i.e., describing the situation in an image through an activity verb and the semantic roles played by the actors and objects in it.
  • methods: The authors leverage the CLIP foundation model, which has learned the context of images via language descriptions. Deep-and-wide multi-layer perceptron (MLP) blocks operating on CLIP image and text embedding features are used for situation recognition, outperforming the state-of-the-art CoFormer model.
  • results: A cross-attention Transformer built on CLIP visual tokens (ClipSitu XTF) improves semantic role labelling (value) top-1 accuracy by 14.1%, achieving the best reported results on the imSitu dataset.
    Abstract Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence the model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. Therefore, we leverage the CLIP foundational model that has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features, and even outperform the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer using CLIP visual tokens that models the relation between textual roles and visual entities. Our cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy using the imSitu dataset. We will make the code publicly available.
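    A hedged sketch of the MLP variant described above (embedding dimensions, fusion by simple concatenation, and the noun-classification head are assumptions for illustration, not the authors' exact design): CLIP image and text embeddings for the verb and role are fed into a deep-and-wide MLP that predicts the noun filling that role.

```python
import torch
import torch.nn as nn

class ClipSituMLP(nn.Module):
    def __init__(self, clip_dim=512, hidden=1024, num_nouns=10000, depth=3):
        super().__init__()
        layers, in_dim = [], 3 * clip_dim          # image + verb + role embeddings (assumed fusion)
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(0.1)]
            in_dim = hidden
        self.mlp = nn.Sequential(*layers)
        self.noun_head = nn.Linear(hidden, num_nouns)

    def forward(self, img_emb, verb_emb, role_emb):
        fused = torch.cat([img_emb, verb_emb, role_emb], dim=-1)
        return self.noun_head(self.mlp(fused))      # logits over candidate nouns

logits = ClipSituMLP()(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10000])
```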

A multi-task learning framework for carotid plaque segmentation and classification from ultrasound images

  • paper_url: http://arxiv.org/abs/2307.00583
  • repo_url: None
  • paper_authors: Haitao Gan, Ran Zhou, Yanghan Ou, Furong Wang, Xinyao Cheng, Xiaoyan Wu, Aaron Fenster
  • for: The goal of this work is a multi-task learning framework for carotid plaque segmentation and classification in ultrasound images.
  • methods: The framework uses a region-weight module (RWM) and a sample-weight module (SWM) to exploit the correlation between the segmentation and classification tasks. The RWM provides plaque regional prior knowledge to the classification task, while the SWM learns categorical sample weights for the segmentation task.
  • results: Experiments show that the proposed method significantly outperforms networks trained for a single task, with a classification accuracy of 85.82% and a Dice similarity coefficient of 84.92% for segmentation. The ablation study shows that both the RWM and the SWM contribute to the improvement, suggesting the method could be useful for carotid plaque analysis in clinical trials and practice.
    Abstract Carotid plaque segmentation and classification play important roles in the treatment of atherosclerosis and assessment for risk of stroke. Although deep learning methods have been used for carotid plaque segmentation and classification, most focused on a single task and ignored the relationship between the segmentation and classification of carotid plaques. Therefore, we propose a multi-task learning framework for ultrasound carotid plaque segmentation and classification, which utilizes a region-weight module (RWM) and a sample-weight module (SWM) to exploit the correlation between these two tasks. The RWM provides a plaque regional prior knowledge to the classification task, while the SWM is designed to learn the categorical sample weight for the segmentation task. A total of 1270 2D ultrasound images of carotid plaques were collected from Zhongnan Hospital (Wuhan, China) for our experiments. The results of the experiments showed that the proposed method can significantly improve the performance compared to existing networks trained for a single task, with an accuracy of 85.82% for classification and a Dice similarity coefficient of 84.92% for segmentation. In the ablation study, the results demonstrated that both the designed RWM and SWM were beneficial in improving the network's performance. Therefore, we believe that the proposed method could be useful for carotid plaque analysis in clinical trials and practice.
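    A minimal sketch of the joint training objective, under assumptions: a shared encoder with a segmentation head and a classification head, per-sample weights standing in for the sample-weight idea, and masked average pooling of encoder features standing in for the region prior. This is illustrative and not the paper's exact RWM/SWM design.

```python
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_gt, cls_logits, cls_gt, sample_w):
    # Per-pixel segmentation loss, weighted per sample (stand-in for the SWM idea)
    seg = F.binary_cross_entropy_with_logits(seg_logits, seg_gt, reduction="none")
    seg = (seg.mean(dim=(1, 2, 3)) * sample_w).mean()
    # Plaque classification loss on the classification head
    cls = F.cross_entropy(cls_logits, cls_gt)
    return seg + cls

def region_pooled_features(feats, seg_prob, eps=1e-6):
    # Stand-in for the region prior: average encoder features inside the predicted plaque mask
    w = seg_prob / (seg_prob.sum(dim=(2, 3), keepdim=True) + eps)
    return (feats * w).sum(dim=(2, 3))   # (B, C) region descriptor fed to the classifier

loss = multitask_loss(torch.randn(4, 1, 64, 64), torch.randint(0, 2, (4, 1, 64, 64)).float(),
                      torch.randn(4, 3), torch.randint(0, 3, (4,)), torch.ones(4))
print(loss)
```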

TinySiamese Network for Biometric Analysis

  • paper_url: http://arxiv.org/abs/2307.00578
  • repo_url: None
  • paper_authors: Islem Jarraya, Tarek M. Hamdani, Habib Chabchoub, Adel M. Alimi
  • for: This paper focuses on improving the efficiency and applicability of biometric verification by replacing the standard Siamese network with a TinySiamese network.
  • methods: The TinySiamese approach does not require training the whole CNN: a pre-trained CNN is used as a feature extractor, and the TinySiamese network learns from the extracted features.
  • results: Using TinySiamese substantially reduces training and matching time while achieving accuracy comparable to or higher than the standard Siamese on verification and classification tasks.
    Abstract Biometric recognition is the process of verifying or classifying human characteristics in images or videos. It is a complex task that requires machine learning algorithms, including convolutional neural networks (CNNs) and Siamese networks. Besides, there are several limitations to consider when using these algorithms for image verification and classification tasks. In fact, training may be computationally intensive, requiring specialized hardware and significant computational resources to train and deploy. Moreover, it necessitates a large amount of labeled data, which can be time-consuming and costly to obtain. The main advantage of the proposed TinySiamese compared to the standard Siamese is that it does not require the whole CNN for training. In fact, using a pre-trained CNN as a feature extractor and the TinySiamese to learn the extracted features gave almost the same performance and efficiency as the standard Siamese for biometric verification. In this way, the TinySiamese solves the problems of memory and computational time with a small number of layers which did not exceed 7. It can be run on low-power machines that have an ordinary GPU and cannot allocate a large RAM space. Using TinySiamese with only 8 GB of memory, the matching time decreased by 76.78% on the B2F (Biometric images of Fingerprints and Faces), FVC2000, FVC2002 and FVC2004 datasets, while the training time for 10 epochs went down by approximately 93.14% on the B2F, FVC2002, THDD-part1 and CASIA-B datasets. The accuracy of the fingerprint, gait (NM-angle 180 degree) and face verification tasks was better than the accuracy of a standard Siamese by 0.87%, 20.24% and 3.85% respectively. TinySiamese achieved comparable accuracy with related works for the fingerprint and gait classification tasks.
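    A hedged sketch of the idea (the backbone choice, head width, and cosine-similarity comparison are illustrative assumptions, not the paper's exact architecture): a pre-trained CNN is frozen and used only to extract features, and a small trainable head of a few fully connected layers learns to compare feature pairs for verification.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TinySiameseHead(nn.Module):
    """Small trainable comparator on top of a frozen feature extractor."""
    def __init__(self, backbone, feat_dim=512, hidden=128):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False          # the CNN itself is never trained
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, xa, xb):
        with torch.no_grad():
            fa, fb = self.backbone(xa), self.backbone(xb)
        za, zb = self.net(fa), self.net(fb)
        return nn.functional.cosine_similarity(za, zb)   # higher = more likely same identity

cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = nn.Identity()                                    # expose 512-d pooled features
sim = TinySiameseHead(cnn)(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(sim.shape)   # torch.Size([2])
```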

Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation

  • paper_url: http://arxiv.org/abs/2307.00574
  • repo_url: None
  • paper_authors: Tserendorj Adiya, Sanghun Kim, Jung Eun Lee, Jae Shin Yoon, Hwasup Lim
  • for: Generating temporally coherent human animation from a single image, a video, or random noise.
  • methods: Human animation has typically been modeled as auto-regressive generation, i.e., regressing past frames to decode future frames, but such unidirectional generation is prone to motion drift over time and produces unrealistic animation with artifacts such as appearance distortion. The authors argue that bidirectional temporal modeling enforces temporal coherence on the generative network, and design a denoising diffusion framework whose intermediate results are cross-conditioned bidirectionally between consecutive frames.
  • results: In experiments, the method shows strong performance and realistic temporal coherence compared to existing unidirectional approaches.
    Abstract We introduce a method to generate temporally coherent human animation from a single image, a video, or random noise. This problem has been formulated as modeling of an auto-regressive generation, i.e., to regress past frames to decode future frames. However, such unidirectional generation is highly prone to motion drifting over time, generating unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To prove our claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noises whose intermediate results are cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence.

A MIL Approach for Anomaly Detection in Surveillance Videos from Multiple Camera Views

  • paper_url: http://arxiv.org/abs/2307.00562
  • repo_url: https://github.com/santiagosilas/mc-vad-dataset-basedon-pets2009
  • paper_authors: Silas Santiago Lopes Pereira, José Everardo Bessa Maia
  • for: This work addresses anomaly detection in surveillance video, where anomalous events are rare, so the task is constrained by class imbalance and a lack of labeled anomaly data.
  • methods: The study combines Multiple Instance Learning (MIL), to cope with the lack of labels, with Multiple Camera views (MC), to reduce the effects of occlusion and clutter.
  • results: The multi-camera PETS-2009 benchmark dataset was re-labeled for anomaly detection, and a regression network trained with a multiple-camera combined loss function and Sultani's MIL ranking function achieves a significant F1-score improvement over the single-camera configuration.
    Abstract Occlusion and clutter are two scene states that make it difficult to detect anomalies in surveillance video. Furthermore, anomaly events are rare and, as a consequence, class imbalance and lack of labeled anomaly data are also key features of this task. Therefore, weakly supervised methods are heavily researched for this application. In this paper, we tackle these typical problems of anomaly detection in surveillance video by combining Multiple Instance Learning (MIL) to deal with the lack of labels and Multiple Camera Views (MC) to reduce occlusion and clutter effects. In the resulting MC-MIL algorithm we apply a multiple camera combined loss function to train a regression network with Sultani's MIL ranking function. To evaluate the MC-MIL algorithm first proposed here, the multiple camera PETS-2009 benchmark dataset was re-labeled for the anomaly detection task from multiple camera views. The result shows a significant performance improvement in F1 score compared to the single-camera configuration.
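    For reference, a small sketch of the MIL ranking objective of Sultani et al. that the paper builds on: the top segment score of an anomalous video should exceed the top score of a normal video by a margin, with temporal-smoothness and sparsity terms on the anomalous scores. The weights are illustrative, and the multi-camera combined loss of MC-MIL itself is not reproduced here.

```python
import torch

def mil_ranking_loss(scores_anom, scores_norm, lam_smooth=8e-5, lam_sparse=8e-5):
    # scores_*: (T,) segment scores in [0, 1] for one anomalous / one normal video
    hinge = torch.clamp(1.0 - scores_anom.max() + scores_norm.max(), min=0.0)
    smooth = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()   # temporal smoothness
    sparse = scores_anom.sum()                                    # anomalies should be rare
    return hinge + lam_smooth * smooth + lam_sparse * sparse

print(mil_ranking_loss(torch.rand(32), torch.rand(32)))
```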

Partial-label Learning with Mixed Closed-set and Open-set Out-of-candidate Examples

  • paper_url: http://arxiv.org/abs/2307.00553
  • repo_url: None
  • paper_authors: Shuo He, Lei Feng, Guowu Yang
  • for: This work studies how partial-label learning (PLL) should handle out-of-candidate (OOC) examples, i.e., training examples whose true label may lie outside the candidate label set.
  • methods: Two types of OOC examples are considered, closed-set and open-set, and they are differentiated with specially designed criteria. Closed-set OOC examples are handled with reversed label disambiguation in the non-candidate label set, while open-set OOC examples are exploited through an effective regularization strategy that dynamically assigns them random candidate labels.
  • results: The proposed method outperforms state-of-the-art PLL methods.
    Abstract Partial-label learning (PLL) relies on a key assumption that the true label of each training example must be in the candidate label set. This restrictive assumption may be violated in complex real-world scenarios, and thus the true label of some collected examples could be unexpectedly outside the assigned candidate label set. In this paper, we term the examples whose true label is outside the candidate label set OOC (out-of-candidate) examples, and pioneer a new PLL study to learn with OOC examples. We consider two types of OOC examples in reality, i.e., the closed-set/open-set OOC examples whose true label is inside/outside the known label space. To solve this new PLL problem, we first calculate the wooden cross-entropy loss from candidate and non-candidate labels respectively, and dynamically differentiate the two types of OOC examples based on specially designed criteria. Then, for closed-set OOC examples, we conduct reversed label disambiguation in the non-candidate label set; for open-set OOC examples, we leverage them for training by utilizing an effective regularization strategy that dynamically assigns random candidate labels from the candidate label set. In this way, the two types of OOC examples can be differentiated and further leveraged for model training. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art PLL methods.

ARHNet: Adaptive Region Harmonization for Lesion-aware Augmentation to Improve Segmentation Performance

  • paper_url: http://arxiv.org/abs/2307.01220
  • repo_url: https://github.com/king-haw/arhnet
  • paper_authors: Jiayu Huo, Yang Liu, Xi Ouyang, Alejandro Granados, Sebastien Ourselin, Rachel Sparks
  • for: To improve the segmentation of brain lesions in MRI scans, supporting patient prognosis and neurological monitoring.
  • methods: A CNN-based segmentation model is trained with advanced data augmentation; the proposed ARHNet framework, with its Adaptive Region Harmonization (ARH) module, aligns foreground feature maps to the background via an attention mechanism so that synthetic (augmented) images look more realistic.
  • results: ARHNet improves downstream segmentation performance using both real and synthetic images, and outperforms other image harmonization methods on the ATLAS 2.0 dataset.
    Abstract Accurately segmenting brain lesions in MRI scans is critical for providing patients with prognoses and neurological monitoring. However, the performance of CNN-based segmentation methods is constrained by the limited training set size. Advanced data augmentation is an effective strategy to improve the model's robustness. However, they often introduce intensity disparities between foreground and background areas and boundary artifacts, which weakens the effectiveness of such strategies. In this paper, we propose a foreground harmonization framework (ARHNet) to tackle intensity disparities and make synthetic images look more realistic. In particular, we propose an Adaptive Region Harmonization (ARH) module to dynamically align foreground feature maps to the background with an attention mechanism. We demonstrate the efficacy of our method in improving the segmentation performance using real and synthetic images. Experimental results on the ATLAS 2.0 dataset show that ARHNet outperforms other methods for image harmonization tasks, and boosts the down-stream segmentation performance. Our code is publicly available at https://github.com/King-HAW/ARHNet.

Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation

  • paper_url: http://arxiv.org/abs/2307.00536
  • repo_url: None
  • paper_authors: Meng Lan, Fu Rong, Lefei Zhang
  • for: To improve the accuracy of referring video object segmentation by introducing a plug-and-play module that strengthens spatio-temporal feature learning of the referred object across the video sequence.
  • methods: A new Transformer-based segmentation framework, IFIRVOS, is proposed, comprising a plug-and-play inter-frame interaction module and a vision-language interaction module.
  • results: Experimental results on three benchmarks show that IFIRVOS outperforms state-of-the-art methods and validate the effectiveness of the proposed modules.
    Abstract Referring video object segmentation (RVOS) aims to segment the target object in a video sequence described by a language expression. Typical query-based methods process the video sequence in a frame-independent manner to reduce the high computational cost, which however affects the performance due to the lack of inter-frame interaction for temporal coherence modeling and spatio-temporal representation learning of the referred object. Besides, they directly adopt the raw and high-level sentence feature as the language queries to decode the visual features, where the weak correlation between visual and linguistic features also increases the difficulty of decoding the target information and limits the performance of the model. In this paper, we propose a novel RVOS framework, dubbed IFIRVOS, to address these issues. Specifically, we design a plug-and-play inter-frame interaction module in the Transformer decoder to efficiently learn the spatio-temporal features of the referred object, so as to decode the object information in the video sequence more precisely and generate more accurate segmentation results. Moreover, we devise the vision-language interaction module before the multimodal Transformer to enhance the correlation between the visual and linguistic features, thus facilitating the process of decoding object information from visual features by language queries in Transformer decoder and improving the segmentation performance. Extensive experimental results on three benchmarks validate the superiority of our IFIRVOS over state-of-the-art methods and the effectiveness of our proposed modules.

End-to-End Out-of-distribution Detection with Self-supervised Sampling

  • paper_url: http://arxiv.org/abs/2307.00519
  • repo_url: None
  • paper_authors: Sen Pei, Jiaxi Sun, Peng Qin, Qi Chen, Xinglong Wu, Xun Wang
  • for: To improve out-of-distribution (OOD) detection accuracy, so that a model trained on a closed set can identify unknown data in the open world.
  • methods: The paper presents a general probabilistic framework that unifies many existing methods, and proposes an OOD-data-free model, Self-supervised Sampling for OOD Detection (SSOD), which exploits the local property of convolution to sample natural OOD signals from in-distribution (ID) data and jointly optimizes OOD detection and conventional ID classification.
  • results: SSOD establishes competitive state-of-the-art performance on several large-scale benchmarks, e.g., improving FPR95 on SUN from 48.99% to 35.52%, outperforming recent methods such as KNN by a large margin.
    Abstract Out-of-distribution (OOD) detection empowers the model trained on the closed set to identify unknown data in the open world. Though many prior techniques have yielded considerable improvements, two crucial obstacles still remain. Firstly, a unified perspective has yet to be presented to view the developed arts with individual designs, which is vital for providing insights into the related directions. Secondly, most research focuses on the post-processing schemes of the pre-trained features while disregarding the superiority of end-to-end training, dramatically limiting the upper bound of OOD detection. To tackle these issues, we propose a general probabilistic framework to interpret many existing methods and an OOD-data-free model, namely Self-supervised Sampling for OOD Detection (SSOD), to unfold the potential of end-to-end learning. SSOD efficiently exploits natural OOD signals from the in-distribution (ID) data based on the local property of convolution. With these supervisions, it jointly optimizes the OOD detection and conventional ID classification. Extensive experiments reveal that SSOD establishes competitive state-of-the-art performance on many large-scale benchmarks, where it outperforms the most recent approaches, such as KNN, by a large margin, e.g., 48.99% to 35.52% on SUN at FPR95.

SUGAR: Spherical Ultrafast Graph Attention Framework for Cortical Surface Registration

  • paper_url: http://arxiv.org/abs/2307.00511
  • repo_url: None
  • paper_authors: Jianxun Ren, Ning An, Youjia Zhang, Danyang Wang, Zhenyu Sun, Cong Lin, Weigang Cui, Weiwei Wang, Ying Zhou, Wei Zhang, Qingyu Hu, Ping Zhang, Dan Hu, Danhong Wang, Hesheng Liu
  • for: SUGAR is designed to improve cortical surface registration, specifically addressing the challenge of aligning cortical functional and anatomical features across individuals.
  • methods: SUGAR is a unified unsupervised deep-learning framework that incorporates a U-Net-based spherical graph attention network and leverages the Euler angle representation for deformation. The framework includes a similarity loss, fold loss, and multiple distortion losses to preserve topology and minimize various types of distortions.
  • results: SUGAR exhibits comparable or superior registration performance in accuracy, distortion, and test-retest reliability compared to conventional and learning-based methods. Additionally, SUGAR achieves remarkable sub-second processing times, registering 9,000 subjects from the UK Biobank dataset in just 32 minutes, a speed-up of approximately 12,000 times.
    Abstract Cortical surface registration plays a crucial role in aligning cortical functional and anatomical features across individuals. However, conventional registration algorithms are computationally inefficient. Recently, learning-based registration algorithms have emerged as a promising solution, significantly improving processing efficiency. Nonetheless, there remains a gap in the development of a learning-based method that exceeds the state-of-the-art conventional methods simultaneously in computational efficiency, registration accuracy, and distortion control, despite the theoretically greater representational capabilities of deep learning approaches. To address the challenge, we present SUGAR, a unified unsupervised deep-learning framework for both rigid and non-rigid registration. SUGAR incorporates a U-Net-based spherical graph attention network and leverages the Euler angle representation for deformation. In addition to the similarity loss, we introduce fold and multiple distortion losses, to preserve topology and minimize various types of distortions. Furthermore, we propose a data augmentation strategy specifically tailored for spherical surface registration, enhancing the registration performance. Through extensive evaluation involving over 10,000 scans from 7 diverse datasets, we showed that our framework exhibits comparable or superior registration performance in accuracy, distortion, and test-retest reliability compared to conventional and learning-based methods. Additionally, SUGAR achieves remarkable sub-second processing times, offering a notable speed-up of approximately 12,000 times in registering 9,000 subjects from the UK Biobank dataset in just 32 minutes. This combination of high registration performance and accelerated processing time may greatly benefit large-scale neuroimaging studies.

Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning

  • paper_url: http://arxiv.org/abs/2307.00498
  • repo_url: None
  • paper_authors: Jun Chen, Shipeng Bai, Tianxin Huang, Mengmeng Wang, Guanzhong Tian, Yong Liu
  • for: To recover the accuracy of quantized (compressed) models without requiring the original data or a fine-tuning process.
  • methods: A data-free mixed-precision compensation (DF-MPC) method is proposed: assuming the error introduced by a low-precision quantized layer can be restored through the reconstruction of a high-precision quantized layer, the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized counterpart is minimized in closed form.
  • results: Experiments show that DF-MPC achieves higher accuracy for ultra-low-precision quantized models than recent methods, without any data or fine-tuning.
    Abstract Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data-free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.

TopicFM+: Boosting Accuracy and Efficiency of Topic-Assisted Feature Matching

  • paper_url: http://arxiv.org/abs/2307.00485
  • repo_url: https://github.com/truongkhang/topicfm
  • paper_authors: Khang Truong Giang, Soohwan Song, Sungho Jo
  • for: This work tackles image matching in difficult scenarios, such as scenes with significant variations or limited texture, with a strong emphasis on computational efficiency.
  • methods: A topic-modeling strategy captures high-level context: each image is represented as a multinomial distribution over topics, where each topic denotes a latent semantic instance. These topics yield comprehensive, discriminative features, and feature matching is restricted to corresponding semantic regions by estimating the covisible topics; a pooling-and-merging attention module further reduces computation.
  • results: Extensive experiments demonstrate clear advantages in challenging scenarios: the method significantly reduces computational cost while maintaining high matching accuracy. Code will be updated at https://github.com/TruongKhang/TopicFM.
    Abstract This study tackles the challenge of image matching in difficult scenarios, such as scenes with significant variations or limited texture, with a strong emphasis on computational efficiency. Previous studies have attempted to address this challenge by encoding global scene contexts using Transformers. However, these approaches suffer from high computational costs and may not capture sufficient high-level contextual information, such as structural shapes or semantic instances. Consequently, the encoded features may lack discriminative power in challenging scenes. To overcome these limitations, we propose a novel image-matching method that leverages a topic-modeling strategy to capture high-level contexts in images. Our method represents each image as a multinomial distribution over topics, where each topic represents a latent semantic instance. By incorporating these topics, we can effectively capture comprehensive context information and obtain discriminative and high-quality features. Additionally, our method effectively matches features within corresponding semantic regions by estimating the covisible topics. To enhance the efficiency of feature matching, we have designed a network with a pooling-and-merging attention module. This module reduces computation by employing attention only on fixed-sized topics and small-sized features. Through extensive experiments, we have demonstrated the superiority of our method in challenging scenarios. Specifically, our method significantly reduces computational costs while maintaining higher image-matching accuracy compared to state-of-the-art methods. The code will be updated soon at https://github.com/TruongKhang/TopicFM

Seeing is not Believing: An Identity Hider for Human Vision Privacy Protection

  • paper_url: http://arxiv.org/abs/2307.00481
  • repo_url: None
  • paper_authors: Tao Wang, Yushu Zhang, Zixuan Yang, Hua Zhang, Zhongyun Hua
  • for: Privacy protection of face images while preserving identifiability for face recognizers.
  • methods: The latent space of StyleGAN2 is manipulated to generate a virtual face from the original one; the visual content of the virtual face is then transferred onto the original face, and the background is replaced with the original background.
  • results: The proposed identity hider achieves excellent privacy protection while preserving identifiability for face recognizers, with strong transferability, as confirmed by experiments.
    Abstract Massive captured face images are stored in the database for the identification of individuals. However, the stored images can be observed intentionally or unintentionally by data managers, which is not at the will of individuals and may cause privacy violations. Existing protection works only slightly change the visual content of the face while maintaining the utility of identification, making it susceptible to the inference of the true identity by human vision. In this paper, we propose an identity hider that enables significant visual content change for human vision while preserving high identifiability for face recognizers. Firstly, the identity hider generates a virtual face with new visual content by manipulating the latent space in StyleGAN2. In particular, the virtual face has the same irrelevant attributes as the original face, e.g., pose and expression. Secondly, the visual content of the virtual face is transferred into the original face and then the background is replaced with the original one. In addition, the identity hider has strong transferability, which ensures an arbitrary face recognizer can achieve satisfactory accuracy. Adequate experiments show that the proposed identity hider achieves excellent performance on privacy protection and identifiability preservation.

Domain Transfer Through Image-to-Image Translation for Uncertainty-Aware Prostate Cancer Classification

  • paper_url: http://arxiv.org/abs/2307.00479
  • repo_url: None
  • paper_authors: Meng Zhou, Amoon Jamzad, Jason Izard, Alexandre Menard, Robert Siemens, Parvin Mousavi
  • for: This work provides a deep learning-based pipeline that supports radiologists in diagnosing clinically significant prostate cancer more accurately, particularly in data-constrained settings.
  • methods: A novel domain-transfer pipeline translates unpaired 3.0T multi-parametric prostate MRIs to 1.5T to increase the amount of training data. Model uncertainty is estimated with an evidential deep learning approach, a dataset filtering technique is applied during training, and an Evidential Focal Loss combines the focal loss with evidential uncertainty.
  • results: The proposed method improves the Area Under the ROC Curve (AUC) by over 20% compared to previous work (98.4% vs. 76.2%), demonstrating its feasibility and effectiveness.
    Abstract Prostate Cancer (PCa) is often diagnosed using High-resolution 3.0 Tesla(T) MRI, which has been widely established in clinics. However, there are still many medical centers that use 1.5T MRI units in the actual diagnostic process of PCa. In the past few years, deep learning-based models have been proven to be efficient on the PCa classification task and can be successfully used to support radiologists during the diagnostic process. However, training such models often requires a vast amount of data, and sometimes it is unobtainable in practice. Additionally, multi-source MRIs can pose challenges due to cross-domain distribution differences. In this paper, we have presented a novel approach for unpaired image-to-image translation of prostate mp-MRI for classifying clinically significant PCa, to be applied in data-constrained settings. First, we introduce domain transfer, a novel pipeline to translate unpaired 3.0T multi-parametric prostate MRIs to 1.5T, to increase the number of training data. Second, we estimate the uncertainty of our models through an evidential deep learning approach; and leverage the dataset filtering technique during the training process. Furthermore, we introduce a simple, yet efficient Evidential Focal Loss that incorporates the focal loss with evidential uncertainty to train our model. Our experiments demonstrate that the proposed method significantly improves the Area Under ROC Curve (AUC) by over 20% compared to the previous work (98.4% vs. 76.2%). We envision that providing prediction uncertainty to radiologists may help them focus more on uncertain cases and thus expedite the diagnostic process effectively. Our code is available at https://github.com/med-i-lab/DT_UE_PCa
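    A brief sketch of the evidential-uncertainty side of the method, following standard evidential deep learning (Dirichlet evidence): the network outputs non-negative evidence per class, and the predictive uncertainty can be reported alongside the prediction. The softplus evidence head is an assumption, and the paper's exact Evidential Focal Loss and dataset-filtering scheme are not reproduced here.

```python
import torch
import torch.nn.functional as F

def evidential_prediction(logits):
    evidence = F.softplus(logits)              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                    # expected class probabilities
    uncertainty = logits.shape[-1] / strength  # high when total evidence is low
    return prob, uncertainty

prob, unc = evidential_prediction(torch.randn(4, 2))  # binary clinically-significant-PCa case
print(prob.sum(dim=-1), unc.squeeze(-1))
```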

Query-Efficient Decision-based Black-Box Patch Attack

  • paper_url: http://arxiv.org/abs/2307.00477
  • repo_url: None
  • paper_authors: Zhaoyu Chen, Bo Li, Shuang Wu, Shouhong Ding, Wenqiang Zhang
  • for: This paper is written to explore and improve the efficiency of black-box patch attacks on deep neural networks (DNNs), specifically in the decision-based setting.
  • methods: The paper proposes a new method called DevoPatch, which uses a differential evolutionary algorithm to optimize patches for black-box patch attacks. The method models patches using paired key-points and uses targeted images as the initialization of patches, and parameter optimizations are all performed on the integer domain.
  • results: The paper demonstrates that DevoPatch outperforms state-of-the-art black-box patch attacks in terms of patch area and attack success rate within a given query budget on image classification and face verification. Additionally, the paper conducts the vulnerability evaluation of ViT and MLP on image classification in the decision-based patch attack setting for the first time.
    Abstract Deep neural networks (DNNs) have been shown to be highly vulnerable to imperceptible adversarial perturbations. As a complementary type of adversary, patch attacks that introduce perceptible perturbations to the images have attracted the interest of researchers. Existing patch attacks rely on the architecture of the model or the probabilities of predictions and perform poorly in the decision-based setting, which can still construct a perturbation with the minimal information exposed -- the top-1 predicted label. In this work, we first explore the decision-based patch attack. To enhance the attack efficiency, we model the patches using paired key-points and use targeted images as the initialization of patches, and parameter optimizations are all performed on the integer domain. Then, we propose a differential evolutionary algorithm named DevoPatch for query-efficient decision-based patch attacks. Experiments demonstrate that DevoPatch outperforms the state-of-the-art black-box patch attacks in terms of patch area and attack success rate within a given query budget on image classification and face verification. Additionally, we conduct the vulnerability evaluation of ViT and MLP on image classification in the decision-based patch attack setting for the first time. Using DevoPatch, we can evaluate the robustness of models to black-box patch attacks. We believe this method could inspire the design and deployment of robust vision models based on various DNN architectures in the future.
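    A hedged illustration of the decision-based search: generic differential evolution over integer key-points with a label-only oracle. The fitness design, population size, and DE constants below are assumptions rather than the DevoPatch algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def paste_patch(img, target, kp):
    # kp = (y1, x1, y2, x2): integer corners; copy that rectangle from the target image
    y1, y2 = sorted((int(kp[0]), int(kp[2])))
    x1, x2 = sorted((int(kp[1]), int(kp[3])))
    out = img.copy()
    out[y1:y2, x1:x2] = target[y1:y2, x1:x2]
    return out

def fitness(kp, img, target, true_label, top1):
    # Only the top-1 decision is queried; unsuccessful patches receive a large penalty
    adv = paste_patch(img, target, kp)
    area = abs(int(kp[2]) - int(kp[0])) * abs(int(kp[3]) - int(kp[1]))
    return area if top1(adv) != true_label else area + img.size

def devo_attack(img, target, true_label, top1, pop=20, iters=100, f=0.5, cr=0.9):
    H, W = img.shape[:2]
    bounds = np.array([H - 1, W - 1, H - 1, W - 1])
    P = rng.integers(0, bounds + 1, size=(pop, 4))          # integer-domain population
    for _ in range(iters):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            mutant = np.clip(a + (f * (b - c)).astype(int), 0, bounds)
            trial = np.where(rng.random(4) < cr, mutant, P[i])
            if fitness(trial, img, target, true_label, top1) < fitness(P[i], img, target, true_label, top1):
                P[i] = trial
    return min(P, key=lambda kp: fitness(kp, img, target, true_label, top1))

# usage with an assumed oracle: best = devo_attack(img, target_img, y_true, lambda x: model_top1(x))
```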

Weighted Anisotropic-Isotropic Total Variation for Poisson Denoising

  • paper_url: http://arxiv.org/abs/2307.00439
  • repo_url: https://github.com/kbui1993/official_aitv_poisson_denoising
  • paper_authors: Kevin Bui, Yifei Lou, Fredrick Park, Jack Xin
  • for: This work proposes a variational Poisson denoising model to improve the quality of images corrupted by Poisson noise.
  • methods: The model uses the weighted anisotropic-isotropic total variation (AITV) as a regularizer and is solved with an alternating direction method of multipliers combined with a proximal operator for an efficient implementation.
  • results: Numerical experiments show that the algorithm outperforms other Poisson denoising methods in both image quality and computational efficiency.
    Abstract Poisson noise commonly occurs in images captured by photon-limited imaging systems such as in astronomy and medicine. As the distribution of Poisson noise depends on the pixel intensity value, noise levels vary from pixels to pixels. Hence, denoising a Poisson-corrupted image while preserving important details can be challenging. In this paper, we propose a Poisson denoising model by incorporating the weighted anisotropic-isotropic total variation (AITV) as a regularization. We then develop an alternating direction method of multipliers with a combination of a proximal operator for an efficient implementation. Lastly, numerical experiments demonstrate that our algorithm outperforms other Poisson denoising methods in terms of image quality and computational efficiency.
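    As a sketch of the variational model (the exact weighting and discretization are left to the paper; the form below uses the standard Poisson fidelity and the $\ell_1$ minus $\alpha$-weighted $\ell_2$ gradient penalty commonly denoted AITV):

$$\min_{u>0}\ \sum_i \big(u_i - f_i \log u_i\big) \;+\; \lambda \sum_i \Big( \|(\nabla u)_i\|_1 - \alpha \|(\nabla u)_i\|_2 \Big), \qquad \alpha \in [0,1],$$

    where $f$ is the Poisson-corrupted observation and the first sum is the Poisson (Kullback-Leibler) fidelity term. ADMM introduces a splitting variable for $\nabla u$ so that the $\ell_1 - \alpha\ell_2$ term can be handled by its proximal operator.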

One Copy Is All You Need: Resource-Efficient Streaming of Medical Imaging Data at Scale

  • paper_url: http://arxiv.org/abs/2307.00438
  • repo_url: https://github.com/um2ii/openjphpy
  • paper_authors: Pranav Kulkarni, Adway Kanhere, Eliot Siegel, Paul H. Yi, Vishwa S. Parekh
  • for: This work addresses the storage and bandwidth bottleneck created by large-scale medical imaging data and provides an open-source framework for progressive-resolution streaming.
  • methods: The open-source MIST framework stores each medical image as a single high-resolution copy and streams it to users at whatever resolution they request.
  • results: MIST reduces the imaging infrastructure inefficiencies of hosting and streaming medical images by more than 90% while maintaining diagnostic quality for deep learning applications.
    Abstract Large-scale medical imaging datasets have accelerated development of artificial intelligence tools for clinical decision support. However, the large size of these datasets is a bottleneck for users with limited storage and bandwidth. Many users may not even require such large datasets as AI models are often trained on lower resolution images. If users could directly download at their desired resolution, storage and bandwidth requirements would significantly decrease. However, it is impossible to anticipate every users' requirements and impractical to store the data at multiple resolutions. What if we could store images at a single resolution but send them at different ones? We propose MIST, an open-source framework to operationalize progressive resolution for streaming medical images at multiple resolutions from a single high-resolution copy. We demonstrate that MIST can dramatically reduce imaging infrastructure inefficiencies for hosting and streaming medical images by >90%, while maintaining diagnostic quality for deep learning applications.
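    A conceptual sketch of the progressive-resolution idea, using a plain wavelet transform via PyWavelets purely for illustration (MIST itself builds on High-Throughput JPEG 2000 through OpenJPH; the normalization below is an assumption): a single multi-resolution encoding stored once can be decoded at a coarse resolution from its approximation coefficients alone, or at full resolution when all levels are transferred.

```python
import numpy as np
import pywt

img = np.random.rand(512, 512)                 # stand-in for a full-resolution scan
coeffs = pywt.wavedec2(img, "haar", level=3)   # one stored multi-resolution encoding

thumb = coeffs[0] / (2 ** 3)                   # ~64x64 preview from the coarse band only (haar scales by 2 per level)
full = pywt.waverec2(coeffs, "haar")           # exact full resolution when all levels are sent
print(thumb.shape, full.shape)                 # (64, 64) (512, 512)
```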

Brightness-Restricted Adversarial Attack Patch

  • paper_url: http://arxiv.org/abs/2307.00421
  • repo_url: None
  • paper_authors: Mingzhen Shao
  • for: To make adversarial attack patches more practical in physical-world scenarios by reducing their reliance on bright colors, which makes them less likely to be noticed by human observers.
  • methods: Optical characteristics are used to reduce the conspicuousness of the patch (the brightness-restricted patch, BrPatch) while preserving image independence.
  • results: An analysis of how image features (color, texture, noise, and size) affect attack patches shows that they exhibit strong redundancy to brightness and are resistant to color transfer and noise. Based on these findings, additional methods are proposed to further reduce patch conspicuousness.
    Abstract Adversarial attack patches have gained increasing attention due to their practical applicability in physical-world scenarios. However, the bright colors used in attack patches represent a significant drawback, as they can be easily identified by human observers. Moreover, even though these attacks have been highly successful in deceiving target networks, which specific features of the attack patch contribute to its success are still unknown. Our paper introduces a brightness-restricted patch (BrPatch) that uses optical characteristics to effectively reduce conspicuousness while preserving image independence. We also conducted an analysis of the impact of various image features (such as color, texture, noise, and size) on the effectiveness of an attack patch in physical-world deployment. Our experiments show that attack patches exhibit strong redundancy to brightness and are resistant to color transfer and noise. Based on our findings, we propose some additional methods to further reduce the conspicuousness of BrPatch. Our findings also explain the robustness of attack patches observed in physical-world scenarios.

Applications of Binary Similarity and Distance Measures

  • paper_url: http://arxiv.org/abs/2307.00411
  • repo_url: None
  • paper_authors: Manoj Muniswamaiah, Tilak Agerwala, Charles C. Tappert
  • for: This paper surveys the applications of binary similarity and distance measures across different fields.
  • methods: The study reviews binary distance measures and similarity measurement methods.
  • results: Binary similarity measures are found to have broad applications in areas such as fingerprint-based biometric identification, handwritten character recognition, and iris image recognition.
    Abstract In the recent past, binary similarity measures have been applied in solving biometric identification problems, including fingerprint, handwritten character detection, and in iris image recognition. The application of the relevant measurements has also resulted in more accurate data analysis. This paper surveys the applicability of binary similarity and distance measures in various fields.
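    To make the surveyed quantities concrete, a small example computing a few classical binary similarity and distance measures from the 2x2 contingency counts of two binary feature vectors (the particular measures shown are illustrative choices):

```python
import numpy as np

def binary_measures(x, y):
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)          # 1-1 matches
    b = np.sum(x & ~y)         # 1-0 mismatches
    c = np.sum(~x & y)         # 0-1 mismatches
    d = np.sum(~x & ~y)        # 0-0 matches
    return {
        "jaccard":        a / (a + b + c),
        "dice":           2 * a / (2 * a + b + c),
        "hamming":        (b + c) / (a + b + c + d),   # normalized Hamming distance
        "sokal_michener": (a + d) / (a + b + c + d),
    }

print(binary_measures([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
```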

Improving CNN-based Person Re-identification using score Normalization

  • paper_url: http://arxiv.org/abs/2307.00397
  • repo_url: None
  • paper_authors: Ammar Chouchane, Abdelmalik Ouamane, Yassine Himeur, Wathiq Mansoor, Shadi Atalla, Afaf Benzaibak, Chahrazed Boudellal
  • For: This paper presents a person re-identification (PRe-ID) method that addresses the challenges of multiple camera views and changes in illumination and background.
  • Methods: A convolutional neural network (CNN) is used for feature extraction, and Cross-view Quadratic Discriminant Analysis (XQDA) is used for metric learning, together with a Mahalanobis-distance matching algorithm and a score normalization step.
  • Results: The approach is tested on four challenging datasets (VIPeR, GRID, CUHK01, and PRID450S) with promising results; for example, rank-20 accuracies of 61.92%, 83.90%, 92.03%, and 96.22% on GRID, CUHK01, VIPeR, and PRID450S without normalization increase to 64.64%, 89.30%, 92.78%, and 98.76% after score normalization.
    Abstract Person re-identification (PRe-ID) is a crucial task in security, surveillance, and retail analysis, which involves identifying an individual across multiple cameras and views. However, it is a challenging task due to changes in illumination, background, and viewpoint. Efficient feature extraction and metric learning algorithms are essential for a successful PRe-ID system. This paper proposes a novel approach for PRe-ID, which combines a Convolutional Neural Network (CNN) based feature extraction method with Cross-view Quadratic Discriminant Analysis (XQDA) for metric learning. Additionally, a matching algorithm that employs Mahalanobis distance and a score normalization process to address inconsistencies between camera scores is implemented. The proposed approach is tested on four challenging datasets, including VIPeR, GRID, CUHK01, and PRID450S, and promising results are obtained. For example, without normalization, the rank-20 rate accuracies of the GRID, CUHK01, VIPeR and PRID450S datasets were 61.92%, 83.90%, 92.03%, 96.22%; however, after score normalization, they have increased to 64.64%, 89.30%, 92.78%, and 98.76%, respectively. Accordingly, the promising results on four challenging datasets indicate the effectiveness of the proposed approach.
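    A hedged sketch of the matching stage (the metric matrix, feature dimensionality, and the use of z-score normalization as the score-normalization step are assumptions for illustration): gallery and probe CNN features are compared with a Mahalanobis distance under a learned metric such as XQDA's, and scores are normalized before ranking so that scores from different camera pairs are comparable.

```python
import numpy as np

def mahalanobis(probe, gallery, M):
    d = gallery - probe                       # (N, D) differences to the probe feature
    return np.einsum("nd,dk,nk->n", d, M, d)  # squared Mahalanobis distances

def z_normalize(scores):
    return (scores - scores.mean()) / (scores.std() + 1e-8)

rng = np.random.default_rng(0)
feat_dim, n_gallery = 64, 100
M = np.eye(feat_dim)                          # identity metric as a stand-in for XQDA's learned M
probe = rng.normal(size=feat_dim)
gallery = rng.normal(size=(n_gallery, feat_dim))

scores = -z_normalize(mahalanobis(probe, gallery, M))  # higher = better match after normalization
ranking = np.argsort(-scores)                 # rank-k list used for CMC evaluation
print(ranking[:5])
```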

MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications

  • paper_url: http://arxiv.org/abs/2307.00395
  • repo_url: https://github.com/sldgroup/mobilevig
  • paper_authors: Mustafa Munir, William Avery, Radu Marculescu
  • For: This paper proposes a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), and a hybrid CNN-GNN architecture, MobileViG, for vision tasks on mobile devices.
  • Methods: The SVGA mechanism is designed to reduce the computational cost of representing images as graph structures, making it more suitable for mobile devices. The MobileViG architecture combines SVGA with a CNN backbone for better performance and efficiency.
  • Results: MobileViG achieves state-of-the-art accuracy and efficiency on image classification, object detection, and instance segmentation on mobile devices. The fastest model, MobileViG-Ti, reaches 75.7% top-1 accuracy on ImageNet-1K with 0.78 ms inference latency, while the largest model, MobileViG-B, obtains 82.6% top-1 accuracy with only 2.30 ms latency.
    Abstract Traditionally, convolutional neural networks (CNN) and vision transformers (ViT) have dominated computer vision. However, recently proposed vision graph neural networks (ViG) provide a new avenue for exploration. Unfortunately, for mobile applications, ViGs are computationally expensive due to the overhead of representing images as graph structures. In this work, we propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices. Additionally, we propose the first hybrid CNN-GNN architecture for vision tasks on mobile devices, MobileViG, which uses SVGA. Extensive experiments show that MobileViG beats existing ViG models and existing mobile CNN and ViT architectures in terms of accuracy and/or speed on image classification, object detection, and instance segmentation tasks. Our fastest model, MobileViG-Ti, achieves 75.7% top-1 accuracy on ImageNet-1K with 0.78 ms inference latency on iPhone 13 Mini NPU (compiled with CoreML), which is faster than MobileNetV2x1.4 (1.02 ms, 74.7% top-1) and MobileNetV2x1.0 (0.81 ms, 71.8% top-1). Our largest model, MobileViG-B obtains 82.6% top-1 accuracy with only 2.30 ms latency, which is faster and more accurate than the similarly sized EfficientFormer-L3 model (2.77 ms, 82.4%). Our work proves that well designed hybrid CNN-GNN architectures can be a new avenue of exploration for designing models that are extremely fast and accurate on mobile devices. Our code is publicly available at https://github.com/SLDGroup/MobileViG.