cs.CV - 2023-07-06

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2307.03073
  • repo_url: https://github.com/IRVLUTD/Proto-CLIP
  • paper_authors: Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang
  • for: This work proposes a new framework for few-shot learning built on the large-scale vision-language model CLIP.
  • methods: The method uses image prototypes and text prototypes for few-shot learning; in particular, the image encoder and text encoder of CLIP are adapted jointly using the few-shot examples.
  • results: Experiments on multiple few-shot learning benchmarks and a real-world robot perception test show that the method is effective.
    Abstract We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by the unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP that utilizes image prototypes and text prototypes for few-shot learning. Specifically, PROTO-CLIP adapts the image encoder and text encoder in CLIP in a joint fashion using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such a proposed alignment is beneficial for few-shot classification due to the contributions from both types of prototypes. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning as well as in the real world for robot perception.
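    A minimal sketch of the prototype computation and alignment described above; the CLIP encoders are stood in by random features, and the temperature and loss weighting are assumptions rather than values from the paper:

```python
import torch
import torch.nn.functional as F

def prototypes(embeddings, labels, num_classes):
    """Average the normalized support embeddings of each class into a prototype."""
    protos = torch.stack([embeddings[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def proto_clip_losses(img_support, txt_support, labels, img_query, query_labels,
                      num_classes, tau=0.07, align_weight=1.0):
    """Few-shot classification plus prototype-alignment losses (illustrative only)."""
    img_proto = prototypes(F.normalize(img_support, dim=-1), labels, num_classes)
    txt_proto = prototypes(F.normalize(txt_support, dim=-1), labels, num_classes)

    q = F.normalize(img_query, dim=-1)
    # Classify queries against both prototype sets and combine the logits.
    logits = (q @ img_proto.t() + q @ txt_proto.t()) / tau
    cls_loss = F.cross_entropy(logits, query_labels)

    # Pull image and text prototypes of the same class together (alignment term).
    align_loss = (1.0 - (img_proto * txt_proto).sum(dim=-1)).mean()
    return cls_loss + align_weight * align_loss, logits

# Toy run with random features standing in for CLIP embeddings (4 classes, 4 shots).
torch.manual_seed(0)
img_s, txt_s = torch.randn(16, 512), torch.randn(16, 512)
lab_s = torch.arange(4).repeat_interleave(4)
img_q, lab_q = torch.randn(8, 512), torch.randint(0, 4, (8,))
loss, logits = proto_clip_losses(img_s, txt_s, lab_s, img_q, lab_q, num_classes=4)
print(loss.item(), logits.shape)
```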

PseudoCell: Hard Negative Mining as Pseudo Labeling for Deep Learning-Based Centroblast Cell Detection

  • paper_url: http://arxiv.org/abs/2307.03211
  • repo_url: https://github.com/IoBT-VISTEC/PseudoCell.git
  • paper_authors: Narongrid Seesawad, Piyalitt Ittichaiwong, Thapanun Sudhawiyangkul, Phattarapong Sawangjai, Peti Thuwajit, Paisarn Boonsakan, Supasan Sripodok, Kanyakorn Veerakanjana, Phoomraphee Luenam, Komgrid Charngkaew, Ananya Pongpaibul, Napat Angkathunyakul, Narit Hnoohom, Sumeth Yuenyong, Chanitra Thuwajit, Theerawit Wilaiprasitporn
  • For: The paper aims to assist pathologists in grading follicular lymphoma patients by automating centroblast detection in whole-slide images (WSI) of H&E-stained tissue samples using deep learning-based object detection frameworks.
  • Methods: The proposed method, called PseudoCell, incorporates centroblast labels from pathologists and combines them with pseudo-negative labels obtained from undersampled false-positive predictions using the cell's morphological features.
  • Results: PseudoCell accurately narrows down the areas requiring pathologists' attention during tissue examination, eliminating 58.18-99.35% of non-centroblast tissue areas on WSI, depending on the confidence threshold.
    Abstract Patch classification models based on deep learning have been utilized in whole-slide images (WSI) of H&E-stained tissue samples to assist pathologists in grading follicular lymphoma patients. However, these approaches still require pathologists to manually identify centroblast cells and provide refined labels for optimal performance. To address this, we propose PseudoCell, an object detection framework to automate centroblast detection in WSI (source code is available at https://github.com/IoBT-VISTEC/PseudoCell.git). This framework incorporates centroblast labels from pathologists and combines them with pseudo-negative labels obtained from undersampled false-positive predictions using the cell's morphological features. By employing PseudoCell, pathologists' workload can be reduced as it accurately narrows down the areas requiring their attention during examining tissue. Depending on the confidence threshold, PseudoCell can eliminate 58.18-99.35% of non-centroblasts tissue areas on WSI. This study presents a practical centroblast prescreening method that does not require pathologists' refined labels for improvement. Detailed guidance on the practical implementation of PseudoCell is provided in the discussion section.
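    The hard-negative-mining step can be illustrated as below: detections that overlap no pathologist-annotated centroblast are treated as false positives, undersampled, and fed back as pseudo-negative labels. The IoU threshold and keep ratio are assumptions, and the morphological-feature filtering used in the paper is omitted:

```python
import random

def mine_pseudo_negatives(predictions, gt_boxes, iou_thresh=0.3, keep_ratio=0.25, seed=0):
    """Turn undersampled false-positive detections into pseudo-negative labels.

    predictions: list of dicts {"box": (x1, y1, x2, y2), "score": float}
    gt_boxes: list of pathologist-annotated centroblast boxes.
    Returns boxes to add to training as an explicit negative class.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)

    # A prediction that overlaps no ground-truth centroblast is a false positive.
    false_pos = [p for p in predictions
                 if all(iou(p["box"], g) < iou_thresh for g in gt_boxes)]
    # Undersample so pseudo-negatives do not overwhelm the positive class.
    random.seed(seed)
    k = max(1, int(len(false_pos) * keep_ratio)) if false_pos else 0
    return [p["box"] for p in random.sample(false_pos, k)]

preds = [{"box": (0, 0, 10, 10), "score": 0.9}, {"box": (50, 50, 60, 60), "score": 0.8}]
gts = [(1, 1, 11, 11)]
print(mine_pseudo_negatives(preds, gts))  # only the (50, 50, 60, 60) box is a false positive
```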

EffLiFe: Efficient Light Field Generation via Hierarchical Sparse Gradient Descent

  • paper_url: http://arxiv.org/abs/2307.03017
  • repo_url: None
  • paper_authors: Yijie Deng, Lei Han, Tianpeng Lin, Lin Li, Jinzhi Zhang, Lu Fang
  • for: Accelerating real-time light field generation for Extended Reality (XR), in particular producing high-quality light fields from sparse view inputs.
  • methods: A Hierarchical Sparse Gradient Descent (HSGD) method operating on Multi-plane Images (MPI) to generate high-quality light fields in real time.
  • results: Runs 100x faster on average than state-of-the-art offline methods and delivers about 2 dB higher PSNR than other online approaches.
    Abstract With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field generation from sparse view inputs. Existing methods can be classified into offline techniques, which can generate high-quality novel views but at the cost of long inference/training time, and online methods, which either lack generalizability or produce unsatisfactory results. However, we have observed that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field generation while maintaining rendering quality. Based on this insight, we introduce EffLiFe, a novel light field optimization method, which leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse view images in real time. Technically, the coarse MPI of a scene is first generated using a 3D CNN, and it is further sparsely optimized by focusing only on important MPI gradients in a few iterations. Nevertheless, relying solely on optimization can lead to artifacts at occlusion boundaries. Therefore, we propose an occlusion-aware iterative refinement module that removes visual artifacts in occluded regions by iteratively filtering the input. Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods and delivering better performance (about 2 dB higher in PSNR) compared to other online approaches.
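    The sparse-gradient idea can be sketched as follows: compute the full gradient of the rendering loss with respect to the MPI, but update only the small fraction of entries with the largest gradient magnitude. The coarse 3D-CNN initialization, the hierarchy, and the occlusion-aware refinement are omitted; the sparsity level and loss below are placeholders:

```python
import torch

def sparse_gradient_step(mpi, loss_fn, sparsity=0.05, lr=0.1):
    """One sparse-gradient-descent-style step: compute the full gradient of the
    rendering loss w.r.t. the MPI, then apply the update only at the small fraction
    of entries with the largest gradient magnitude."""
    mpi = mpi.detach().requires_grad_(True)
    loss = loss_fn(mpi)
    grad, = torch.autograd.grad(loss, mpi)

    k = max(1, int(grad.numel() * sparsity))
    flat = grad.abs().flatten()
    thresh = flat.topk(k).values.min()          # keep only the k largest gradients
    mask = (grad.abs() >= thresh).float()
    return (mpi - lr * grad * mask).detach(), loss.item()

# Toy example: an "MPI" of 4 planes with RGBA at 32x32, fit towards a random target.
torch.manual_seed(0)
mpi = torch.rand(4, 4, 32, 32)
target = torch.rand(4, 4, 32, 32)
loss_fn = lambda x: ((x - target) ** 2).mean()
for it in range(3):
    mpi, l = sparse_gradient_step(mpi, loss_fn)
    print(f"iter {it}: loss {l:.4f}")
```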

Self-supervised learning via inter-modal reconstruction and feature projection networks for label-efficient 3D-to-2D segmentation

  • paper_url: http://arxiv.org/abs/2307.03008
  • repo_url: https://github.com/j-morano/multimodal-ssl-fpn
  • paper_authors: José Morano, Guilherme Aresta, Dmitrii Lachinov, Julia Mai, Ursula Schmidt-Erfurth, Hrvoje Bogunović
  • for: Proposes a new deep learning approach for label-efficient 3D-to-2D medical image segmentation, improving efficiency and reducing the workload of medical specialists.
  • methods: A novel convolutional neural network (CNN) with a 3D encoder and a 2D decoder connected by new 3D-to-2D blocks, together with a self-supervised learning (SSL) method that reconstructs image pairs of modalities with different dimensionality.
  • results: On two clinically relevant tasks (en-face segmentation of geographic atrophy and reticular pseudodrusen in OCT), the proposed CNN improves the state of the art by up to 8% in Dice score under limited labeled data; the SSL method further improves performance by up to 23% and is beneficial regardless of the network architecture.
    Abstract Deep learning has become a valuable tool for the automation of certain medical image segmentation tasks, significantly relieving the workload of medical specialists. Some of these tasks require segmentation to be performed on a subset of the input dimensions, the most common case being 3D-to-2D. However, the performance of existing methods is strongly conditioned by the amount of labeled data available, as there is currently no data efficient method, e.g. transfer learning, that has been validated on these tasks. In this work, we propose a novel convolutional neural network (CNN) and self-supervised learning (SSL) method for label-efficient 3D-to-2D segmentation. The CNN is composed of a 3D encoder and a 2D decoder connected by novel 3D-to-2D blocks. The SSL method consists of reconstructing image pairs of modalities with different dimensionality. The approach has been validated in two tasks with clinical relevance: the en-face segmentation of geographic atrophy and reticular pseudodrusen in optical coherence tomography. Results on different datasets demonstrate that the proposed CNN significantly improves the state of the art in scenarios with limited labeled data by up to 8% in Dice score. Moreover, the proposed SSL method allows further improvement of this performance by up to 23%, and we show that the SSL is beneficial regardless of the network architecture.
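    One plausible form of a 3D-to-2D block is a learned attention over the depth axis that collapses 3D encoder features into a 2D map for the decoder; the sketch below is illustrative and does not reproduce the paper's exact block design:

```python
import torch
import torch.nn as nn

class ThreeDToTwoDBlock(nn.Module):
    """Collapse the depth axis of a 3D feature map into a 2D one with a learned,
    per-location attention over depth. A plausible sketch of a "3D-to-2D block";
    the paper's actual design may differ."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)   # one attention score per voxel
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                     # x: (B, C, D, H, W)
        attn = torch.softmax(self.score(x), dim=2)            # normalize over depth
        collapsed = (x * attn).sum(dim=2)                      # (B, C, H, W)
        return self.proj(collapsed)

feat3d = torch.randn(2, 16, 8, 64, 64)   # e.g. encoder features from an OCT volume
print(ThreeDToTwoDBlock(16)(feat3d).shape)  # torch.Size([2, 16, 64, 64])
```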

Fourier-Net+: Leveraging Band-Limited Representation for Efficient 3D Medical Image Registration

  • paper_url: http://arxiv.org/abs/2307.02997
  • repo_url: https://github.com/xi-jia/fourier-net
  • paper_authors: Xi Jia, Alexander Thorley, Alberto Gomez, Wenqi Lu, Dipak Kotecha, Jinming Duan
  • for: This paper aims to make displacement field prediction in unsupervised image registration more efficient, where U-Net style networks are typically used to predict dense displacement fields for high-resolution 3D volumetric data.
  • methods: Fourier-Net replaces the costly expansive path of a U-Net style network with a parameter-free, model-driven decoder; instead of predicting a full-resolution displacement field directly, it learns a low-dimensional representation of the field in the band-limited Fourier domain. Fourier-Net+ additionally takes a band-limited spatial representation of the images as input and further reduces the number of convolutional layers in the contracting path; a cascaded version of Fourier-Net+ is also proposed to improve registration performance.
  • results: Evaluated on three datasets (including CT and MRI data), the proposed methods achieve results comparable to state-of-the-art approaches while offering faster inference, a lower memory footprint, and fewer multiply-add operations. This small computational cost allows Fourier-Net+ to efficiently train large-scale 3D registration on low-VRAM GPUs. The code is publicly available at https://github.com/xi-jia/Fourier-Net.
    Abstract U-Net style networks are commonly utilized in unsupervised image registration to predict dense displacement fields, which for high-resolution volumetric image data is a resource-intensive and time-consuming task. To tackle this challenge, we first propose Fourier-Net, which replaces the costly U-Net style expansive path with a parameter-free model-driven decoder. Instead of directly predicting a full-resolution displacement field, our Fourier-Net learns a low-dimensional representation of the displacement field in the band-limited Fourier domain which our model-driven decoder converts to a full-resolution displacement field in the spatial domain. Expanding upon Fourier-Net, we then introduce Fourier-Net+, which additionally takes the band-limited spatial representation of the images as input and further reduces the number of convolutional layers in the U-Net style network's contracting path. Finally, to enhance the registration performance, we propose a cascaded version of Fourier-Net+. We evaluate our proposed methods on three datasets, on which our proposed Fourier-Net and its variants achieve comparable results with current state-of-the art methods, while exhibiting faster inference speeds, lower memory footprint, and fewer multiply-add operations. With such small computational cost, our Fourier-Net+ enables the efficient training of large-scale 3D registration on low-VRAM GPUs. Our code is publicly available at \url{https://github.com/xi-jia/Fourier-Net}.
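    The parameter-free, model-driven decoder can be sketched directly from this description: zero-pad the predicted band-limited Fourier coefficients into a full-resolution spectrum and apply an inverse FFT to recover the spatial displacement field. The coefficient shapes below are arbitrary placeholders:

```python
import torch

def band_limited_to_full(disp_lowfreq_spec, full_shape):
    """Model-driven decoder sketch: embed a small, band-limited block of Fourier
    coefficients at the centre of a full-resolution spectrum (zero padding) and take
    the inverse FFT to obtain the spatial displacement field.

    disp_lowfreq_spec: complex tensor (3, d, h, w) of low-frequency coefficients
                       (one channel per displacement component), centred spectrum.
    full_shape: (D, H, W) of the target displacement field.
    """
    full = torch.zeros((3, *full_shape), dtype=torch.complex64)
    d, h, w = disp_lowfreq_spec.shape[1:]
    D, H, W = full_shape
    sd, sh, sw = (D - d) // 2, (H - h) // 2, (W - w) // 2
    full[:, sd:sd + d, sh:sh + h, sw:sw + w] = disp_lowfreq_spec
    # Undo the centring and transform back to the spatial domain.
    field = torch.fft.ifftn(torch.fft.ifftshift(full, dim=(1, 2, 3)), dim=(1, 2, 3))
    return field.real

# A network would predict these coefficients; random placeholders are used here.
low = torch.randn(3, 10, 12, 14, dtype=torch.complex64)
print(band_limited_to_full(low, (80, 96, 112)).shape)  # torch.Size([3, 80, 96, 112])
```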

Multi-modal multi-class Parkinson disease classification using CNN and decision level fusion

  • paper_url: http://arxiv.org/abs/2307.02978
  • repo_url: None
  • paper_authors: Sushanta Kumar Sahu, Ananda S. Chowdhury
  • for: This work proposes a direct three-class Parkinson's disease (PD) classification method using two different modalities, MRI and DTI.
  • methods: White matter and gray matter from MRI together with fractional anisotropy and mean diffusivity from DTI are used; four separate CNNs are trained on these four data types, and their outputs are fused at the decision level with an optimal weighted-average fusion technique.
  • results: On the publicly available PPMI database, the method achieves 95.53% accuracy for direct three-class classification of PD, HC, and SWEDD; extensive comparisons, including a series of ablation studies, clearly validate the effectiveness of the proposed solution.
    Abstract Parkinson disease is the second most common neurodegenerative disorder, as reported by the World Health Organization. In this paper, we propose a direct three-Class PD classification using two different modalities, namely, MRI and DTI. The three classes used for classification are PD, Scans Without Evidence of Dopamine Deficit and Healthy Control. We use white matter and gray matter from the MRI and fractional anisotropy and mean diffusivity from the DTI to achieve our goal. We train four separate CNNs on the above four types of data. At the decision level, the outputs of the four CNN models are fused with an optimal weighted average fusion technique. We achieve an accuracy of 95.53 percentage for the direct three class classification of PD, HC and SWEDD on the publicly available PPMI database. Extensive comparisons including a series of ablation studies clearly demonstrate the effectiveness of our proposed solution.
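    Decision-level fusion with optimized weights can be sketched as a weighted average of the four CNNs' class probabilities, with the weights chosen on a validation split. The grid search below is a simple stand-in for whatever optimization the authors actually use:

```python
import numpy as np
from itertools import product

def weighted_fusion(probs, weights):
    """Decision-level fusion: weighted average of the per-model class probabilities."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.asarray(probs), axes=1)   # (n_samples, n_classes)

def search_weights(probs, labels, step=0.1):
    """Grid-search the fusion weights on a validation split (illustrative only)."""
    best_w, best_acc = None, -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w in product(grid, repeat=len(probs)):
        if sum(w) == 0:
            continue
        acc = (weighted_fusion(probs, w).argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Four CNNs (WM, GM, FA, MD), three classes (PD / HC / SWEDD), random stand-in outputs.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(3), size=50) for _ in range(4)]
labels = rng.integers(0, 3, size=50)
w, acc = search_weights(probs, labels, step=0.25)
print("best weights:", w, "val accuracy:", acc)
```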

Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.02974
  • repo_url: None
  • paper_authors: Yuting Lu, Lingtong Min, Binglu Wang, Le Zheng, Xiaoxu Wang, Yongqiang Zhao, Teng Long
  • for: Enhancing the spatial detail and quality of satellite imagery via remote sensing image super-resolution.
  • methods: A Transformer architecture with cross-spatial pixel integration attention and cross-stage feature fusion attention to improve the model's global cognition of the whole image and its feature expression.
  • results: Extensive experiments on multiple benchmark datasets show that the proposed SPIFFNet outperforms state-of-the-art methods in both quantitative metrics and visual quality.
    Abstract Remote sensing image super-resolution (RSISR) plays a vital role in enhancing spatial detials and improving the quality of satellite imagery. Recently, Transformer-based models have shown competitive performance in RSISR. To mitigate the quadratic computational complexity resulting from global self-attention, various methods constrain attention to a local window, enhancing its efficiency. Consequently, the receptive fields in a single attention layer are inadequate, leading to insufficient context modeling. Furthermore, while most transform-based approaches reuse shallow features through skip connections, relying solely on these connections treats shallow and deep features equally, impeding the model's ability to characterize them. To address these issues, we propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features cross-stages. The model incorporates cross-spatial pixel integration attention (CSPIA) to introduce contextual information into a local window, while cross-stage feature fusion attention (CSFFA) adaptively fuses features from the previous stage to improve feature expression in line with the requirements of the current stage. We conducted comprehensive experiments on multiple benchmark datasets, demonstrating the superior performance of our proposed SPIFFNet in terms of both quantitative metrics and visual quality when compared to state-of-the-art methods.

SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks

  • paper_url: http://arxiv.org/abs/2307.02953
  • repo_url: None
  • paper_authors: Junlong Cheng, Chengrui Gao, Fengjie Wang, Min Zhu
  • for: This paper proposes a lightweight network, SegNetr, for medical image segmentation.
  • methods: A novel SegNetr block performs local-global interactions dynamically at any stage with only linear complexity; a general information retention skip connection (IRSC) preserves the spatial location information of encoder features and fuses them accurately with decoder features.
  • results: On four mainstream medical image segmentation datasets, SegNetr uses 59% fewer parameters and 76% fewer GFLOPs than vanilla U-Net while achieving segmentation performance comparable to state-of-the-art methods; the proposed components can also be applied to other U-shaped networks to improve their segmentation performance.
    Abstract Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-term dependence based on convolution operation, which increases the overall number of parameters and computational complexity of the network; 2) simply fuse the features of encoder and decoder, ignoring the connection between their spatial locations. In this paper, we rethink the above problem and build a lightweight medical image segmentation network, called SegNetr. Specifically, we introduce a novel SegNetr block that can perform local-global interactions dynamically at any stage and with only linear complexity. At the same time, we design a general information retention skip connection (IRSC) to preserve the spatial location information of encoder features and achieve accurate fusion with the decoder features. We validate the effectiveness of SegNetr on four mainstream medical image segmentation datasets, with 59\% and 76\% fewer parameters and GFLOPs than vanilla U-Net, while achieving segmentation performance comparable to state-of-the-art methods. Notably, the components proposed in this paper can be applied to other U-shaped networks to improve their segmentation performance.

DisAsymNet: Disentanglement of Asymmetrical Abnormality on Bilateral Mammograms using Self-adversarial Learning

  • paper_url: http://arxiv.org/abs/2307.02935
  • repo_url: None
  • paper_authors: Xin Wang, Tao Tan, Yuan Gao, Luyi Han, Tianyu Zhang, Chunyao Lu, Regina Beets-Tan, Ruisheng Su, Ritse Mann
  • for: This work provides a framework for disentangling and interpreting asymmetrical abnormalities in bilateral mammograms.
  • methods: Asymmetrical abnormality transformer-guided self-adversarial learning is used to disentangle abnormalities from symmetric normal mammograms; the method is additionally guided in part by randomly synthesized abnormalities.
  • results: Experiments on three public datasets and one in-house dataset show that the method outperforms existing methods in abnormality classification, segmentation, and localization; the reconstructed normal mammograms also provide better interpretable visual cues for clinical diagnosis.
    Abstract Asymmetry is a crucial characteristic of bilateral mammograms (Bi-MG) when abnormalities are developing. It is widely utilized by radiologists for diagnosis. The question of 'what the symmetrical Bi-MG would look like when the asymmetrical abnormalities have been removed ?' has not yet received strong attention in the development of algorithms on mammograms. Addressing this question could provide valuable insights into mammographic anatomy and aid in diagnostic interpretation. Hence, we propose a novel framework, DisAsymNet, which utilizes asymmetrical abnormality transformer guided self-adversarial learning for disentangling abnormalities and symmetric Bi-MG. At the same time, our proposed method is partially guided by randomly synthesized abnormalities. We conduct experiments on three public and one in-house dataset, and demonstrate that our method outperforms existing methods in abnormality classification, segmentation, and localization tasks. Additionally, reconstructed normal mammograms can provide insights toward better interpretable visual cues for clinical diagnosis. The code will be accessible to the public.

A Real-time Human Pose Estimation Approach for Optimal Sensor Placement in Sensor-based Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2307.02906
  • repo_url: None
  • paper_authors: Orhan Konak, Alexander Wischmann, Robin van de Water, Bert Arnrich
  • for: This paper introduces a new method for determining the optimal sensor placement in sensor-based human activity recognition, supporting data anonymization and multimodal recognition.
  • methods: Real-time 2D pose estimation is applied to video recordings of the target activities to derive skeleton data, which is then used to identify the optimal sensor location.
  • results: A feasibility study applying inertial sensors to monitor 13 different activities across ten subjects shows that the vision-based placement method achieves results comparable to the conventional deep learning approach, providing a lightweight, on-device solution that enhances data anonymization and supports multimodal classification.
    Abstract Sensor-based Human Activity Recognition facilitates unobtrusive monitoring of human movements. However, determining the most effective sensor placement for optimal classification performance remains challenging. This paper introduces a novel methodology to resolve this issue, using real-time 2D pose estimations derived from video recordings of target activities. The derived skeleton data provides a unique strategy for identifying the optimal sensor location. We validate our approach through a feasibility study, applying inertial sensors to monitor 13 different activities across ten subjects. Our findings indicate that the vision-based method for sensor placement offers comparable results to the conventional deep learning approach, demonstrating its efficacy. This research significantly advances the field of Human Activity Recognition by providing a lightweight, on-device solution for determining the optimal sensor placement, thereby enhancing data anonymization and supporting a multimodal classification approach.

RefVSR++: Exploiting Reference Inputs for Reference-based Video Super-resolution

  • paper_url: http://arxiv.org/abs/2307.02897
  • repo_url: None
  • paper_authors: Han Zou, Masanori Suganuma, Takayuki Okatani
  • for: This paper aims to improve image quality in video super-resolution.
  • methods: It combines reference-based super-resolution and video super-resolution, aggregating the low-resolution and reference inputs in parallel temporal streams with enhanced temporal feature alignment.
  • results: Experiments show that RefVSR++ outperforms RefVSR by over 1 dB in PSNR, achieving a new state of the art.
    Abstract Smartphones equipped with a multi-camera system comprising multiple cameras with different field-of-view (FoVs) are becoming more prevalent. These camera configurations are compatible with reference-based SR and video SR, which can be executed simultaneously while recording video on the device. Thus, combining these two SR methods can improve image quality. Recently, Lee et al. have presented such a method, RefVSR. In this paper, we consider how to optimally utilize the observations obtained, including input low-resolution (LR) video and reference (Ref) video. RefVSR extends conventional video SR quite simply, aggregating the LR and Ref inputs over time in a single bidirectional stream. However, considering the content difference between LR and Ref images due to their FoVs, we can derive the maximum information from the two image sequences by aggregating them independently in the temporal direction. Then, we propose an improved method, RefVSR++, which can aggregate two features in parallel in the temporal direction, one for aggregating the fused LR and Ref inputs and the other for Ref inputs over time. Furthermore, we equip RefVSR++ with enhanced mechanisms to align image features over time, which is the key to the success of video SR. We experimentally show that RefVSR++ outperforms RefVSR by over 1dB in PSNR, achieving the new state-of-the-art.

Probabilistic and Semantic Descriptions of Image Manifolds and Their Applications

  • paper_url: http://arxiv.org/abs/2307.02881
  • repo_url: None
  • paper_authors: Peter Tu, Zhaoyuan Yang, Richard Hartley, Zhiwei Xu, Jing Zhang, Dylan Campbell, Jaskirat Singh, Tianyu Wang
  • for: This work aims to model the statistical distribution of image data by estimating probability density functions over image manifolds.
  • methods: Popular generative models, such as normalizing flows and diffusion models, are used to model the distribution of image data.
  • results: These probabilistic descriptions can be used to construct defences against adversarial attacks, and semantic interpretations can be used to describe points on the manifold.
    Abstract This paper begins with a description of methods for estimating probability density functions for images that reflects the observation that such data is usually constrained to lie in restricted regions of the high-dimensional image space - not every pattern of pixels is an image. It is common to say that images lie on a lower-dimensional manifold in the high-dimensional space. However, although images may lie on such lower-dimensional manifolds, it is not the case that all points on the manifold have an equal probability of being images. Images are unevenly distributed on the manifold, and our task is to devise ways to model this distribution as a probability distribution. In pursuing this goal, we consider generative models that are popular in AI and computer vision community. For our purposes, generative/probabilistic models should have the properties of 1) sample generation: it should be possible to sample from this distribution according to the modelled density function, and 2) probability computation: given a previously unseen sample from the dataset of interest, one should be able to compute the probability of the sample, at least up to a normalising constant. To this end, we investigate the use of methods such as normalising flow and diffusion models. We then show that such probabilistic descriptions can be used to construct defences against adversarial attacks. In addition to describing the manifold in terms of density, we also consider how semantic interpretations can be used to describe points on the manifold. To this end, we consider an emergent language framework which makes use of variational encoders to produce a disentangled representation of points that reside on a given manifold. Trajectories between points on a manifold can then be described in terms of evolving semantic descriptions.

Towards accurate instance segmentation in large-scale LiDAR point clouds

  • paper_url: http://arxiv.org/abs/2307.02877
  • repo_url: https://github.com/bxiang233/panopticsegforlargescalepointcloud
  • paper_authors: Binbin Xiang, Torben Peters, Theodora Kontogianni, Frawa Vetterli, Stefano Puliti, Rasmus Astrup, Konrad Schindler
  • for: Improving the accuracy of outdoor scene understanding, from city mapping to forest management.
  • methods: A carefully designed clustering strategy that leverages multiple types of learned point embeddings to group points into object instances.
  • results: Significantly improved instance segmentation, making the approach better suited to inventory- and management-type applications.
    Abstract Panoptic segmentation is the combination of semantic and instance segmentation: assign the points in a 3D point cloud to semantic categories and partition them into distinct object instances. It has many obvious applications for outdoor scene understanding, from city mapping to forest management. Existing methods struggle to segment nearby instances of the same semantic category, like adjacent pieces of street furniture or neighbouring trees, which limits their usability for inventory- or management-type applications that rely on object instances. This study explores the steps of the panoptic segmentation pipeline concerned with clustering points into object instances, with the goal to alleviate that bottleneck. We find that a carefully designed clustering strategy, which leverages multiple types of learned point embeddings, significantly improves instance segmentation. Experiments on the NPM3D urban mobile mapping dataset and the FOR-instance forest dataset demonstrate the effectiveness and versatility of the proposed strategy.
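    The instance-clustering step can be illustrated as below: within each semantic class, points are grouped by clustering a concatenation of learned embeddings. DBSCAN and the embedding choice are stand-ins; the paper combines multiple embedding types in its own way:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(offset_embed, feat_embed, semantic_labels, eps=0.3, min_pts=10):
    """Group points of each semantic class into instances by clustering a
    concatenation of two learned embeddings (e.g. offset-shifted coordinates and a
    feature-space embedding). A sketch of the general recipe only."""
    instance_ids = np.full(len(offset_embed), -1, dtype=int)
    next_id = 0
    for cls in np.unique(semantic_labels):
        idx = np.where(semantic_labels == cls)[0]
        joint = np.concatenate([offset_embed[idx], feat_embed[idx]], axis=1)
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(joint)
        keep = labels >= 0                      # DBSCAN noise stays unassigned (-1)
        instance_ids[idx[keep]] = labels[keep] + next_id
        next_id = instance_ids.max() + 1 if keep.any() else next_id
    return instance_ids

# Toy data: two "trees" of the same semantic class, 3D offset embedding + 8-D features.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.05, (200, 3)), rng.normal(2, 0.05, (200, 3))])
feats = rng.normal(0, 0.05, (400, 8))
sem = np.zeros(400, dtype=int)
print(np.unique(cluster_instances(pts, feats, sem)))   # two instance ids expected
```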

Reference-based Motion Blur Removal: Learning to Utilize Sharpness in the Reference Image

  • paper_url: http://arxiv.org/abs/2307.02875
  • repo_url: None
  • paper_authors: Han Zou, Masanori Suganuma, Takayuki Okatani
  • for: Removing strong motion blur to improve image sharpness.
  • methods: Multiple images are used, including a reference image: local patches of the target and reference images are matched and their features fused to estimate a sharp image.
  • results: Experimental results demonstrate the effectiveness of the proposed method.
    Abstract Despite the recent advancement in the study of removing motion blur in an image, it is still hard to deal with strong blurs. While there are limits in removing blurs from a single image, it has more potential to use multiple images, e.g., using an additional image as a reference to deblur a blurry image. A typical setting is deburring an image using a nearby sharp image(s) in a video sequence, as in the studies of video deblurring. This paper proposes a better method to use the information present in a reference image. The method does not need a strong assumption on the reference image. We can utilize an alternative shot of the identical scene, just like in video deblurring, or we can even employ a distinct image from another scene. Our method first matches local patches of the target and reference images and then fuses their features to estimate a sharp image. We employ a patch-based feature matching strategy to solve the difficult problem of matching the blurry image with the sharp reference. Our method can be integrated into pre-existing networks designed for single image deblurring. The experimental results show the effectiveness of the proposed method.
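    The patch-based matching between the blurry target and the sharp reference can be sketched with unfold/fold operations and cosine similarity; feature extraction and the fusion network that follows are omitted, so this only illustrates the matching step:

```python
import torch
import torch.nn.functional as F

def match_reference_patches(blur_feat, ref_feat, patch=3):
    """For every local patch of the blurry image's feature map, find the most similar
    patch in the sharp reference feature map (cosine similarity) and return the
    matched reference features as a feature map."""
    b_patches = F.unfold(blur_feat, patch, padding=patch // 2)    # (B, C*p*p, N)
    r_patches = F.unfold(ref_feat, patch, padding=patch // 2)
    b_norm = F.normalize(b_patches, dim=1)
    r_norm = F.normalize(r_patches, dim=1)
    sim = torch.bmm(b_norm.transpose(1, 2), r_norm)               # (B, N_blur, N_ref)
    best = sim.argmax(dim=2)                                       # index of best ref patch
    matched = torch.gather(r_patches, 2,
                           best.unsqueeze(1).expand(-1, r_patches.size(1), -1))
    # Fold the matched patches back to a feature map (overlaps averaged).
    ones = torch.ones_like(matched)
    out_size = blur_feat.shape[-2:]
    summed = F.fold(matched, out_size, patch, padding=patch // 2)
    count = F.fold(ones, out_size, patch, padding=patch // 2)
    return summed / count

blur = torch.randn(1, 16, 24, 24)   # features of the blurry target (placeholder)
ref = torch.randn(1, 16, 24, 24)    # features of the sharp reference (placeholder)
print(match_reference_patches(blur, ref).shape)   # torch.Size([1, 16, 24, 24])
```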

MomentDiff: Generative Video Moment Retrieval from Random to Real

  • paper_url: http://arxiv.org/abs/2307.02869
  • repo_url: https://github.com/imccretrieval/momentdiff
  • paper_authors: Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, Yongdong Zhang
  • for: This work proposes an efficient and generalized video moment retrieval method that identifies the temporal segments of an untrimmed video corresponding to a given language description.
  • methods: A generative diffusion-based framework, MomentDiff, simulates the human retrieval process from random browsing to gradual localization: the real span is diffused to random noise, and the model learns to denoise it back to the original span under the guidance of text-video similarity, yielding a mapping from arbitrary random locations to real moments.
  • results: MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks and shows better generalization and robustness on the two proposed anti-bias datasets; the code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
    Abstract Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
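    The span-diffusion idea can be sketched as standard DDPM noising and denoising applied to a normalized (centre, width) span; the text-video conditioning and the learned denoiser are stood in by placeholders, and the schedule values are assumptions:

```python
import torch

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cum

def diffuse_span(span, t, alphas_cum):
    """Forward process: corrupt a ground-truth span (centre, width in [0, 1]) towards
    Gaussian noise, as in q(x_t | x_0)."""
    noise = torch.randn_like(span)
    return alphas_cum[t].sqrt() * span + (1 - alphas_cum[t]).sqrt() * noise, noise

def denoise_spans(model, T, alphas_cum, betas, n=4):
    """Reverse process sketch: start from random spans and iteratively refine them.
    `model(x_t, t)` is assumed to predict the noise, conditioned (in the real method)
    on text-video similarity features that are omitted here."""
    x = torch.randn(n, 2)
    for t in reversed(range(T)):
        eps = model(x, t)
        a_t, ac_t = 1 - betas[t], alphas_cum[t]
        x = (x - (1 - a_t) / (1 - ac_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x.clamp(0, 1)   # (centre, width), normalized by video duration

betas, alphas_cum = make_schedule()
noisy, _ = diffuse_span(torch.tensor([[0.4, 0.2]]), t=25, alphas_cum=alphas_cum)
dummy_model = lambda x, t: torch.zeros_like(x)   # placeholder noise predictor
print(noisy, denoise_spans(dummy_model, 50, alphas_cum, betas).shape)
```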

A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task

  • paper_url: http://arxiv.org/abs/2307.02862
  • repo_url: None
  • paper_authors: Shiqi Yang, Atsushi Hashimoto, Yoshitaka Ushiku
  • for: This work provides a short survey of current foundation models to examine whether they can be applied to different downstream tasks.
  • methods: It surveys current methods for discriminative dense recognition tasks built on pretrained foundation models and provides preliminary experimental analysis of an existing open-vocabulary segmentation method based on Stable Diffusion.
  • results: Existing foundation models can deliver strong performance on various downstream tasks, but the current way of deploying them (e.g., diffusion models for segmentation) is not optimal; these findings provide guidance for future research on adopting foundation models for downstream tasks.
    Abstract In recent years large model trained on huge amount of cross-modality data, which is usually be termed as foundation model, achieves conspicuous accomplishment in many fields, such as image recognition and generation. Though achieving great success in their original application case, it is still unclear whether those foundation models can be applied to other different downstream tasks. In this paper, we conduct a short survey on the current methods for discriminative dense recognition tasks, which are built on the pretrained foundation model. And we also provide some preliminary experimental analysis of an existing open-vocabulary segmentation method based on Stable Diffusion, which indicates the current way of deploying diffusion model for segmentation is not optimal. This aims to provide insights for future research on adopting foundation model for downstream task.

Deep Ensemble Learning with Frame Skipping for Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2307.02858
  • repo_url: https://github.com/Usman1021/DLE
  • paper_authors: Usman Muhammad, Md Ziaul Hoque, Mourad Oussalah, Jorma Laaksonen
  • for: Defending against face presentation (spoofing) attacks to improve the security of face recognition systems.
  • methods: A deep ensemble learning model with a frame-skipping mechanism extracts facial motion information from video for prediction.
  • results: Extensive experiments on four datasets achieve state-of-the-art detection performance (HTER) in the most challenging cross-dataset testing scenario.
    Abstract Face presentation attacks (PA), also known as spoofing attacks, pose a substantial threat to biometric systems that rely on facial recognition systems, such as access control systems, mobile payments, and identity verification systems. To mitigate the spoofing risk, several video-based methods have been presented in the literature that analyze facial motion in successive video frames. However, estimating the motion between adjacent frames is a challenging task and requires high computational cost. In this paper, we rephrase the face anti-spoofing task as a motion prediction problem and introduce a deep ensemble learning model with a frame skipping mechanism. In particular, the proposed frame skipping adopts a uniform sampling approach by dividing the original video into video clips of fixed size. By doing so, every nth frame of the clip is selected to ensure that the temporal patterns can easily be perceived during the training of three different recurrent neural networks (RNNs). Motivated by the performance of individual RNNs, a meta-model is developed to improve the overall detection performance by combining the prediction of individual RNNs. Extensive experiments were performed on four datasets, and state-of-the-art performance is reported on MSU-MFSD (3.12%), Replay-Attack (11.19%), and OULU-NPU (12.23%) databases by using half total error rates (HTERs) in the most challenging cross-dataset testing scenario.
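    The frame-skipping and ensemble steps can be illustrated as below: the video is split into fixed-size clips with one frame kept per clip, and the per-RNN scores are combined. A weighted average stands in for the learned meta-model:

```python
import numpy as np

def skip_frames(video, clip_size=8):
    """Uniform-sampling frame skipping: split the video into fixed-size clips and keep
    one frame per clip, so temporal patterns are preserved at a much lower frame rate."""
    n_clips = len(video) // clip_size
    return [video[i * clip_size] for i in range(n_clips)]

def ensemble_predict(rnn_scores, meta_weights=None):
    """Combine the live/spoof scores of several RNNs; a simple weighted average stands
    in here for the learned meta-model."""
    scores = np.asarray(rnn_scores, dtype=float)           # (n_models, n_samples)
    w = np.ones(len(scores)) / len(scores) if meta_weights is None else np.asarray(meta_weights)
    return (w / w.sum()) @ scores

video = [f"frame_{i:03d}" for i in range(64)]              # placeholder frame handles
print(skip_frames(video, clip_size=8))                     # 8 frames kept out of 64
print(ensemble_predict([[0.9, 0.2], [0.8, 0.3], [0.7, 0.1]]))
```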

Revisiting Computer-Aided Tuberculosis Diagnosis

  • paper_url: http://arxiv.org/abs/2307.02848
  • repo_url: None
  • paper_authors: Yun Liu, Yu-Huan Wu, Shi-Chen Zhang, Li Liu, Min Wu, Ming-Ming Cheng
  • For: This work aims to improve computer-aided tuberculosis diagnosis (CTD) using deep learning.
  • Methods: A large-scale tuberculosis chest X-ray dataset (TBX11K) is established, and the SymFormer model is proposed, which learns features through Symmetric Search Attention (SymAttention) and Symmetric Positional Encoding (SPE).
  • Results: Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset.
    Abstract Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Although early diagnosis and treatment can greatly improve the chances of survival, it remains a major challenge, especially in developing countries. Recently, computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. To address this, we establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD. Furthermore, we propose a strong baseline, SymFormer, for simultaneous CXR image classification and TB infection area detection. SymFormer incorporates Symmetric Search Attention (SymAttention) to tackle the bilateral symmetry property of CXR images for learning discriminative features. Since CXR images may not strictly adhere to the bilateral symmetry property, we also propose Symmetric Positional Encoding (SPE) to facilitate SymAttention through feature recalibration. To promote future research on CTD, we build a benchmark by introducing evaluation metrics, evaluating baseline models reformed from existing detectors, and running an online challenge. Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset. The data, code, and models will be released.

Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization

  • paper_url: http://arxiv.org/abs/2307.02836
  • repo_url: None
  • paper_authors: Shiqi Deng, Zhiyu Sun, Ruiyan Zhuang, Jun Gong
  • for: This work proposes a noise-to-norm reconstruction method for accurate industrial anomaly detection and localization.
  • methods: The reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end detection and localization.
  • results: Experiments show that the method effectively reconstructs anomalous regions into normal patterns, achieves more competitive results than the latest methods on the MPDD and VisA datasets, and sets a new state-of-the-art standard on MPDD.
    Abstract Anomaly detection has a wide range of applications and is especially important in industrial quality inspection. Currently, many top-performing anomaly-detection models rely on feature-embedding methods. However, these methods do not perform well on datasets with large variations in object locations. Reconstruction-based methods use reconstruction errors to detect anomalies without considering positional differences between samples. In this study, a reconstruction-based method using the noise-to-norm paradigm is proposed, which avoids the invariant reconstruction of anomalous regions. Our reconstruction network is based on M-net and incorporates multiscale fusion and residual attention modules to enable end-to-end anomaly detection and localization. Experiments demonstrate that the method is effective in reconstructing anomalous regions into normal patterns and achieving accurate anomaly detection and localization. On the MPDD and VisA datasets, our proposed method achieved more competitive results than the latest methods, and it set a new state-of-the-art standard on the MPDD dataset.
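    Once anomalous regions are reconstructed into normal patterns, localization reduces to scoring the difference between the input and its reconstruction. A minimal sketch of that scoring step (the reconstruction network itself and the kernel size are placeholders):

```python
import torch
import torch.nn.functional as F

def anomaly_map(image, reconstruction, blur_kernel=5):
    """Reconstruction-based scoring: the per-pixel difference between the input and
    its "reconstructed to normal" version localizes the anomaly; light smoothing
    suppresses isolated noise, and the max gives an image-level score."""
    err = ((image - reconstruction) ** 2).mean(dim=1, keepdim=True)   # (B, 1, H, W)
    err = F.avg_pool2d(err, blur_kernel, stride=1, padding=blur_kernel // 2)
    image_score = err.amax(dim=(1, 2, 3))                             # one score per image
    return err, image_score

# Placeholder tensors standing in for the input and the network's normal reconstruction.
img = torch.rand(2, 3, 256, 256)
recon = img.clone()
recon[0, :, 100:140, 100:140] += 0.5    # simulate a defective region restored to normal
amap, score = anomaly_map(img, recon)
print(amap.shape, score)
```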

Sampling-based Fast Gradient Rescaling Method for Highly Transferable Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2307.02828
  • repo_url: https://github.com/JHL-HUST/S-FGRM
  • paper_authors: Xu Han, Anmin Liu, Chenxuan Yao, Yanbo Fan, Kun He
  • for: This work aims to improve the efficiency and transferability of black-box adversarial attacks.
  • methods: Data rescaling is used to replace the sign function in gradient-based attacks, and a depth-first sampling method is proposed to stabilize the gradient update.
  • results: Experiments show that the method significantly boosts the transferability of gradient-based attacks and outperforms state-of-the-art baselines.
    Abstract Deep neural networks are known to be vulnerable to adversarial examples crafted by adding human-imperceptible perturbations to the benign input. After achieving nearly 100% attack success rates in white-box setting, more focus is shifted to black-box attacks, of which the transferability of adversarial examples has gained significant attention. In either case, the common gradient-based methods generally use the sign function to generate perturbations on the gradient update, that offers a roughly correct direction and has gained great success. But little work pays attention to its possible limitation. In this work, we observe that the deviation between the original gradient and the generated noise may lead to inaccurate gradient update estimation and suboptimal solutions for adversarial transferability. To this end, we propose a Sampling-based Fast Gradient Rescaling Method (S-FGRM). Specifically, we use data rescaling to substitute the sign function without extra computational cost. We further propose a Depth First Sampling method to eliminate the fluctuation of rescaling and stabilize the gradient update. Our method could be used in any gradient-based attacks and is extensible to be integrated with various input transformation or ensemble methods to further improve the adversarial transferability. Extensive experiments on the standard ImageNet dataset show that our method could significantly boost the transferability of gradient-based attacks and outperform the state-of-the-art baselines.
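    The overall structure, replacing the sign step with a rescaled gradient averaged over sampled neighbours, might look like the sketch below. The specific rescaling and the depth-first sampling scheme of S-FGRM are not reproduced; the per-sample L1 normalization and Gaussian neighbour sampling are assumptions made for illustration:

```python
import torch

def rescaled_grad_attack(model, x, y, eps=8/255, steps=10, n_samples=4, sigma=0.02):
    """I-FGSM-style attack where sign() is replaced by a rescaled gradient and the
    gradient is averaged over a few sampled neighbours to stabilize the update."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    alpha = eps / steps
    for _ in range(steps):
        grad = torch.zeros_like(x_adv)
        for _ in range(n_samples):                       # sample around the current point
            xs = (x_adv + sigma * torch.randn_like(x_adv)).requires_grad_(True)
            loss = loss_fn(model(xs), y)
            grad += torch.autograd.grad(loss, xs)[0]
        grad /= n_samples
        # Rescale instead of taking the sign: unit mean absolute value per sample.
        scale = grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12
        x_adv = x_adv + alpha * (grad / scale)
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1).detach()
    return x_adv

# Toy model and data just to show the call pattern.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(2, 3, 32, 32), torch.tensor([1, 7])
print(rescaled_grad_attack(model, x, y).shape)
```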

Bundle-specific Tractogram Distribution Estimation Using Higher-order Streamline Differential Equation

  • paper_url: http://arxiv.org/abs/2307.02825
  • repo_url: None
  • paper_authors: Yuanjing Feng, Lei Xie, Jingqiang Wang, Jianzhong He, Fei Gao
  • for: This method is designed for reconstructing fiber bundles in tractography, especially complex global fiber bundles.
  • methods: The method is based on a higher-order streamline differential equation and reconstructs streamline bundles in a 'cluster to cluster' manner; a global optimization model estimates bundle-specific tractogram distribution (BTD) coefficients to characterize the relationship between the fiber bundles and the bundle-specific tractogram.
  • results: Experiments show that the method can directly reconstruct complex global fiber bundles, reduce local error deviation and accumulation, and better reconstruct long-range, twisting, and large fanning tracts.
    Abstract Tractography traces the peak directions extracted from fiber orientation distribution (FOD) suffering from ambiguous spatial correspondences between diffusion directions and fiber geometry, which is prone to producing erroneous tracks while missing true positive connections. The peaks-based tractography methods 'locally' reconstructed streamlines in 'single to single' manner, thus lacking of global information about the trend of the whole fiber bundle. In this work, we propose a novel tractography method based on a bundle-specific tractogram distribution function by using a higher-order streamline differential equation, which reconstructs the streamline bundles in 'cluster to cluster' manner. A unified framework for any higher-order streamline differential equation is presented to describe the fiber bundles with disjoint streamlines defined based on the diffusion tensor vector field. At the global level, the tractography process is simplified as the estimation of bundle-specific tractogram distribution (BTD) coefficients by minimizing the energy optimization model, and is used to characterize the relations between BTD and diffusion tensor vector under the prior guidance by introducing the tractogram bundle information to provide anatomic priors. Experiments are performed on simulated Hough, Sine, Circle data, ISMRM 2015 Tractography Challenge data, FiberCup data, and in vivo data from the Human Connectome Project (HCP) data for qualitative and quantitative evaluation. The results demonstrate that our approach can reconstruct the complex global fiber bundles directly. BTD reduces the error deviation and accumulation at the local level and shows better results in reconstructing long-range, twisting, and large fanning tracts.

Single Image LDR to HDR Conversion using Conditional Diffusion

  • paper_url: http://arxiv.org/abs/2307.02814
  • repo_url: None
  • paper_authors: Dwip Dalal, Gautam Vashishtha, Prajwal Singh, Shanmuganathan Raman
  • for: This paper proposes a deep-learning-based method for recovering high dynamic range (HDR) images, restoring details lost in shadows and highlights of real scenes, using a conditional Denoising Diffusion Probabilistic Model (DDPM) framework.
  • methods: A deep CNN-based autoencoder improves the quality of the latent representation of the input low dynamic range (LDR) image used for conditioning, and a new Exposure Loss is introduced to direct gradients in the direction opposite to saturation.
  • results: Comprehensive quantitative and qualitative experiments demonstrate the method's effectiveness, indicating that a simple conditional diffusion-based method can replace complex camera pipeline-based architectures.
    Abstract Digital imaging aims to replicate realistic scenes, but Low Dynamic Range (LDR) cameras cannot represent the wide dynamic range of real scenes, resulting in under-/overexposed images. This paper presents a deep learning-based approach for recovering intricate details from shadows and highlights while reconstructing High Dynamic Range (HDR) images. We formulate the problem as an image-to-image (I2I) translation task and propose a conditional Denoising Diffusion Probabilistic Model (DDPM) based framework using classifier-free guidance. We incorporate a deep CNN-based autoencoder in our proposed framework to enhance the quality of the latent representation of the input LDR image used for conditioning. Moreover, we introduce a new loss function for LDR-HDR translation tasks, termed Exposure Loss. This loss helps direct gradients in the opposite direction of the saturation, further improving the results' quality. By conducting comprehensive quantitative and qualitative experiments, we have effectively demonstrated the proficiency of our proposed method. The results indicate that a simple conditional diffusion-based method can replace the complex camera pipeline-based architectures.
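    The abstract does not give the exact form of the Exposure Loss; one plausible reading, shown purely for illustration, is a reconstruction loss reweighted towards pixels that are under- or over-exposed in the conditioning LDR image, so gradients push hardest where the LDR lost detail:

```python
import torch

def exposure_loss(pred_hdr, target_hdr, ldr, low=0.05, high=0.95, boost=4.0):
    """A hypothetical 'Exposure Loss': weight the reconstruction error more heavily in
    regions where the conditioning LDR image is saturated. The paper's actual
    formulation may differ; thresholds and the boost factor are assumptions."""
    saturated = ((ldr < low) | (ldr > high)).float()
    weight = 1.0 + boost * saturated
    return (weight * (pred_hdr - target_hdr).abs()).mean()

pred = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
ldr = torch.rand(1, 3, 64, 64)
print(exposure_loss(pred, target, ldr).item())
```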

Advancing Zero-Shot Digital Human Quality Assessment through Text-Prompted Evaluation

  • paper_url: http://arxiv.org/abs/2307.02808
  • repo_url: https://github.com/zzc-1998/sjtu-h3d
  • paper_authors: Zicheng Zhang, Wei Sun, Yingjie Zhou, Haoning Wu, Chunyi Li, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin
  • for: This paper aims to address the lack of comprehensive digital human quality assessment (DHQA) databases by proposing a subjective quality assessment database called SJTU-H3D, which can serve as a benchmark for DHQA research.
  • methods: The proposed method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. The method employs the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporates the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information.
  • results: The proposed Digital Human Quality Index (DHQI) demonstrates significant improvements in zero-shot performance and can serve as a robust baseline for DHQA tasks, facilitating advancements in the field.
    Abstract Digital humans have witnessed extensive applications in various domains, necessitating related quality assessment studies. However, there is a lack of comprehensive digital human quality assessment (DHQA) databases. To address this gap, we propose SJTU-H3D, a subjective quality assessment database specifically designed for full-body digital humans. It comprises 40 high-quality reference digital humans and 1,120 labeled distorted counterparts generated with seven types of distortions. The SJTU-H3D database can serve as a benchmark for DHQA research, allowing evaluation and refinement of processing algorithms. Further, we propose a zero-shot DHQA approach that focuses on no-reference (NR) scenarios to ensure generalization capabilities while mitigating database bias. Our method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporate the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information. Additionally, we utilize dihedral angles as geometry descriptors to extract mesh features. By aggregating these measures, we introduce the Digital Human Quality Index (DHQI), which demonstrates significant improvements in zero-shot performance. The DHQI can also serve as a robust baseline for DHQA tasks, facilitating advancements in the field. The database and the code are available at https://github.com/zzc-1998/SJTU-H3D.
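    The geometry branch can be illustrated by extracting dihedral angles between adjacent mesh faces and summarizing them as a histogram; the CLIP semantic-affinity and NIQE terms, and the aggregation into DHQI, are omitted here:

```python
import numpy as np

def dihedral_angle_features(vertices, faces, bins=8):
    """Geometry descriptor sketch: dihedral angles between faces that share an edge,
    summarised as a histogram over [0, pi]."""
    def normal(f):
        a, b, c = vertices[f]
        n = np.cross(b - a, c - a)
        return n / (np.linalg.norm(n) + 1e-12)

    normals = np.array([normal(f) for f in faces])
    edge_to_faces = {}
    for fi, f in enumerate(faces):
        for e in [(f[0], f[1]), (f[1], f[2]), (f[2], f[0])]:
            edge_to_faces.setdefault(tuple(sorted(e)), []).append(fi)

    angles = []
    for fs in edge_to_faces.values():
        if len(fs) == 2:                                   # shared (interior) edge
            cosang = np.clip(np.dot(normals[fs[0]], normals[fs[1]]), -1.0, 1.0)
            angles.append(np.arccos(cosang))
    hist, _ = np.histogram(angles, bins=bins, range=(0, np.pi), density=True)
    return hist

# A toy tetrahedron stands in for a digital-human mesh.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
print(dihedral_angle_features(verts, faces))
```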

UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.02783
  • repo_url: None
  • paper_authors: Triet M. Thai, Anh T. Vo, Hao K. Tieu, Linh N. P. Bui, Thien T. B. Nguyen
  • for: This work aims to improve the performance of medical visual question answering on gastrointestinal endoscopy images, helping experts diagnose gastrointestinal diseases more accurately.
  • methods: A multimodal learning approach with a BERT encoder and different pretrained vision models (CNN- and Transformer-based) extracts features from the question and the endoscopy image; image enhancement is applied to further improve VQA performance.
  • results: Transformer-based vision models outperform the CNNs, and image enhancement yields a clear improvement, with six of the eight vision models achieving a better F1-score; the best method, BERT+BEiT fusion with image enhancement, achieves up to 87.25% accuracy and 91.85% F1-score on the development test set.
    Abstract In recent years, artificial intelligence has played an important role in medicine and disease diagnosis, with many applications to be mentioned, one of which is Medical Visual Question Answering (MedVQA). By combining computer vision and natural language processing, MedVQA systems can assist experts in extracting relevant information from medical image based on a given question and providing precise diagnostic answers. The ImageCLEFmed-MEDVQA-GI-2023 challenge carried out visual question answering task in the gastrointestinal domain, which includes gastroscopy and colonoscopy images. Our team approached Task 1 of the challenge by proposing a multimodal learning method with image enhancement to improve the VQA performance on gastrointestinal images. The multimodal architecture is set up with BERT encoder and different pre-trained vision models based on convolutional neural network (CNN) and Transformer architecture for features extraction from question and endoscopy image. The result of this study highlights the dominance of Transformer-based vision models over the CNNs and demonstrates the effectiveness of the image enhancement process, with six out of the eight vision models achieving better F1-Score. Our best method, which takes advantages of BERT+BEiT fusion and image enhancement, achieves up to 87.25% accuracy and 91.85% F1-Score on the development test set, while also producing good result on the private test set with accuracy of 82.01%.
    摘要 Our team participated in Task 1 of the challenge by proposing a multimodal learning method that incorporated image enhancement to improve the VQA performance on gastrointestinal images. Our approach used a BERT encoder and different pre-trained vision models based on convolutional neural networks (CNNs) and Transformer architecture for feature extraction from questions and endoscopy images. The results of our study showed that Transformer-based vision models outperformed CNNs, and the image enhancement process significantly improved the VQA performance. Our best method, which combined BERT+BEiT fusion and image enhancement, achieved up to 87.25% accuracy and 91.85% F1-Score on the development test set. Additionally, our method performed well on the private test set with an accuracy of 82.01%. These results demonstrate the effectiveness of our multimodal learning method and the importance of image enhancement in improving the performance of MedVQA systems.
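
As a rough illustration of the BERT + BEiT late-fusion idea described above, here is a minimal PyTorch sketch. The CLAHE-based `enhance` function stands in for the paper's image-enhancement step, and the fusion head (concatenation followed by an MLP) is an assumption rather than the authors' exact architecture.

```python
import cv2
import torch
import torch.nn as nn
from transformers import BertModel, BeitModel

def enhance(image_bgr):
    """Simple contrast enhancement (CLAHE on the lightness channel), standing in
    for the paper's image-enhancement step; the exact enhancement used may differ."""
    l, a, b = cv2.split(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB))
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    return cv2.cvtColor(cv2.merge([l, a, b]), cv2.COLOR_LAB2BGR)

class BertBeitVQA(nn.Module):
    """Late-fusion answer classifier over BERT question features and BEiT image features."""
    def __init__(self, num_answers):
        super().__init__()
        self.text_enc = BertModel.from_pretrained("bert-base-uncased")
        self.img_enc = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
        self.classifier = nn.Sequential(
            nn.Linear(768 + 768, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, num_answers),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        q = self.text_enc(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        v = self.img_enc(pixel_values=pixel_values).last_hidden_state.mean(dim=1)
        return self.classifier(torch.cat([q, v], dim=-1))
```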

SeLiNet: Sentiment enriched Lightweight Network for Emotion Recognition in Images

  • paper_url: http://arxiv.org/abs/2307.02773
  • repo_url: None
  • paper_authors: Tuneer Khargonkar, Shwetank Choudhary, Sumit Kumar, Barath Raj KR
  • for: 这个论文是为了进行情感识别 tasks 的轻量级网络模型和端到端在设备上的整合管线。
  • methods: 这个模型包括人体特征抽象、图像美学特征抽象和学习型融合网络,这些部分共同估计阶梯性情感和人类情感任务。
  • results: 在 EMOTIC 数据集上,提议的方法取得 27.17 的 Average Precision (AP) 分数(基线为 27.38),同时模型大小减少超过 85%;在设备端上 AP 为 26.42,模型大小减少超过 93%。
    Abstract In this paper, we propose a sentiment-enriched lightweight network SeLiNet and an end-to-end on-device pipeline for contextual emotion recognition in images. SeLiNet model consists of body feature extractor, image aesthetics feature extractor, and learning-based fusion network which jointly estimates discrete emotion and human sentiments tasks. On the EMOTIC dataset, the proposed approach achieves an Average Precision (AP) score of 27.17 in comparison to the baseline AP score of 27.38 while reducing the model size by >85%. In addition, we report an on-device AP score of 26.42 with reduction in model size by >93% when compared to the baseline.
    摘要 在这篇论文中,我们提出了一种具有情感增强的轻量级网络 SeLiNet,以及一个端到端的设备端管道,用于图像中的上下文情感识别。SeLiNet 模型包括人体特征提取器、图像美学特征提取器和基于学习的融合网络,三者共同完成离散情感与人类情绪估计任务。在 EMOTIC 数据集上,我们提出的方法取得 27.17 的 AP 分数(基线为 27.38),同时模型大小减少超过 85%。此外,我们还报告了设备端 AP 分数 26.42,与基线相比模型大小减少超过 93%。

CityTrack: Improving City-Scale Multi-Camera Multi-Target Tracking by Location-Aware Tracking and Box-Grained Matching

  • paper_url: http://arxiv.org/abs/2307.02753
  • repo_url: None
  • paper_authors: Jincheng Lu, Xipeng Yang, Jin Ye, Yifu Zhang, Zhikang Zou, Wei Zhang, Xiao Tan
  • For: The paper is written for the task of multi-camera multi-target tracking (MCMT) in urban traffic visual analysis, with the goal of overcoming the challenges posed by complex and dynamic urban traffic scenes.
  • Methods: The paper proposes a novel systematic MCMT framework called CityTrack, which integrates various advanced techniques to improve the effectiveness of the MCMT task, including a Location-Aware SCMT tracker and a novel Box-Grained Matching (BGM) method for the ICA module.
  • Results: The paper achieved an IDF1 of 84.91% on the public test set of the CityFlowV2 dataset, ranking 1st in the 2022 AI CITY CHALLENGE. The experimental results demonstrate the effectiveness of the proposed approach in overcoming the challenges of urban traffic scenes.
  • for: 这篇论文是为了解决城市流Visual分析中的多摄像头多目标跟踪问题,目标是在复杂和动态的城市交通场景下实现高效的多摄像头多目标跟踪。
  • methods: 论文提出了一种新的系统性MCMT框架,称为CityTrack,该框架集成了多种高级技术来提高MCMT任务的效果。这些技术包括一种位置意识SCMT跟踪器和一种novel的盒子粗粒匹配(BGM)方法 дляICA模块。
  • results: 论文在CityFlowV2数据集的公共测试集上 achievied an IDF1 of 84.91%, ranking 1st in the 2022 AI CITY CHALLENGE。实验结果表明提出的方法在城市交通场景下能够有效地解决多摄像头多目标跟踪问题。
    Abstract Multi-Camera Multi-Target Tracking (MCMT) is a computer vision technique that involves tracking multiple targets simultaneously across multiple cameras. MCMT in urban traffic visual analysis faces great challenges due to the complex and dynamic nature of urban traffic scenes, where multiple cameras with different views and perspectives are often used to cover a large city-scale area. Targets in urban traffic scenes often undergo occlusion, illumination changes, and perspective changes, making it difficult to associate targets across different cameras accurately. To overcome these challenges, we propose a novel systematic MCMT framework, called CityTrack. Specifically, we present a Location-Aware SCMT tracker which integrates various advanced techniques to improve its effectiveness in the MCMT task and propose a novel Box-Grained Matching (BGM) method for the ICA module to solve the aforementioned problems. We evaluated our approach on the public test set of the CityFlowV2 dataset and achieved an IDF1 of 84.91%, ranking 1st in the 2022 AI CITY CHALLENGE. Our experimental results demonstrate the effectiveness of our approach in overcoming the challenges posed by urban traffic scenes.
    摘要 多摄像头多目标跟踪(MCMT)是一种计算机视觉技术,它涉及同时跟踪多个目标在多个摄像头上。在城市交通视觉分析中,MCMT遇到了复杂和动态的城市交通场景,多个摄像头具有不同的视角和观点,用于覆盖大规模的城市区域。目标在城市交通场景中经常受到遮挡、照明变化和视角变化的影响,因此准确地相互关联目标在不同的摄像头上是一项具有挑战性的任务。为了解决这些挑战,我们提出了一种新的系统化的MCMT框架,称为CityTrack。具体来说,我们提出了一种位置意识的SCMT跟踪器,该跟踪器结合了多种进步技术来提高MCMT任务的效果。此外,我们还提出了一种新的盒子块匹配(BGM)方法,用于ICA模块解决上述问题。我们在CityFlowV2 dataset的公共测试集上进行了评估,并 achieved IDF1的84.91%,在2022年AI城市挑战中排名第一。我们的实验结果表明,我们的方法在城市交通场景中能够有效地解决MCMT任务中的挑战。
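
The paper's Box-Grained Matching is not detailed here, so the snippet below only sketches the generic inter-camera association (ICA) step it plugs into: tracklet appearance embeddings from two cameras are matched with the Hungarian algorithm over a cosine-distance cost, with a gating threshold. Feature extraction and the location-aware single-camera tracker are assumed to exist upstream.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracklets(feats_cam_a, feats_cam_b, max_cos_dist=0.4):
    """Match single-camera tracklets across two cameras by appearance.

    feats_cam_a, feats_cam_b : (Na, D) and (Nb, D) L2-normalised mean appearance
    embeddings of the tracklets. Returns a list of (i, j) matched index pairs.
    """
    cost = 1.0 - feats_cam_a @ feats_cam_b.T      # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)      # Hungarian assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cos_dist]
```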

Active Learning with Contrastive Pre-training for Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2307.02744
  • repo_url: https://github.com/shuvenduroy/activefer
  • paper_authors: Shuvendu Roy, Ali Etemad
  • for: 这篇论文旨在提出和研究一种active learning方法,以优化 facial expression recognition(FER)中的标注费用。
  • methods: 本论文使用了8种最新的active learning方法,并在FER13、RAF-DB和KDEF三个公共数据集上进行实验。
  • results: 研究发现现有的 active learning 方法在 FER 中表现不佳,可能是因为"冷启动"现象,即初始标注样本不足以代表整个数据集。为解决这个问题,我们提议先用对比式自监督预训练在整个无标注数据集上学习基础表示,然后再应用 active learning 方法。我们的两步方法比随机抽样提高了 9.2%,比最佳的现有 active learning 基线提高了 6.7%。
    Abstract Deep learning has played a significant role in the success of facial expression recognition (FER), thanks to large models and vast amounts of labelled data. However, obtaining labelled data requires a tremendous amount of human effort, time, and financial resources. Even though some prior works have focused on reducing the need for large amounts of labelled data using different unsupervised methods, another promising approach called active learning is barely explored in the context of FER. This approach involves selecting and labelling the most representative samples from an unlabelled set to make the best use of a limited 'labelling budget'. In this paper, we implement and study 8 recent active learning methods on three public FER datasets, FER13, RAF-DB, and KDEF. Our findings show that existing active learning methods do not perform well in the context of FER, likely suffering from a phenomenon called 'Cold Start', which occurs when the initial set of labelled samples is not well representative of the entire dataset. To address this issue, we propose contrastive self-supervised pre-training, which first learns the underlying representations based on the entire unlabelled dataset. We then follow this with the active learning methods and observe that our 2-step approach shows up to 9.2% improvement over random sampling and up to 6.7% improvement over the best existing active learning baseline without the pre-training. We will make the code for this study public upon publication at: github.com/ShuvenduRoy/ActiveFER.
    摘要 深度学习在人脸表情识别(FER)中发挥了重要作用,这得益于大型模型和大量标注数据。然而,获取标注数据需要大量的人工劳动、时间和资金。一些先前的工作尝试了不同的无监督方法来降低对大量标注数据的需求,但另一种有前景的方法,即主动学习(active learning),在 FER 领域几乎尚未被探索。该方法从无标注集合中选择并标注最具代表性的样本,以充分利用有限的"标注预算"。在这篇论文中,我们在三个公共 FER 数据集(FER13、RAF-DB 和 KDEF)上实现并研究了 8 种最新的主动学习方法。我们的发现表明,现有的主动学习方法在 FER 上表现不佳,这可能是由于"冷启动"现象,即初始标注样本不足以代表整个数据集。为解决这个问题,我们提出了对比式自监督预训练,先在整个无标注数据集上学习底层表示,再接着应用主动学习方法。我们发现,这种两步方法比随机抽样提高了 9.2%,比不使用预训练的最佳主动学习基线提高了 6.7%。代码将在发表时于 github.com/ShuvenduRoy/ActiveFER 公开。
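
A minimal sketch of the second stage of the two-step pipeline: after contrastive self-supervised pre-training, an acquisition function ranks the unlabeled pool and returns the samples to send for annotation. Entropy sampling is used here purely as an example acquisition function; the paper evaluates eight different strategies, and the `(image, index)` loader format is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_for_labelling(model, unlabeled_loader, budget):
    """Rank unlabeled samples by predictive entropy and return the indices of the
    `budget` most uncertain ones; the model is assumed to be initialised from
    contrastive self-supervised pre-training on the full unlabeled set."""
    model.eval()
    scores, indices = [], []
    for x, idx in unlabeled_loader:               # loader assumed to yield (image, index)
        probs = F.softmax(model(x), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        scores.append(entropy)
        indices.append(idx)
    scores = torch.cat(scores)
    indices = torch.cat(indices)
    top = scores.topk(budget).indices
    return indices[top].tolist()
```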

An Uncertainty Aided Framework for Learning based Liver $T_1ρ$ Mapping and Analysis

  • paper_url: http://arxiv.org/abs/2307.02736
  • repo_url: None
  • paper_authors: Chaoxing Huang, Vincent Wai Sun Wong, Queenie Chan, Winnie Chiu Wing Chu, Weitian Chen
  • for: The paper aims to develop a learning-based approach for accurate and reliable quantitative $T_1\rho$ imaging of the liver, which can aid in the assessment of biochemical alterations in liver pathologies.
  • methods: The proposed approach utilizes deep learning techniques to refine parametric maps and model the uncertainty of the predicted $T_1\rho$ values. The approach also employs a probabilistic framework to improve the mapping performance and remove unreliable pixels in the region of interest.
  • results: The proposed approach leads to a relative mapping error of less than 3% and provides uncertainty estimation simultaneously. The estimated uncertainty reflects the actual error level and can be used to further reduce the relative $T_1\rho$ mapping error to 2.60% and remove unreliable pixels in the region of interest.
  • for: 论文目的是开发一种基于学习的方法,以提供准确可靠的量化 T1ρ 成像,用于评估肝脏病变的生化变化。
  • methods: 提议的方法使用深度学习技术来改进参数图,并对预测的 T1ρ 值的不确定性进行建模。该方法还使用概率框架来提高映射性能,并去除感兴趣区域中不可靠的像素。
  • results: 提议的方法使相对映射误差低于 3%,并同时提供不确定性估计。估计的不确定性反映了实际误差水平,可用于进一步将相对 T1ρ 映射误差降至 2.60%,并有效地去除感兴趣区域中不可靠的像素。
    Abstract Objective: Quantitative $T_1\rho$ imaging has potential for assessment of biochemical alterations of liver pathologies. Deep learning methods have been employed to accelerate quantitative $T_1\rho$ imaging. To employ artificial intelligence-based quantitative imaging methods in complicated clinical environment, it is valuable to estimate the uncertainty of the predicated $T_1\rho$ values to provide the confidence level of the quantification results. The uncertainty should also be utilized to aid the post-hoc quantitative analysis and model learning tasks. Approach: To address this need, we propose a parametric map refinement approach for learning-based $T_1\rho$ mapping and train the model in a probabilistic way to model the uncertainty. We also propose to utilize the uncertainty map to spatially weight the training of an improved $T_1\rho$ mapping network to further improve the mapping performance and to remove pixels with unreliable $T_1\rho$ values in the region of interest. The framework was tested on a dataset of 51 patients with different liver fibrosis stages. Main results: Our results indicate that the learning-based map refinement method leads to a relative mapping error of less than 3% and provides uncertainty estimation simultaneously. The estimated uncertainty reflects the actual error level, and it can be used to further reduce relative $T_1\rho$ mapping error to 2.60% as well as removing unreliable pixels in the region of interest effectively. Significance: Our studies demonstrate the proposed approach has potential to provide a learning-based quantitative MRI system for trustworthy $T_1\rho$ mapping of the liver.
    摘要 目的:量化$T_1\rho$成像有潜力评估肝病变的生物化学变化。深度学习方法已经被应用于加速量化$T_1\rho$成像。在复杂的临床环境中使用人工智能基本图像评估方法,有利于估计预测的$T_1\rho$值的不确定性,以提供评估结果的信任度。方法:为了解决这个需求,我们提出了参数Map重定义方法,用于学习基于$T_1\rho$映射的不确定性模型,并在概率方式上训练模型。我们还提议使用不确定度地图来进行空间权重训练改进$T_1\rho$映射网络,以提高映射性能,并从区域 interesset中除掉不可靠的$T_1\rho$值。框架在51名患者不同的肝病变stage的数据集上进行测试。主要结果:我们的结果表明,学习基于映射的地图重定义方法可以实现相对的映射错误率低于3%,并同时提供不确定性估计。预测的不确定性反映实际错误水平,可以用于进一步降低相对$T_1\rho$映射错误率至2.60%,以及有效地从区域 interesset中除掉不可靠的$T_1\rho$值。意义:我们的研究表明,我们提出的方法有可能为肝脏的量化MRI系统提供可靠的$T_1\rho$映射。
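
One common way to realise "probabilistic training with uncertainty" and "uncertainty-weighted training" is a heteroscedastic Gaussian likelihood, sketched below; the exact losses and spatial weighting scheme used in the paper may differ.

```python
import torch

def gaussian_nll(pred_mean, pred_log_var, target):
    """Pixel-wise heteroscedastic Gaussian negative log-likelihood: the network
    predicts a T1rho mean and a log-variance per pixel, so minimising this loss
    both refines the map and calibrates an uncertainty estimate."""
    return 0.5 * (torch.exp(-pred_log_var) * (pred_mean - target) ** 2 + pred_log_var).mean()

def uncertainty_weighted_l1(pred, target, log_var):
    """Illustrative second-stage loss: pixels with low estimated uncertainty are
    weighted more heavily when training the improved mapping network."""
    weight = torch.exp(-log_var).detach()       # confidence ~ inverse variance
    weight = weight / (weight.mean() + 1e-12)   # normalise so the loss scale stays stable
    return (weight * (pred - target).abs()).mean()
```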

MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.02733
  • repo_url: None
  • paper_authors: Ruiyang Xia, Decheng Liu, Jie Li, Lin Yuan, Nannan Wang, Xinbo Gao
  • for: 这篇论文旨在应对伪造人脸图像,特别是伪造人脸图像的序列深伪检测(sequential deepfake detection)。
  • methods: 本论文提出了 Multi-Collaboration and Multi-Supervision Network (MMNet),可以处理伪造人脸图像中不同的空间尺度与时序排列组合,并且在恢复时不需要知道所用的伪造方法。
  • results: 实验结果显示,MMNet 可以实现最先进的检测性能和独立的恢复性能。
    Abstract Advanced manipulation techniques have provided criminals with opportunities to make social panic or gain illicit profits through the generation of deceptive media, such as forged face images. In response, various deepfake detection methods have been proposed to assess image authenticity. Sequential deepfake detection, which is an extension of deepfake detection, aims to identify forged facial regions with the correct sequence for recovery. Nonetheless, due to the different combinations of spatial and sequential manipulations, forged face images exhibit substantial discrepancies that severely impact detection performance. Additionally, the recovery of forged images requires knowledge of the manipulation model to implement inverse transformations, which is difficult to ascertain as relevant techniques are often concealed by attackers. To address these issues, we propose Multi-Collaboration and Multi-Supervision Network (MMNet) that handles various spatial scales and sequential permutations in forged face images and achieve recovery without requiring knowledge of the corresponding manipulation method. Furthermore, existing evaluation metrics only consider detection accuracy at a single inferring step, without accounting for the matching degree with ground-truth under continuous multiple steps. To overcome this limitation, we propose a novel evaluation metric called Complete Sequence Matching (CSM), which considers the detection accuracy at multiple inferring steps, reflecting the ability to detect integrally forged sequences. Extensive experiments on several typical datasets demonstrate that MMNet achieves state-of-the-art detection performance and independent recovery performance.
    摘要 高级操作技术为犯罪分子提供了创造社会恐慌或获得违法利润的机会,通过生成假面像。为了评估图像真实性,有多种深伪检测方法被提议。序列深伪检测是深伪检测的扩展,旨在识别假面像中的假区域,并且可以进行回归。然而,由于不同的空间和时序排序杂化,假面像中的假区域具有重大差异,这会严重影响检测性能。此外,回归假面像需要了解攻击者使用的杂化模型,这是困难的获取,因为攻击者通常会隐藏自己的技巧。为解决这些问题,我们提出了多方合作多级监督网络(MMNet),可以处理不同的空间级别和时序排序,并且可以实现不需要攻击者的杂化模型回归。此外,现有的评估指标仅考虑单个推理步骤的检测精度,而不考虑连续多个步骤中的匹配度。为了超越这些限制,我们提出了一个新的评估指标called完整序列匹配(CSM),可以考虑连续多个步骤中的检测精度,更好地反映检测完整性。广泛的实验表明,MMNet在多种典型数据集上达到了当前最佳的检测性能和独立回归性能。
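
A plausible minimal implementation of the Complete Sequence Matching idea, in which a sample only counts as correct when its entire predicted manipulation sequence matches the ground truth in order; the paper's formal definition may differ, for example by scoring multiple inferring steps separately.

```python
def complete_sequence_matching(pred_seqs, gt_seqs):
    """Fraction of samples whose predicted manipulation sequence exactly matches
    the ground-truth sequence (order included)."""
    assert len(pred_seqs) == len(gt_seqs)
    exact = sum(1 for p, g in zip(pred_seqs, gt_seqs) if list(p) == list(g))
    return exact / max(len(gt_seqs), 1)

# Example: only the first sample matches its full ground-truth sequence.
print(complete_sequence_matching([["eyes", "nose"], ["lips"]],
                                 [["eyes", "nose"], ["lips", "hair"]]))  # 0.5
```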

Applying a Color Palette with Local Control using Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.02698
  • repo_url: None
  • paper_authors: Vaibhav Vavilala, David Forsyth
  • for: 这两种新编辑方法适用于幻想卡牌艺术中。
  • methods: 这些方法包括调色板转移(palette transfer)和分段控制(segment control)。调色板转移将指定的参考调色板应用到给定卡牌中,而分段控制允许艺术家移动一个或多个图像分段,并可选择指定结果中某些分段的颜色。
  • results: 这两种编辑方法的组合可以构成有价值的工作流程,例如:先移动一个分段,然后重新上色;或者先上色,再强制某些分段采用指定的颜色。这些方法在具有挑战性的 Yu-Gi-Oh 卡牌艺术数据集上得到了成功应用。
    Abstract We demonstrate two novel editing procedures in the context of fantasy card art. Palette transfer applies a specified reference palette to a given card. For fantasy art, the desired change in palette can be very large, leading to huge changes in the "look" of the art. We demonstrate that a pipeline of vector quantization; matching; and "vector dequantization" (using a diffusion model) produces successful extreme palette transfers. Segment control allows an artist to move one or more image segments, and to optionally specify the desired color of the result. The combination of these two types of edit yields valuable workflows, including: move a segment, then recolor; recolor, then force some segments to take a prescribed color. We demonstrate our methods on the challenging Yu-Gi-Oh card art dataset.
    摘要 我们在魔幻卡牌艺术的场景下展示了两种新的编辑方法。调色板转移将指定的参考调色板应用到给定卡牌中。对于魔幻艺术,所需的调色板变化可能非常大,导致画面整体"观感"发生巨大改变。我们表明,由向量量化、匹配和"向量反量化"(使用扩散模型)组成的流程能够成功实现这种极端的调色板转移。分段控制允许艺术家移动一个或多个图像分段,并可选择指定结果的颜色。这两类编辑的组合构成了有价值的工作流程,例如:先移动一个分段再重新上色;或先上色,再强制某些分段采用指定的颜色。我们在具有挑战性的 Yu-Gi-Oh 卡牌数据集上展示了我们的方法。
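
For intuition, here is a naive sketch of the "vector quantization + matching" front half of the palette-transfer pipeline using k-means; the diffusion-based "vector dequantization" that restores realistic detail is the paper's contribution and is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def naive_palette_transfer(image_rgb, reference_palette, n_colors=8):
    """Quantise the card into n_colors clusters, then map each cluster centroid to
    its nearest colour in the reference palette and recolour the image.

    image_rgb         : (H, W, 3) uint8 array
    reference_palette : (P, 3) array of target colours (an assumed input format)
    """
    h, w, _ = image_rgb.shape
    pixels = image_rgb.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)

    # Nearest reference colour for each cluster centroid
    dists = np.linalg.norm(km.cluster_centers_[:, None, :] - reference_palette[None, :, :], axis=2)
    mapped = reference_palette[dists.argmin(axis=1)]
    return mapped[km.labels_].reshape(h, w, 3).astype(np.uint8)
```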

A Study on the Impact of Face Image Quality on Face Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2307.02679
  • repo_url: None
  • paper_authors: Na Zhang
  • for: 本研究探讨了深度学习在人脸认知中是否能够快速和高效地识别低质量人脸图像。
  • methods: 本研究使用了多种深度学习方法来处理不同质量的人脸图像,并对这些图像进行了分类和比较。
  • results: 研究发现,深度学习方法在识别低质量人脸图像时存在一定的挑战,而人类的识别能力在不同质量的人脸图像之间具有更高的灵活性和可靠性。
    Abstract Deep learning has received increasing interests in face recognition recently. Large quantities of deep learning methods have been proposed to handle various problems appeared in face recognition. Quite a lot deep methods claimed that they have gained or even surpassed human-level face verification performance in certain databases. As we know, face image quality poses a great challenge to traditional face recognition methods, e.g. model-driven methods with hand-crafted features. However, a little research focus on the impact of face image quality on deep learning methods, and even human performance. Therefore, we raise a question: Is face image quality still one of the challenges for deep learning based face recognition, especially in unconstrained condition. Based on this, we further investigate this problem on human level. In this paper, we partition face images into three different quality sets to evaluate the performance of deep learning methods on cross-quality face images in the wild, and then design a human face verification experiment on these cross-quality data. The result indicates that quality issue still needs to be studied thoroughly in deep learning, human own better capability in building the relations between different face images with large quality gaps, and saying deep learning method surpasses human-level is too optimistic.
    摘要 深度学习近来在人脸识别领域受到越来越多的关注。大量深度学习方法被提出,用于解决人脸识别中出现的各种问题,其中许多方法声称在某些数据库上达到甚至超越了人类水平的人脸验证性能。我们知道,人脸图像质量对传统的人脸识别方法(例如基于手工特征的模型驱动方法)构成了很大的挑战。然而,只有少量研究关注人脸图像质量对深度学习方法乃至人类表现的影响。因此,我们提出一个问题:在无约束条件下,人脸图像质量是否仍然是基于深度学习的人脸识别所面临的挑战之一。基于此,我们进一步在人类层面上调查这个问题。在本文中,我们将真实场景中的人脸图像划分为三个不同的质量集合,以评估深度学习方法在跨质量人脸图像上的表现,并在这些跨质量数据上设计了人类人脸验证实验。结果表明,质量问题仍需在深度学习中得到深入研究;在建立质量差距较大的不同人脸图像之间的关联方面,人类具有更强的能力;而宣称深度学习方法超越人类水平则过于乐观。

GIT: Detecting Uncertainty, Out-Of-Distribution and Adversarial Samples using Gradients and Invariance Transformations

  • paper_url: http://arxiv.org/abs/2307.02672
  • repo_url: None
  • paper_authors: Julia Lust, Alexandru P. Condurache
  • for: 提高深度神经网络的预测准确率和检测误差
  • methods: combining gradient information and invariance transformations to detect generalization errors
  • results: 在多种网络架构、问题设置和扰动类型上实现了比state-of-the-art更高的检测性能
    Abstract Deep neural networks tend to make overconfident predictions and often require additional detectors for misclassifications, particularly for safety-critical applications. Existing detection methods usually only focus on adversarial attacks or out-of-distribution samples as reasons for false predictions. However, generalization errors occur due to diverse reasons often related to poorly learning relevant invariances. We therefore propose GIT, a holistic approach for the detection of generalization errors that combines the usage of gradient information and invariance transformations. The invariance transformations are designed to shift misclassified samples back into the generalization area of the neural network, while the gradient information measures the contradiction between the initial prediction and the corresponding inherent computations of the neural network using the transformed sample. Our experiments demonstrate the superior performance of GIT compared to the state-of-the-art on a variety of network architectures, problem setups and perturbation types.
    摘要 深度神经网络往往做出过度自信的预测,因此常常需要额外的检测器来发现错误分类,特别是在安全关键的应用场景下。现有的检测方法通常只把对抗攻击或分布外样本视为错误预测的原因。然而,泛化错误的成因多种多样,往往与未能充分学习相关不变性有关。为此,我们提出了 GIT,一种综合性的泛化错误检测方法,结合了梯度信息与不变性变换。不变性变换旨在将错误分类的样本拉回神经网络的泛化区域,而梯度信息则度量初始预测与神经网络在变换后样本上的内在计算之间的矛盾。我们的实验表明,GIT 在多种网络架构、问题设置和扰动类型上均优于当前最先进的检测方法。
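
A hedged sketch of how gradient information and an invariance transformation can be combined into a per-sample error-detection score; GIT's actual score construction is more elaborate, and the flip transformation shown here is only an example.

```python
import torch
import torch.nn.functional as F

def git_style_score(model, x, transform):
    """Treat the prediction on the original input as a pseudo-label, re-run the
    network on an invariance-transformed copy, and use the gradient norm of the
    resulting cross-entropy (w.r.t. the transformed input) as a detection score.
    Large norms indicate a contradiction between the two computations."""
    model.eval()
    with torch.no_grad():
        pseudo_label = model(x).argmax(dim=1)

    x_t = transform(x).clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_t), pseudo_label)
    grad, = torch.autograd.grad(loss, x_t)
    return grad.flatten(1).norm(dim=1)            # one score per sample

# Example invariance transformation: horizontal flip.
# scores = git_style_score(net, batch, lambda t: torch.flip(t, dims=[-1]))
```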

Spherical Feature Pyramid Networks For Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.02658
  • repo_url: None
  • paper_authors: Thomas Walker, Varun Anand, Pavlos Andreadis
  • for: 这个论文主要目标是解决圆形数据semantic segmentation问题,因为传统的平面approaches需要将圆形图像投影到平面上,这会导致网络性能下降。
  • methods: 这篇论文使用了图形基于方法,表示信号在圆形网格上,以解决上述挑战。然而,当前的圆形分割方法仅使用了各种UNet架构的变体,而尚未explore更成功的平面架构。本文提出了基于圆形CNN的圆形Feature Pyramid Network(FPN)模型,以提高分割性能。
  • results: 在Standford 2D-3D-S数据集上,我们的模型实现了state-of-the-art表现,具有48.75的mIOU,比前一个最佳圆形CNN的表现提高3.75个IOU点。
    Abstract Semantic segmentation for spherical data is a challenging problem in machine learning since conventional planar approaches require projecting the spherical image to the Euclidean plane. Representing the signal on a fundamentally different topology introduces edges and distortions which impact network performance. Recently, graph-based approaches have bypassed these challenges to attain significant improvements by representing the signal on a spherical mesh. Current approaches to spherical segmentation exclusively use variants of the UNet architecture, meaning more successful planar architectures remain unexplored. Inspired by the success of feature pyramid networks (FPNs) in planar image segmentation, we leverage the pyramidal hierarchy of graph-based spherical CNNs to design spherical FPNs. Our spherical FPN models show consistent improvements over spherical UNets, whilst using fewer parameters. On the Stanford 2D-3D-S dataset, our models achieve state-of-the-art performance with an mIOU of 48.75, an improvement of 3.75 IoU points over the previous best spherical CNN.
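
For readers unfamiliar with the FPN decoder being transferred to the sphere, the snippet below shows the planar analogue (1x1 lateral connections plus a top-down pathway); in the spherical model these operations act on an icosahedral mesh hierarchy via spherical convolutions rather than on a planar grid.

```python
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Planar analogue of the feature-pyramid decoder: lateral 1x1 convolutions
    plus a top-down pathway with upsampling and summation."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                       # feats ordered fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down accumulation
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [conv(p) for conv, p in zip(self.smooth, laterals)]

# Example: three backbone feature maps at strides 8/16/32.
# p3, p4, p5 = MiniFPN()([c3, c4, c5])
```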

Active Class Selection for Few-Shot Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2307.02641
  • repo_url: https://github.com/chrismcclurg/fscil-acs
  • paper_authors: Christopher McClurg, Ali Ayub, Harsh Tyagi, Sarah M. Rajtmajer, Alan R. Wagner
  • for: 本研究旨在帮助实际应用中的机器人在有限的互动中不断学习环境中的新对象。
  • methods: 本研究结合了几何学计算和可活动选择技术,开发了一种名为FIASco模型,可以让机器人通过尝试少量的对象标注来不断学习新对象。
  • results: 实验结果表明,FIASco模型在实际应用中可以有效地帮助机器人不断学习新对象,并且可以在有限的互动中实现长期的学习和应用。
    Abstract For real-world applications, robots will need to continually learn in their environments through limited interactions with their users. Toward this, previous works in few-shot class incremental learning (FSCIL) and active class selection (ACS) have achieved promising results but were tested in constrained setups. Therefore, in this paper, we combine ideas from FSCIL and ACS to develop a novel framework that can allow an autonomous agent to continually learn new objects by asking its users to label only a few of the most informative objects in the environment. To this end, we build on a state-of-the-art (SOTA) FSCIL model and extend it with techniques from ACS literature. We term this model Few-shot Incremental Active class SeleCtiOn (FIASco). We further integrate a potential field-based navigation technique with our model to develop a complete framework that can allow an agent to process and reason on its sensory data through the FIASco model, navigate towards the most informative object in the environment, gather data about the object through its sensors and incrementally update the FIASco model. Experimental results on a simulated agent and a real robot show the significance of our approach for long-term real-world robotics applications.
    摘要 为了实际应用,机器人需要通过与用户的有限互动,在其所处环境中不断学习。为此,先前的小样本类增量学习(FSCIL)和主动类别选择(ACS)研究已经取得了有希望的结果,但这些研究都是在受限的设置下进行的。因此,在这篇论文中,我们结合 FSCIL 和 ACS 的思想,开发了一种新框架,使自主智能体只需请用户标注环境中最具信息量的少数对象,即可持续学习新物体。为此,我们在当前最佳(SOTA)的 FSCIL 模型基础上,引入 ACS 文献中的技术进行扩展,并将该模型命名为 FIASco(Few-shot Incremental Active class SeleCtiOn)。此外,我们还将基于势场的导航技术与该模型结合,构建了一个完整的框架,使智能体能够通过 FIASco 模型处理并推理其感知数据,导航至环境中最具信息量的对象,利用传感器收集该对象的数据,并增量更新 FIASco 模型。在仿真智能体和真实机器人上的实验结果表明,我们的方法对长期真实世界机器人应用具有重要意义。
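
A toy illustration of an active-class-selection rule in the spirit described above: the agent queries the class it is currently weakest on, breaking ties toward classes whose prototypes are most confusable. The scoring formula and its weights are assumptions, not FIASco's actual criterion.

```python
import numpy as np

def select_class_to_query(class_names, proto_features, recent_accuracy):
    """Pick the class whose objects the robot should ask the user to label next.

    class_names     : list of known class names
    proto_features  : (C, D) class prototype embeddings from the FSCIL model
    recent_accuracy : per-class accuracy on recent interactions (list of length C)
    """
    protos = proto_features / (np.linalg.norm(proto_features, axis=1, keepdims=True) + 1e-12)
    sim = protos @ protos.T
    np.fill_diagonal(sim, -np.inf)
    confusability = sim.max(axis=1)                 # similarity to the nearest other class
    score = (1.0 - np.asarray(recent_accuracy)) + 0.5 * confusability
    return class_names[int(np.argmax(score))]
```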

Retinex-based Image Denoising / Contrast Enhancement using Gradient Graph Laplacian Regularizer

  • paper_url: http://arxiv.org/abs/2307.02625
  • repo_url: None
  • paper_authors: Yeganeh Gharedaghi, Gene Cheung, Xianming Liu
  • for: 提高低光照图像质量
  • methods: 使用图гра核方法进行锐化和对比度提高
  • results: 实现了竞争力强的视觉图像质量,同时降低计算复杂度。
    Abstract Images captured in poorly lit conditions are often corrupted by acquisition noise. Leveraging recent advances in graph-based regularization, we propose a fast Retinex-based restoration scheme that denoises and contrast-enhances an image. Specifically, by Retinex theory we first assume that each image pixel is a multiplication of its reflectance and illumination components. We next assume that the reflectance and illumination components are piecewise constant (PWC) and continuous piecewise planar (PWP) signals, which can be recovered via graph Laplacian regularizer (GLR) and gradient graph Laplacian regularizer (GGLR) respectively. We formulate quadratic objectives regularized by GLR and GGLR, which are minimized alternately until convergence by solving linear systems -- with improved condition numbers via proposed preconditioners -- via conjugate gradient (CG) efficiently. Experimental results show that our algorithm achieves competitive visual image quality while reducing computation complexity noticeably.
    摘要 低光照条件下捕捉的图像经常受到获取噪声的损害。我们提出了一种基于图格 regularization的快速 Retinex 修复方案,可以减少图像噪声并提高对比度。Specifically,我们首先根据 Retinex 理论,假设每个图像像素是其反射率和照明组件的乘积。我们接着假设反射率和照明组件是连续piecewise planar(PWP)和piecewise constant(PWC)信号,可以通过图 Laplacian regularizer(GLR)和梯度图 Laplacian regularizer(GGLR)分别recover。我们形式了quadratic objective function regularized by GLR和GGLR,并通过 conjugate gradient(CG)高效地解决线性系统,以达到很好的条件数。实验结果表明,我们的算法可以在computation complexity下降 noticeably的情况下,达到竞争性的视觉图像质量。
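
The core computational step, minimising a quadratic objective with a graph Laplacian regularizer via conjugate gradient, can be sketched as follows for a single channel; the full method additionally splits the image into reflectance and illumination components and uses a gradient-graph (GGLR) term with the proposed preconditioners.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def glr_denoise(y, adjacency, mu=2.0):
    """Minimise ||x - y||^2 + mu * x^T L x, where L = D - W is the combinatorial
    graph Laplacian of a pixel graph with sparse symmetric edge weights `adjacency`
    (e.g., a 4-connected graph with photometric weights, assumed given). The
    optimum solves (I + mu L) x = y, handled here with conjugate gradient."""
    n = y.size
    W = adjacency
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W
    A = sp.identity(n) + mu * L
    x, _ = cg(A, y.ravel(), maxiter=200)
    return x.reshape(y.shape)
```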

MRecGen: Multimodal Appropriate Reaction Generator

  • paper_url: http://arxiv.org/abs/2307.02609
  • repo_url: None
  • paper_authors: Jiaqi Xu, Cheng Luo, Weicheng Xie, Linlin Shen, Xiaofeng Liu, Lu Liu, Hatice Gunes, Siyang Song
  • for: 本文提出了一种多Modal和多种应ropriate人类反应生成框架,用于生成适用于不同人类行为的真实和生动的人类式反应(通过同步的文本、音频和视频流),以应对用户行为。
  • methods: 本文提出了一种基于深度学习的多模态和多种应ropriate人类反应生成方法,通过对大量的人类反应数据进行学习和分类,生成适用于不同人类行为的真实和生动的人类式反应。
  • results: 本文的实验结果表明,该方法可以生成高质量的人类式反应,并且可以在不同的人类行为场景下应用。具体来说,通过对用户行为进行分类,生成适应的人类式反应,以提高人类与计算机之间的交互体验。
    Abstract Verbal and non-verbal human reaction generation is a challenging task, as different reactions could be appropriate for responding to the same behaviour. This paper proposes the first multiple and multimodal (verbal and nonverbal) appropriate human reaction generation framework that can generate appropriate and realistic human-style reactions (displayed in the form of synchronised text, audio and video streams) in response to an input user behaviour. This novel technique can be applied to various human-computer interaction scenarios by generating appropriate virtual agent/robot behaviours. Our demo is available at \url{https://github.com/SSYSteve/MRecGen}.
    摘要 文本和非文本人类反应生成是一项复杂的任务,因为不同的反应可能适用于回应同一种行为。这篇论文提出了首个多模式和多媒体(文本和非文本)适当人类反应生成框架,可以在输入用户行为的基础上生成适当和现实的人类样式反应(表示为同步文本、音频和视频流),并可以应用于不同的人机交互场景中。我们的demo可以在 \url{https://github.com/SSYSteve/MRecGen} 中找到。

GNEP Based Dynamic Segmentation and Motion Estimation for Neuromorphic Imaging

  • paper_url: http://arxiv.org/abs/2307.02595
  • repo_url: None
  • paper_authors: Harbir Antil, David Sayre
  • for: 这篇论文探讨了基于事件驱动摄像头的图像分割和运动估计领域的应用。
  • methods: 该论文提出了基于Generalized Nash Equilibrium的框架,利用事件流中的时间和空间信息来进行分割和速度估计。
  • results: 该论文通过一系列实验证明了这种方法的有效性。
    Abstract This paper explores the application of event-based cameras in the domains of image segmentation and motion estimation. These cameras offer a groundbreaking technology by capturing visual information as a continuous stream of asynchronous events, departing from the conventional frame-based image acquisition. We introduce a Generalized Nash Equilibrium based framework that leverages the temporal and spatial information derived from the event stream to carry out segmentation and velocity estimation. To establish the theoretical foundations, we derive an existence criteria and propose a multi-level optimization method for calculating equilibrium. The efficacy of this approach is shown through a series of experiments.

Mainline Automatic Train Horn and Brake Performance Metric

  • paper_url: http://arxiv.org/abs/2307.02586
  • repo_url: None
  • paper_authors: Rustam Tagiew
  • for: 这篇论文提出了一种基于主要铁路的执行指标 для替换式驾驶系统。
  • methods: 该论文提出了一种预liminary的障碍探测子指标,并且不知道任何其他类似的提议。
  • results: 该论文预计将为执行系统比较和运营设计域中的障碍探测系统提供一个标准化的预测数量。
    Abstract This paper argues for the introduction of a mainline rail-oriented performance metric for driver-replacing on-board perception systems. Perception at the head of a train is divided into several subfunctions. This article presents a preliminary submetric for the obstacle detection subfunction. To the best of the author's knowledge, no other such proposal for obstacle detection exists. A set of submetrics for the subfunctions should facilitate the comparison of perception systems among each other and guide the measurement of human driver performance. It should also be useful for a standardized prediction of the number of accidents for a given perception system in a given operational design domain. In particular, for the proposal of the obstacle detection submetric, the professional readership is invited to provide their feedback and quantitative information to the author. The analysis results of the feedback will be published separately later.
    摘要 这篇论文提出了引入主线铁路 Orientated 性能指标,用于评估驾驶员替换车载感知系统。感知系统中的各个子函数被分解为多个子指标。本文提出了一个初步的障碍探测子指标。作者知道的范围内没有其他类似的提议。一组子指标可以方便各个感知系统之间的比较,并且可以指导测量人类驾驶员的性能。此外,它还可以用于预测给定感知系统在给定操作设计域中的事故数量。特别是对于障碍探测子指标的建议,专业读者被邀请提供反馈和量化信息给作者。作者将分析反馈的结果,并将在后续发表。

Semi-supervised Learning from Street-View Images and OpenStreetMap for Automatic Building Height Estimation

  • paper_url: http://arxiv.org/abs/2307.02574
  • repo_url: https://github.com/bobleegogogo/building_height
  • paper_authors: Hao Li, Zhendong Yuan, Gabriel Dax, Gefei Kong, Hongchao Fan, Alexander Zipf, Martin Werner
  • for: 本研究旨在提供一种自动化建筑高度估算方法,以便从低成本的 voluntary geographical information (VGI) 数据中生成低成本的三维城市模型。
  • methods: 本研究使用 semi-supervised learning (SSL) 方法,并利用 Mapillary 街景图像和 OpenStreetMap (OSM) 数据进行自动化建筑高度估算。SSL 方法包括三个部分:首先,我们提出了一种 SSL 结构,其中可以在激活批处理中设置不同的 “pseudo label” 比率;其次,我们从 OSM 数据中提取多层次形态特征,以便从建筑高度中推断建筑高度;最后,我们设计了一个建筑层数估算工作流程,利用预训练的 Facade 对象检测网络来生成 “pseudo label” 从 SVI 数据中,并将其分配给 OSM 建筑脚印。
  • results: 在海德堡市的案例研究中,我们验证了所提出的 SSL 方法,并将模型性能与建筑高度参考数据进行了对比评估。结果表明,在 Random Forest (RF)、Support Vector Machine (SVM) 和 Convolutional Neural Network (CNN) 三种回归模型上,SSL 方法都能为建筑高度估算带来明显的性能提升,MAE 约为 2.1 米,与现有方法相当。这一初步结果令人鼓舞,激励我们未来基于低成本 VGI 数据、在不同区域和数据质量条件下继续扩展该方法。
    Abstract Accurate building height estimation is key to the automatic derivation of 3D city models from emerging big geospatial data, including Volunteered Geographical Information (VGI). However, an automatic solution for large-scale building height estimation based on low-cost VGI data is currently missing. The fast development of VGI data platforms, especially OpenStreetMap (OSM) and crowdsourced street-view images (SVI), offers a stimulating opportunity to fill this research gap. In this work, we propose a semi-supervised learning (SSL) method of automatically estimating building height from Mapillary SVI and OSM data to generate low-cost and open-source 3D city modeling in LoD1. The proposed method consists of three parts: first, we propose an SSL schema with the option of setting a different ratio of "pseudo label" during the supervised regression; second, we extract multi-level morphometric features from OSM data (i.e., buildings and streets) for the purposed of inferring building height; last, we design a building floor estimation workflow with a pre-trained facade object detection network to generate "pseudo label" from SVI and assign it to the corresponding OSM building footprint. In a case study, we validate the proposed SSL method in the city of Heidelberg, Germany and evaluate the model performance against the reference data of building heights. Based on three different regression models, namely Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), the SSL method leads to a clear performance boosting in estimating building heights with a Mean Absolute Error (MAE) around 2.1 meters, which is competitive to state-of-the-art approaches. The preliminary result is promising and motivates our future work in scaling up the proposed method based on low-cost VGI data, with possibilities in even regions and areas with diverse data quality and availability.
    摘要 准确的建筑高度估算是三维城市模型自动生成的关键,特别是从低成本的地理空间数据(VGI)中获取数据。然而,基于低成本VGI数据的大规模建筑高度估算方法目前尚缺乏。随着VGI数据平台的快速发展,特别是OpenStreetMap(OSM)和来自公众参与的街景图像(SVI),我们有机会填补这个研究漏洞。在这项工作中,我们提出了一种半监督学习(SSL)方法,通过使用Mapillary SVI和OSM数据自动计算建筑高度,以生成低成本和开源的三维城市模型。该方法包括三部分:1. 我们提出了一种SSL结构,其中可以在批量回归中设置不同的“ Pseudo Label”比率。2. 我们从OSM数据中提取了多级形态特征,以便从建筑高度进行推断。3. 我们设计了一个建筑层数估算工作流程,使用预训练的外墙对象检测网络来生成“ Pseudo Label”从SVI中,并将其分配到相应的OSM建筑基eline。在海德堡市(Heidelberg)的 caso study中,我们验证了我们的SSL方法,并评估其性能与参照数据中的建筑高度相比。基于Random Forest(RF)、Support Vector Machine(SVM)和Convolutional Neural Network(CNN)三种不同的回归模型,SSL方法在估算建筑高度方面具有明显的性能提升,MAE约为2.1米,与现有方法相当竞争。这些初步结果启发我们未来在基于低成本VGI数据的大规模建筑高度估算方法上进行进一步研究,包括在不同的数据质量和可用性下进行扩展。
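
A minimal sketch of the pseudo-label mixing idea in the SSL schema: a chosen ratio of SVI-derived pseudo labels is blended into the supervised regression over OSM morphometric features. The random-forest regressor, the mixing rule, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_ssl_height_regressor(X_labeled, y_surveyed, X_pseudo, y_pseudo, pseudo_ratio=0.5):
    """Train a building-height regressor on OSM morphometric features.

    X_labeled, y_surveyed : features and reference heights for annotated buildings
    X_pseudo, y_pseudo    : features and pseudo heights derived from the facade
                            detector's floor count (e.g., floors x an assumed storey height)
    pseudo_ratio          : how many pseudo-labelled samples to mix in, relative
                            to the labelled set size
    """
    n_pseudo = int(pseudo_ratio * len(X_labeled))
    rng = np.random.default_rng(0)
    pick = rng.choice(len(X_pseudo), size=min(n_pseudo, len(X_pseudo)), replace=False)
    X = np.vstack([X_labeled, X_pseudo[pick]])
    y = np.concatenate([y_surveyed, y_pseudo[pick]])
    return RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
```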

A Dataset of Inertial Measurement Units for Handwritten English Alphabets

  • paper_url: http://arxiv.org/abs/2307.02480
  • repo_url: None
  • paper_authors: Hari Prabhat Gupta, Rahul Mishra
  • for: 这个研究是为了提高英文手写识别精度,特别是在印度的多元文化和语言背景下。
  • methods: 这个研究使用了各种测量工具和方法,包括传感器测量单元(IMU)、离散感知和机器学习等,以捕捉手写动作的动态模式,提高英文手写识别精度。
  • results: 根据实验结果显示,这个dataset和收集系统可以实现高精度的英文手写识别,特别是在印度的多元文化和语言背景下。
    Abstract This paper presents an end-to-end methodology for collecting datasets to recognize handwritten English alphabets by utilizing Inertial Measurement Units (IMUs) and leveraging the diversity present in the Indian writing style. The IMUs are utilized to capture the dynamic movement patterns associated with handwriting, enabling more accurate recognition of alphabets. The Indian context introduces various challenges due to the heterogeneity in writing styles across different regions and languages. By leveraging this diversity, the collected dataset and the collection system aim to achieve higher recognition accuracy. Some preliminary experimental results demonstrate the effectiveness of the dataset in accurately recognizing handwritten English alphabet in the Indian context. This research can be extended and contributes to the field of pattern recognition and offers valuable insights for developing improved systems for handwriting recognition, particularly in diverse linguistic and cultural contexts.
    摘要 这篇论文提出了一种综合方法,通过使用惯性测量单元(IMU),收集手写英文字母的数据集,并利用印度写作风格的多样性,提高手写识别的准确率。印度的文化和语言多样性会导致写作风格的差异,但是收集的数据集和收集系统尝试达到更高的识别率。初步的实验结果表明,这些数据集可以准确地识别印度上手写英文字母。这些研究可以进一步推广,对手写识别技术的发展产生有价值的影响,特别是在多元语言和文化背景下。

Large-scale Detection of Marine Debris in Coastal Areas with Sentinel-2

  • paper_url: http://arxiv.org/abs/2307.02465
  • repo_url: https://github.com/marccoru/marinedebrisdetector
  • paper_authors: Marc Rußwurm, Sushen Jilla Venkatesa, Devis Tuia
  • for: This paper aims to detect and quantify marine pollution and macro-plastics using remote sensing technology.
  • methods: The paper uses a deep segmentation model to detect marine debris in coastal areas, leveraging medium-resolution satellite data. The model is trained on a combination of annotated datasets of marine debris and evaluated on specifically selected test sites.
  • results: The paper demonstrates that the deep learning model outperforms existing detection models by a large margin, due to the particular dataset design with extensive sampling of negative examples and label refinements. The results show the potential for large-scale automated detection of marine debris, which can help quantify and monitor marine litter with remote sensing at global scales.
    Abstract Detecting and quantifying marine pollution and macro-plastics is an increasingly pressing ecological issue that directly impacts ecology and human health. Efforts to quantify marine pollution are often conducted with sparse and expensive beach surveys, which are difficult to conduct on a large scale. Here, remote sensing can provide reliable estimates of plastic pollution by regularly monitoring and detecting marine debris in coastal areas. Medium-resolution satellite data of coastal areas is readily available and can be leveraged to detect aggregations of marine debris containing plastic litter. In this work, we present a detector for marine debris built on a deep segmentation model that outputs a probability for marine debris at the pixel level. We train this detector with a combination of annotated datasets of marine debris and evaluate it on specifically selected test sites where it is highly probable that plastic pollution is present in the detected marine debris. We demonstrate quantitatively and qualitatively that a deep learning model trained on this dataset issued from multiple sources outperforms existing detection models trained on previous datasets by a large margin. Our experiments show, consistent with the principles of data-centric AI, that this performance is due to our particular dataset design with extensive sampling of negative examples and label refinements rather than depending on the particular deep learning model. We hope to accelerate advances in the large-scale automated detection of marine debris, which is a step towards quantifying and monitoring marine litter with remote sensing at global scales, and release the model weights and training source code under https://github.com/marccoru/marinedebrisdetector
    摘要 检测和评估海洋污染和巨型塑料是当前生态环境问题的一个不断增长的焦点,直接影响生物多样性和人类健康。尝试量化海洋污染的努力通常采用罕见和昂贵的海滩调查,这些调查难以在大规模上进行。在这里,Remote sensing可以提供可靠的塑料污染估计,通过定期监测和检测海洋垃圾在沿海区域。我们提出了一种基于深度分割模型的海洋垃圾检测器,该模型可以在像素级输出marine debris的投票 probabilities。我们使用了多种源的注解数据集来训练这个检测器,并在特定的测试地点进行评估,这些测试地点高度可能存在塑料污染。我们的实验表明,使用这种数据集和深度学习模型,我们可以在跨源数据集上超越现有的检测模型,提高检测精度。我们的实验还表明,这种性能是基于我们特有的数据集设计,包括广泛的负例抽象和标签精细化,而不是基于特定的深度学习模型。我们希望通过大规模自动检测海洋垃圾,降低海洋污染的评估难度,并在全球范围内追踪塑料污染,并将模型权重和训练源代码发布在https://github.com/marccoru/marinedebrisdetector。

AxonCallosumEM Dataset: Axon Semantic Segmentation of Whole Corpus Callosum cross section from EM Images

  • paper_url: http://arxiv.org/abs/2307.02464
  • repo_url: None
  • paper_authors: Ao Cheng, Guoqiang Zhao, Lirong Wang, Ruobing Zhang
  • for: The paper is written for the purpose of introducing a new dataset and a fine-tuning methodology for segmenting EM images of the corpus callosum.
  • methods: The paper uses a combination of EM imaging and manual annotation to create a large-scale dataset of the corpus callosum, and a fine-tuning methodology called EM-SAM that adapts the Segment Anything Model (SAM) to EM image segmentation tasks.
  • results: The paper presents the evaluation results of EM-SAM as a baseline, which outperforms other state-of-the-art methods.
  • for: 这篇论文是为了介绍一个新的数据集和一种精度调整方法,用于对电子显微镜图像的脑干细胞系统进行分割。
  • methods: 这篇论文使用了组合的电子显微镜成像和手动标注,创建了一个大规模的脑干细胞系统数据集,并使用了一种调整方法called EM-SAM,以适应电子显微镜图像分割任务。
  • results: 这篇论文提出了EM-SAM的评估结果作为基准,比其他状态所有方法表现出色。
    Abstract The electron microscope (EM) remains the predominant technique for elucidating intricate details of the animal nervous system at the nanometer scale. However, accurately reconstructing the complex morphology of axons and myelin sheaths poses a significant challenge. Furthermore, the absence of publicly available, large-scale EM datasets encompassing complete cross sections of the corpus callosum, with dense ground truth segmentation for axons and myelin sheaths, hinders the advancement and evaluation of holistic corpus callosum reconstructions. To surmount these obstacles, we introduce the AxonCallosumEM dataset, comprising a 1.83 times 5.76mm EM image captured from the corpus callosum of the Rett Syndrome (RTT) mouse model, which entail extensive axon bundles. We meticulously proofread over 600,000 patches at a resolution of 1024 times 1024, thus providing a comprehensive ground truth for myelinated axons and myelin sheaths. Additionally, we extensively annotated three distinct regions within the dataset for the purposes of training, testing, and validation. Utilizing this dataset, we develop a fine-tuning methodology that adapts Segment Anything Model (SAM) to EM images segmentation tasks, called EM-SAM, enabling outperforms other state-of-the-art methods. Furthermore, we present the evaluation results of EM-SAM as a baseline.
    摘要 电子显微镜(EM)仍然是以纳米尺度解析动物神经系统精细结构的主要技术。然而,准确地重建轴突和髓鞘的复杂形态仍是一项重大挑战。此外,目前没有公开可用的大规模 EM 数据集涵盖胼胝体的完整横截面并带有轴突和髓鞘的密集真实标注,这限制了整体胼胝体重建方法的发展与评估。为了克服这些障碍,我们介绍了 AxonCallosumEM 数据集,它包含取自 Rett 综合征(RTT)小鼠模型胼胝体的 1.83 × 5.76 mm EM 图像,其中含有大量的轴突束。我们在 1024 × 1024 的分辨率下仔细校对了超过 600,000 个图像块,从而为有髓轴突和髓鞘提供了完整的真实标注。此外,我们还对数据集中三个不同区域进行了详细标注,分别用于训练、测试和验证。基于这个数据集,我们开发了一种将 Segment Anything Model (SAM) 适配到 EM 图像分割任务的微调方法,称为 EM-SAM,其性能超越了其他当前最先进的方法。最后,我们给出了 EM-SAM 的评估结果作为基准。
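
A plausible setup for the SAM-to-EM adaptation: load a pretrained SAM checkpoint with the `segment_anything` package, freeze the heavy image encoder, and fine-tune the lightweight mask decoder on EM patches. The checkpoint path, learning rate, and choice of trainable modules are assumptions; the paper's EM-SAM recipe may adapt other components as well.

```python
import torch
from segment_anything import sam_model_registry

# Load a pretrained SAM (ViT-B) checkpoint; the file is assumed to have been downloaded.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Freeze the image and prompt encoders, fine-tune only the mask decoder.
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

trainable = list(sam.mask_decoder.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```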

Expert-Agnostic Ultrasound Image Quality Assessment using Deep Variational Clustering

  • paper_url: http://arxiv.org/abs/2307.02462
  • repo_url: None
  • paper_authors: Deepak Raina, Dimitrios Ntentia, SH Chandrashekhara, Richard Voyles, Subir Kumar Saha
  • for: The paper aims to develop an unsupervised ultrasound image quality assessment network to eliminate the burden and uncertainty of manual annotations.
  • methods: The proposed framework, US2QNet, uses a variational autoencoder with three modules: pre-processing, clustering, and post-processing, to jointly enhance, extract, cluster, and visualize the quality feature representation of ultrasound images.
  • results: The proposed framework achieved 78% accuracy and superior performance to state-of-the-art clustering methods in assessing the quality of urinary bladder ultrasound images.
    Abstract Ultrasound imaging is a commonly used modality for several diagnostic and therapeutic procedures. However, the diagnosis by ultrasound relies heavily on the quality of images assessed manually by sonographers, which diminishes the objectivity of the diagnosis and makes it operator-dependent. The supervised learning-based methods for automated quality assessment require manually annotated datasets, which are highly labour-intensive to acquire. These ultrasound images are low in quality and suffer from noisy annotations caused by inter-observer perceptual variations, which hampers learning efficiency. We propose an UnSupervised UltraSound image Quality assessment Network, US2QNet, that eliminates the burden and uncertainty of manual annotations. US2QNet uses the variational autoencoder embedded with the three modules, pre-processing, clustering and post-processing, to jointly enhance, extract, cluster and visualize the quality feature representation of ultrasound images. The pre-processing module uses filtering of images to point the network's attention towards salient quality features, rather than getting distracted by noise. Post-processing is proposed for visualizing the clusters of feature representations in 2D space. We validated the proposed framework for quality assessment of the urinary bladder ultrasound images. The proposed framework achieved 78% accuracy and superior performance to state-of-the-art clustering methods.
    摘要 “超声成像是一种常用的诊断和治疗过程中的重要模式。然而,超声诊断的准确性受到图像评估员(sonographer)的主观因素的影响,这会使诊断变得操作员依赖。支持学习基本方法需要手动标注的数据集,这是非常劳动 INTENSIVE 的。这些超声图像质量低,并且受到了诊断人员之间的视觉差异导致的噪音标注,这会降低学习效率。我们提议一种无监督的超声图像质量评估网络,namely US2QNet,以消除手动标注的压力和不确定性。US2QNet使用包括预处理、聚类和后处理三个模块的变量自动编码器,以联合提高、提取、聚类和可视化超声图像质量特征表示。预处理模块使用图像滤波来引导网络注意力向关键质量特征方向,而不是被噪声所干扰。后处理 module 提出了可视化特征表示的分布在2D空间。我们验证了我们的框架,用于评估膀胱超声图像质量。我们的框架实现了78%的准确率,并超过了当前的聚类方法的性能。”

LLCaps: Learning to Illuminate Low-Light Capsule Endoscopy with Curved Wavelet Attention and Reverse Diffusion

  • paper_url: http://arxiv.org/abs/2307.02452
  • repo_url: https://github.com/longbai1006/llcaps
  • paper_authors: Long Bai, Tong Chen, Yanan Wu, An Wang, Mobarakol Islam, Hongliang Ren
  • for: 该论文旨在提出一种基于多尺度卷积神经网络和反扩散过程的 wireless capsule endoscopy 低光照图像增强方法,以提高肠道疾病诊断的精度。
  • methods: 该方法基于多尺度卷积神经网络,并提出了弯波let attention 块来学习高频和地方特征。此外,该方法还 combining了反扩散过程来进一步优化混合输出,生成最真实的图像。
  • results: 与前十个state-of-the-art 低光照图像增强方法进行比较,该方法显示出了 statistically 和质量上的优异性。此外,该方法还在肠道疾病分 segmentation 中表现出了优秀的临床潜力。
    Abstract Wireless capsule endoscopy (WCE) is a painless and non-invasive diagnostic tool for gastrointestinal (GI) diseases. However, due to GI anatomical constraints and hardware manufacturing limitations, WCE vision signals may suffer from insufficient illumination, leading to a complicated screening and examination procedure. Deep learning-based low-light image enhancement (LLIE) in the medical field gradually attracts researchers. Given the exuberant development of the denoising diffusion probabilistic model (DDPM) in computer vision, we introduce a WCE LLIE framework based on the multi-scale convolutional neural network (CNN) and reverse diffusion process. The multi-scale design allows models to preserve high-resolution representation and context information from low-resolution, while the curved wavelet attention (CWA) block is proposed for high-frequency and local feature learning. Furthermore, we combine the reverse diffusion procedure to further optimize the shallow output and generate the most realistic image. The proposed method is compared with ten state-of-the-art (SOTA) LLIE methods and significantly outperforms quantitatively and qualitatively. The superior performance on GI disease segmentation further demonstrates the clinical potential of our proposed model. Our code is publicly accessible.
    摘要 无线 capsule 内镜(WCE)是一种无痛、非侵入性的诊断工具 для Digestive tract 疾病(GI)。然而,由于 GI анатомиче限制和硬件制造限制,WCE 视像信号可能受到不足照明的影响,导致复杂的检测和诊断过程。在医学领域,深度学习基于的低光照图像提高(LLIE)逐渐吸引研究人员。我们在 Computer Vision 领域的发展中,引入了基于多尺度 convolutional neural network (CNN)和反演扩散过程的 WCE LLIE 框架。这种多尺度设计允许模型保留高分辨率表示和 context 信息,而 Curved wavelet attention (CWA)块是用于高频和本地特征学习。此外,我们将反演扩散过程组合到输出优化,生成最真实的图像。我们的提议方法与当前 SOTA LLIE 方法进行比较,显著超越量化和质量上。此外,我们还对 GI 疾病 segmentation 进行了评估,并证明了我们的提议模型在临床上的潜在价值。我们的代码公共可访问。

Base Layer Efficiency in Scalable Human-Machine Coding

  • paper_url: http://arxiv.org/abs/2307.02430
  • repo_url: None
  • paper_authors: Yalda Foroutan, Alon Harell, Anderson de Andrade, Ivan V. Bajić
  • for: 这个论文是为了提高一种可扩展的人机图像编码器的基层编码效率而写的。
  • methods: 论文使用了一种现状最佳的人机图像编码器,并对其基层编码进行分析和优化。
  • results: 论文表明,通过对基层编码进行优化,可以获得20-40%的BD-Rate提升,用于对物体检测和实例分割进行优化。
    Abstract A basic premise in scalable human-machine coding is that the base layer is intended for automated machine analysis and is therefore more compressible than the same content would be for human viewing. Use cases for such coding include video surveillance and traffic monitoring, where the majority of the content will never be seen by humans. Therefore, base layer efficiency is of paramount importance because the system would most frequently operate at the base-layer rate. In this paper, we analyze the coding efficiency of the base layer in a state-of-the-art scalable human-machine image codec, and show that it can be improved. In particular, we demonstrate that gains of 20-40% in BD-Rate compared to the currently best results on object detection and instance segmentation are possible.
    摘要 可扩展人机编码的基本假设是:基础层面向机器自动分析,因此同样的内容可以比面向人类观看时压缩得更多。这类编码的使用场景包括视频监控和交通监测,其中绝大多数内容永远不会被人类查看。因此,基础层的效率至关重要,因为系统大部分时间都将以基础层码率运行。在这篇论文中,我们分析了一种最先进的可扩展人机图像编解码器的基础层编码效率,并证明其可以进一步提高。特别是,我们表明相对于目标检测和实例分割任务上目前最好的结果,可以获得 20-40% 的 BD-Rate 提升。

DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.02421
  • repo_url: https://github.com/mc-e/dragondiffusion
  • paper_authors: Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
  • for: 这篇论文是为了提出一种新的图像编辑方法,即DragonDiffusion,允许用户通过拖动方式进行图像修改。
  • methods: 该方法基于Diffusion模型的强相关性,通过特征对应损失将编辑信号转换为梯度,以修改Diffusion模型的中间表示。此外,还采用多尺度引导和自注意力机制保持原图和修改结果之间的一致性。
  • results: 该方法可以实现多种编辑模式,包括对已有图像的对象移动、对象放大、对象外观替换和内容拖动等。而所有的编辑和内容保持信号都来自于原图,无需 Fine-tuning 或额外模块。
    Abstract Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at https://github.com/MC-E/DragonDiffusion.
    摘要 尽管现有的大规模文本到图像(T2I)模型可以生成高质量的图像从详细的文本描述,但它们通常缺乏 precisely 编辑生成的图像的能力。在这篇论文中,我们提出了一种新的图像编辑方法,即 DragonDiffusion,允许用户通过 Drag 式操作来修改Diffusion 模型中的图像。特别是,我们基于Diffusion 模型中强相关的间接特征的类ifikator导航,可以将编辑信号转化为梯度,以修改Diffusion 模型的中间表示。此外,我们还建立了多尺度导航,以考虑 Semantic 和 Geometric 对齐。此外,我们还添加了cross-branch self-attention,以保持原始图像和修改结果之间的一致性。我们的方法可以实现多种编辑模式,如物体移动、物体放大、物体外观替换和内容拖动。值得注意的是,所有的编辑和内容保持信号都来自于图像本身,而模型不需要 fine-tuning 或额外模块。我们的源代码将在 GitHub 上公开。
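
A hedged sketch of the guidance mechanism: a feature-correspondence loss on the UNet's intermediate features is converted into a gradient on the diffusion latent. `unet_features_fn` is an assumed hook returning intermediate feature maps; the actual method uses multi-scale guidance and a cross-branch self-attention for content preservation on top of this.

```python
import torch
import torch.nn.functional as F

def correspondence_guidance(latent, unet_features_fn, edit_mask, ref_features, scale=1.0):
    """One guidance step on a diffusion latent driven by feature correspondence.

    latent           : current diffusion latent (B, C, H, W)
    unet_features_fn : assumed hook mapping a latent to intermediate UNet features
    edit_mask        : binary mask of the region being edited (same spatial size as features)
    ref_features     : reference features the edited region should match
    """
    latent = latent.detach().requires_grad_(True)
    feats = unet_features_fn(latent)
    loss = F.mse_loss(feats * edit_mask, ref_features * edit_mask)
    grad, = torch.autograd.grad(loss, latent)
    return latent - scale * grad                  # nudge the latent to reduce the mismatch
```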

Unbalanced Optimal Transport: A Unified Framework for Object Detection

  • paper_url: http://arxiv.org/abs/2307.02402
  • repo_url: https://github.com/hdeplaen/uotod
  • paper_authors: Henri De Plaen, Pierre-François De Plaen, Johan A. K. Suykens, Marc Proesmans, Tinne Tuytelaars, Luc Van Gool
  • for: 这篇论文主要是为了研究supervised object detection中的匹配策略,以提高模型的准确率和初始化速度。
  • methods: 论文使用的方法包括Unbalanced Optimal Transport,它可以将不同的匹配策略集成到一起,并提供一个灵活的选择参数,以便根据不同的目标选择最佳的方法。
  • results: 实验结果表明,使用Unbalanced Optimal Transport进行匹配可以达到当前领域的最佳性能,包括均值准确率和均值回归率,并且可以提高模型的初始化速度。
    Abstract During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.
    摘要
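
To illustrate how unbalanced optimal transport interpolates between the two classical matching strategies, here is a sketch using the POT library: relaxing the marginal constraints (`reg_m`) lets predictions carry unequal mass, recovering behaviour between Hungarian-style one-to-one matching and closest-ground-truth assignment. The cost construction and uniform marginals are assumptions made for illustration.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def uot_match(cost, reg=0.05, reg_m=1.0):
    """Soft prediction-to-ground-truth assignment via entropic unbalanced OT.

    cost  : (n_pred, n_gt) matching cost, e.g. 1 - GIoU plus a classification term
    reg   : entropic regularisation strength
    reg_m : marginal relaxation; larger values enforce the marginals more strictly
    """
    n_pred, n_gt = cost.shape
    a = np.full(n_pred, 1.0 / n_pred)             # uniform mass on predictions
    b = np.full(n_gt, 1.0 / n_gt)                 # uniform mass on ground truths
    plan = ot.unbalanced.sinkhorn_unbalanced(a, b, cost, reg, reg_m)
    return plan                                   # (n_pred, n_gt) soft assignment matrix
```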

RADiff: Controllable Diffusion Models for Radio Astronomical Maps Generation

  • paper_url: http://arxiv.org/abs/2307.02392
  • repo_url: None
  • paper_authors: Renato Sortino, Thomas Cecconello, Andrea DeMarco, Giuseppe Fiameni, Andrea Pilzer, Andrew M. Hopkins, Daniel Magro, Simone Riggi, Eva Sciacca, Adriano Ingallinera, Cristobal Bordiu, Filomena Bufano, Concetto Spampinato
  • for: 这篇论文的目的是提出一个基于条件扩散模型的生成方法,用于增强对于电波天文学数据的自动化分析和检测。
  • methods: 这篇论文使用的方法包括使用条件扩散模型生成 sintetic 图像,以增强现有数据的标注和检测。
  • results: 这篇论文的结果显示,使用生成的 sintetic 图像可以提高 semantic segmentation 模型的性能,并且可以自动增强数据集的规模和多样性。
    Abstract Along with the nearing completion of the Square Kilometre Array (SKA), comes an increasing demand for accurate and reliable automated solutions to extract valuable information from the vast amount of data it will allow acquiring. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based object detection and semantic segmentation models have proven to be suitable for this purpose. However, training such deep networks requires a high volume of labeled data, which is not trivial to obtain in the context of radio astronomy. Since data needs to be manually labeled by experts, this process is not scalable to large dataset sizes, limiting the possibilities of leveraging deep networks to address several tasks. In this work, we propose RADiff, a generative approach based on conditional diffusion models trained over an annotated radio dataset to generate synthetic images, containing radio sources of different morphologies, to augment existing datasets and reduce the problems caused by class imbalances. We also show that it is possible to generate fully-synthetic image-annotation pairs to automatically augment any annotated dataset. We evaluate the effectiveness of this approach by training a semantic segmentation model on a real dataset augmented in two ways: 1) using synthetic images obtained from real masks, and 2) generating images from synthetic semantic masks. We show an improvement in performance when applying augmentation, gaining up to 18% in performance when using real masks and 4% when augmenting with synthetic masks. Finally, we employ this model to generate large-scale radio maps with the objective of simulating Data Challenges.
    摘要 alongside the impending completion of the Square Kilometre Array (SKA), there is an increasing demand for accurate and reliable automated solutions to extract valuable information from the vast amount of data it will acquire. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based object detection and semantic segmentation models have proven to be suitable for this purpose. However, training such deep networks requires a high volume of labeled data, which is not easy to obtain in the context of radio astronomy. Since data needs to be manually labeled by experts, this process is not scalable to large dataset sizes, limiting the possibilities of leveraging deep networks to address several tasks. In this work, we propose RADiff, a generative approach based on conditional diffusion models trained over an annotated radio dataset to generate synthetic images, containing radio sources of different morphologies, to augment existing datasets and reduce the problems caused by class imbalances. We also show that it is possible to generate fully-synthetic image-annotation pairs to automatically augment any annotated dataset. We evaluate the effectiveness of this approach by training a semantic segmentation model on a real dataset augmented in two ways: 1) using synthetic images obtained from real masks, and 2) generating images from synthetic semantic masks. We show an improvement in performance when applying augmentation, gaining up to 18% in performance when using real masks and 4% when augmenting with synthetic masks. Finally, we employ this model to generate large-scale radio maps with the objective of simulating Data Challenges.