cs.CV - 2023-07-27

Federated Model Aggregation via Self-Supervised Priors for Highly Imbalanced Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14959
  • repo_url: https://github.com/xmed-lab/fed-mas
  • paper_authors: Marawan Elbatel, Hualiang Wang, Robert Martí, Huazhu Fu, Xiaomeng Li
  • for: This paper targets highly imbalanced medical datasets, such as skin lesions and gastrointestinal images. Existing federated learning methods for highly imbalanced data mainly optimize a global model without accounting for the inter-client intra-class variations caused by different populations, findings, and scanners.
  • methods: The paper studies inter-client variation with publicly available self-supervised networks. Specifically, running a shared auxiliary pre-trained model such as MoCo-V2 locally on every client yields consistent divergence measurements. Based on these findings, the authors derive a dynamic balanced Model Aggregation via Self-supervised priors (MAS) to guide global model optimization (a minimal aggregation sketch follows the abstract below).
  • results: Fed-MAS can be combined with different local learning methods for effective model aggregation, yielding a highly robust and unbiased global model. Code is available at https://github.com/xmed-lab/Fed-MAS.
    Abstract In the medical field, federated learning commonly deals with highly imbalanced datasets, including skin lesions and gastrointestinal images. Existing federated methods under highly imbalanced datasets primarily focus on optimizing a global model without incorporating the intra-class variations that can arise in medical imaging due to different populations, findings, and scanners. In this paper, we study the inter-client intra-class variations with publicly available self-supervised auxiliary networks. Specifically, we find that employing a shared auxiliary pre-trained model, like MoCo-V2, locally on every client yields consistent divergence measurements. Based on these findings, we derive a dynamic balanced model aggregation via self-supervised priors (MAS) to guide the global model optimization. Fed-MAS can be utilized with different local learning methods for effective model aggregation toward a highly robust and unbiased global model. Our code is available at \url{https://github.com/xmed-lab/Fed-MAS}.
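    Code sketch: a minimal illustration of divergence-weighted model averaging, assuming per-client divergence scores have already been computed locally with a shared self-supervised prior such as MoCo-V2; the actual Fed-MAS weighting scheme is more elaborate than this softmax.

```python
import numpy as np

def aggregate(client_states, divergences, temperature=1.0):
    """Divergence-weighted federated averaging (illustrative only).

    client_states: list of dicts mapping parameter name -> np.ndarray
    divergences:   per-client divergence scores, assumed precomputed locally
                   with a shared self-supervised prior such as MoCo-V2
    """
    d = np.asarray(divergences, dtype=np.float64)
    # Clients whose local features diverge less from the shared prior
    # receive larger aggregation weights (softmax over negative divergence).
    w = np.exp(-d / temperature)
    w /= w.sum()
    return {name: sum(wi * cs[name] for wi, cs in zip(w, client_states))
            for name in client_states[0]}
```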

GET3D–: Learning GET3D from Unconstrained Image Collections

  • paper_url: http://arxiv.org/abs/2307.14918
  • repo_url: None
  • paper_authors: Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong
  • for: The demand for high-quality 3D model generation is growing rapidly, while manual creation of 3D models requires specialized expertise and a great deal of time.
  • methods: The authors propose GET3D--, which directly generates high-quality textured 3D shapes from 2D images with unknown pose and scale; it couples a 3D shape generator with a learnable camera sampler that captures 6D external camera changes, and a novel training schedule stably optimizes both components.
  • results: Extensive experiments show that GET3D-- precisely fits the 6D camera pose distribution and generates high-quality shapes on both synthetic and realistic unconstrained datasets.
    Abstract The demand for efficient 3D model generation techniques has grown exponentially, as manual creation of 3D models is time-consuming and requires specialized expertise. While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes. To overcome this limitation, we propose GET3D--, the first method that directly generates textured 3D shapes from 2D images with unknown pose and scale. GET3D-- comprises a 3D shape generator and a learnable camera sampler that captures the 6D external changes on the camera. In addition, We propose a novel training schedule to stably optimize both the shape generator and camera sampler in a unified framework. By controlling external variations using the learnable camera sampler, our method can generate aligned shapes with clear textures. Extensive experiments demonstrate the efficacy of GET3D--, which precisely fits the 6D camera pose distribution and generates high-quality shapes on both synthetic and realistic unconstrained datasets.

NSA: Naturalistic Support Artifact to Boost Network Confidence

  • paper_url: http://arxiv.org/abs/2307.14917
  • repo_url: None
  • paper_authors: Abhijith Sharma, Phil Munz, Apurva Narayan
  • for: This paper focuses on the robustness of visual AI systems against natural and synthetic physical corruptions.
  • methods: It proposes Naturalistic Support Artifacts (NSAs), natural-looking objects generated through artifact training with a DC-GAN, to make predictions more reliable when model parameters are inaccessible but adding artifacts to the scene is feasible.
  • results: NSAs counter natural corruptions, improving prediction confidence scores by four times on the Imagenette dataset and increasing adversarial accuracy by 8% on average.
    Abstract Visual AI systems are vulnerable to natural and synthetic physical corruption in the real-world. Such corruption often arises unexpectedly and alters the model's performance. In recent years, the primary focus has been on adversarial attacks. However, natural corruptions (e.g., snow, fog, dust) are an omnipresent threat to visual AI systems and should be considered equally important. Many existing works propose interesting solutions to train robust models against natural corruption. These works either leverage image augmentations, which come with the additional cost of model training, or place suspicious patches in the scene to design unadversarial examples. In this work, we propose the idea of naturalistic support artifacts (NSA) for robust prediction. The NSAs are shown to be beneficial in scenarios where model parameters are inaccessible and adding artifacts in the scene is feasible. The NSAs are natural looking objects generated through artifact training using DC-GAN to have high visual fidelity in the scene. We test against natural corruptions on the Imagenette dataset and observe the improvement in prediction confidence score by four times. We also demonstrate NSA's capability to increase adversarial accuracy by 8\% on average. Lastly, we qualitatively analyze NSAs using saliency maps to understand how they help improve prediction confidence.

Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning

  • paper_url: http://arxiv.org/abs/2307.15099
  • repo_url: None
  • paper_authors: Keisuke Kubota, Masahiro Okuda
  • for: This paper aims to classify illustrations by their "atmosphere" to support recommendation and search.
  • methods: It combines supervised learning with pseudo-labels and unsupervised learning: feature vectors are obtained with a pseudo-label-supervised model and then clustered (a minimal feature-extraction-plus-clustering sketch follows the abstract below).
  • results: Experimental analysis shows the method produces more human-like clusters than conventional approaches on datasets manually classified by humans.
    Abstract The distribution of illustrations on social media, such as Twitter and Pixiv has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of illustrations plays an important role in user preferences. Classifying illustrations by atmosphere can be helpful for recommendations and searches. However, assigning clear labels to the elusive "atmosphere" and conventional supervised classification is not always practical. Furthermore, even images with similar colors, edges, and low-level features may not have similar atmospheres, making classification based on low-level features challenging. In this paper, this problem is solved using both supervised and unsupervised learning with pseudo-labels. The feature vectors are obtained using the supervised method with pseudo-labels that contribute to an ambiguous atmosphere. Further, clustering is performed based on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.
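    Code sketch: the generic recipe the entry describes, taking feature vectors from a supervised backbone and clustering them with k-means; the ImageNet-pretrained backbone below is an assumption standing in for the paper's pseudo-label-supervised model.

```python
import torch
import torchvision
from sklearn.cluster import KMeans

# Feature extractor: a generic ImageNet-pretrained backbone stands in for the
# paper's pseudo-label-supervised model (an assumption for illustration).
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # expose the 512-d feature vector
backbone.eval()

@torch.no_grad()
def extract_features(images):              # images: (N, 3, 224, 224) float tensor
    return backbone(images).cpu().numpy()

def cluster_by_atmosphere(images, n_clusters=10):
    feats = extract_features(images)       # one feature vector per illustration
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
```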

Weakly Supervised AI for Efficient Analysis of 3D Pathology Samples

  • paper_url: http://arxiv.org/abs/2307.14907
  • repo_url: https://github.com/mahmoodlab/mamba
  • paper_authors: Andrew H. Song, Mane Williams, Drew F. K. Williamson, Guillaume Jaume, Andrew Zhang, Bowen Chen, Robert Serafin, Jonathan T. C. Liu, Alex Baras, Anil V. Parwani, Faisal Mahmood
  • for: This paper presents a deep-learning platform for analyzing 3D pathology samples to support cancer diagnosis and assessment of treatment response.
  • methods: The MAMBA (Modality-Agnostic Multiple instance learning for volumetric Block Analysis) platform processes 3D tissue images from different imaging modalities to predict 5-year biochemical recurrence in prostate cancer patients (an illustrative attention-MIL pooling sketch follows the abstract below).
  • results: MAMBA predicts 5-year biochemical recurrence better than 2D single-slice prognostication (AUC 0.86 and 0.74 versus 0.79 and 0.57), and incorporating larger tissue volumes mitigates sampling bias and improves prognostic accuracy.
    Abstract Human tissue and its constituent cells form a microenvironment that is fundamentally three-dimensional (3D). However, the standard-of-care in pathologic diagnosis involves selecting a few two-dimensional (2D) sections for microscopic evaluation, risking sampling bias and misdiagnosis. Diverse methods for capturing 3D tissue morphologies have been developed, but they have yet had little translation to clinical practice; manual and computational evaluations of such large 3D data have so far been impractical and/or unable to provide patient-level clinical insights. Here we present Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), a deep-learning-based platform for processing 3D tissue images from diverse imaging modalities and predicting patient outcomes. Archived prostate cancer specimens were imaged with open-top light-sheet microscopy or microcomputed tomography and the resulting 3D datasets were used to train risk-stratification networks based on 5-year biochemical recurrence outcomes via MAMBA. With the 3D block-based approach, MAMBA achieves an area under the receiver operating characteristic curve (AUC) of 0.86 and 0.74, superior to 2D traditional single-slice-based prognostication (AUC of 0.79 and 0.57), suggesting superior prognostication with 3D morphological features. Further analyses reveal that the incorporation of greater tissue volume improves prognostic performance and mitigates risk prediction variability from sampling bias, suggesting the value of capturing larger extents of heterogeneous 3D morphology. With the rapid growth and adoption of 3D spatial biology and pathology techniques by researchers and clinicians, MAMBA provides a general and efficient framework for 3D weakly supervised learning for clinical decision support and can help to reveal novel 3D morphological biomarkers for prognosis and therapeutic response.
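    Code sketch: a generic attention-based multiple-instance-learning pooling head over per-block embeddings, illustrating how block-level features can be aggregated into a single case-level risk score; this is an assumption for illustration, not the actual MAMBA architecture.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Attention-based MIL pooling over per-block embeddings (illustrative;
    not the actual MAMBA architecture)."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, block_embeddings):             # (num_blocks, dim), one case
        a = torch.softmax(self.attn(block_embeddings), dim=0)  # block weights
        case_embedding = (a * block_embeddings).sum(dim=0)     # weighted pooling
        return self.classifier(case_embedding)                 # recurrence-risk logit
```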

Mixture of Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.14897
  • repo_url: https://github.com/aristorenaldo/g-ssl
  • paper_authors: Aristo Renaldo Ruslim, Novanto Yudistira, Budi Darma Setiawan
  • For: The paper aims to improve image classification by proposing a Gated Self-Supervised Learning (G-SSL) method that uses multiple pretext tasks and a Mixture of Expert architecture to combine them.
  • Methods: The proposed G-SSL method uses a combination of pretext tasks, including rotation prediction, solving jigsaw puzzles, and predicting relative positions on images. It employs a Mixture of Expert architecture as a gating network to combine the pretext tasks and automatically focus on the most useful augmentations for classification (a minimal gating sketch follows the abstract below).
  • Results: The method is tested on several scenarios, including CIFAR imbalanced dataset classification, adversarial perturbations, Tiny-Imagenet classification, and semi-supervised learning, and outperforms previous self-supervised learning methods on image classification tasks. Grad-CAM and T-SNE analyses are used to visualize the important features that influence classification and to check that data for each class is represented and separated properly.
    Abstract Self-supervised learning is popular method because of its ability to learn features in images without using its labels and is able to overcome limited labeled datasets used in supervised learning. Self-supervised learning works by using a pretext task which will be trained on the model before being applied to a specific task. There are some examples of pretext tasks used in self-supervised learning in the field of image recognition, namely rotation prediction, solving jigsaw puzzles, and predicting relative positions on image. Previous studies have only used one type of transformation as a pretext task. This raises the question of how it affects if more than one pretext task is used and to use a gating network to combine all pretext tasks. Therefore, we propose the Gated Self-Supervised Learning method to improve image classification which use more than one transformation as pretext task and uses the Mixture of Expert architecture as a gating network in combining each pretext task so that the model automatically can study and focus more on the most useful augmentations for classification. We test performance of the proposed method in several scenarios, namely CIFAR imbalance dataset classification, adversarial perturbations, Tiny-Imagenet dataset classification, and semi-supervised learning. Moreover, there are Grad-CAM and T-SNE analysis that are used to see the proposed method for identifying important features that influence image classification and representing data for each class and separating different classes properly. Our code is in https://github.com/aristorenaldo/G-SSL
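    Code sketch: a minimal Mixture-of-Experts-style gating network that softly combines per-pretext-task features; the names, dimensions, and objective below are assumptions, and the paper's G-SSL design is more involved.

```python
import torch
import torch.nn as nn

class PretextGate(nn.Module):
    """Soft gating over features from several pretext tasks (e.g., rotation,
    jigsaw, relative position). Illustrative only."""
    def __init__(self, dim=512, n_tasks=3, n_classes=10):
        super().__init__()
        self.gate = nn.Linear(dim, n_tasks)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, task_feats):                   # (batch, n_tasks, dim)
        pooled = task_feats.mean(dim=1)              # summary used to compute gates
        g = torch.softmax(self.gate(pooled), dim=-1)        # (batch, n_tasks)
        fused = (g.unsqueeze(-1) * task_feats).sum(dim=1)   # weighted fusion
        return self.classifier(fused)                # class logits
```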

Self-Supervised Learning for Improved Synthetic Aperture Sonar Target Recognition

  • paper_url: http://arxiv.org/abs/2307.15098
  • repo_url: None
  • paper_authors: BW Sheffield
  • for: This study explores self-supervised learning (SSL) for target recognition in synthetic aperture sonar (SAS) imagery. Traditional computer vision techniques, which rely heavily on optical camera imagery, are less effective underwater, whereas SAS produces high-resolution imagery and is therefore preferred; however, the volume of high-resolution SAS data makes labeling, a crucial step for training deep neural networks (DNNs), a significant challenge.
  • methods: Two prominent SSL algorithms, MoCov2 and BYOL, are compared against a well-regarded supervised learning model, ResNet18, on binary image classification tasks.
  • results: In few-shot scenarios with a small number of labels the SSL models can outperform the fully supervised model, but they do not exceed it when all labels are used. SSL is thus a viable alternative that maintains task performance while reducing the time and cost of data labeling.
    Abstract This study explores the application of self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery. The unique challenges of underwater environments make traditional computer vision techniques, which rely heavily on optical camera imagery, less effective. SAS, with its ability to generate high-resolution imagery, emerges as a preferred choice for underwater imaging. However, the voluminous high-resolution SAS data presents a significant challenge for labeling; a crucial step for training deep neural networks (DNNs). SSL, which enables models to learn features in data without the need for labels, is proposed as a potential solution to the data labeling challenge in SAS. The study evaluates the performance of two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised learning model, ResNet18, for binary image classification tasks. The findings suggest that while both SSL models can outperform a fully supervised model with access to a small number of labels in a few-shot scenario, they do not exceed it when all the labels are used. The results underscore the potential of SSL as a viable alternative to traditional supervised learning, capable of maintaining task performance while reducing the time and costs associated with data labeling. The study also contributes to the growing body of evidence supporting the use of SSL in remote sensing and could stimulate further research in this area.

Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration

  • paper_url: http://arxiv.org/abs/2307.14866
  • repo_url: https://github.com/xacheng1996/sllm
  • paper_authors: Harry Cheng, Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Mohan Kankanhalli
  • for: Improve the efficiency of video action recognition models, particularly under limited computational budgets.
  • methods: A novel method restores the intermediate features between two sparsely sampled, adjacent video frames at negligible extra cost compared to resource-intensive image encoders such as ViT (a minimal restoration-module sketch follows the abstract below).
  • results: Extensive experiments on four public datasets show the method improves the efficiency of three commonly used baselines by over 50% with only a 0.5% drop in recognition accuracy, and it also, surprisingly, improves generalization under zero-shot settings.
    Abstract Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim to either reduce model size or utilize pre-trained models, limiting their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a prevalent problem in many approaches yet it has received relatively little attention. Despite the use of fewer frames being a potential solution, this approach often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. This feature restoration technique brings a negligible increase in computational requirements compared to resource-intensive image encoders, such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets, including Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy. In addition, our method also surprisingly helps improve the generalization ability of the models under zero-shot settings.
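    Code sketch: a minimal restoration module that predicts a skipped frame's features from the features of its two sampled neighbours; the interface and MLP design are assumptions, and the paper's learned module sits inside a larger pipeline.

```python
import torch
import torch.nn as nn

class FrameFeatureRestorer(nn.Module):
    """Predict features of a skipped frame from its two sampled neighbours
    (illustrative interface only)."""
    def __init__(self, dim=768, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, feat_prev, feat_next):         # features of frames t and t+k
        return self.mlp(torch.cat([feat_prev, feat_next], dim=-1))
```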

A full-resolution training framework for Sentinel-2 image fusion

  • paper_url: http://arxiv.org/abs/2307.14864
  • repo_url: https://github.com/matciotola/FR-FUSE
  • paper_authors: Matteo Ciotola, Mario Ragosta, Giovanni Poggi, Giuseppe Scarpa
  • for: This work proposes a new unsupervised framework for training deep learning models for super-resolution of Sentinel-2 images.
  • methods: The scheme fuses the 10-m and 20-m bands of Sentinel-2 images without the resolution-downgrade process needed to generate supervised training data, and introduces a loss that enforces cycle-consistency between the network prediction and the input components to be fused (a minimal cycle-consistency term is sketched after the abstract below).
  • results: In preliminary experiments the scheme shows promising results compared to the supervised approach, and by construction of the loss the trained network can be ascribed to the class of multi-resolution analysis methods.
    Abstract This work presents a new unsupervised framework for training deep learning models for super-resolution of Sentinel-2 images by fusion of its 10-m and 20-m bands. The proposed scheme avoids the resolution downgrade process needed to generate training data in the supervised case. On the other hand, a proper loss that accounts for cycle-consistency between the network prediction and the input components to be fused is proposed. Despite its unsupervised nature, in our preliminary experiments the proposed scheme has shown promising results in comparison to the supervised approach. Besides, by construction of the proposed loss, the resulting trained network can be ascribed to the class of multi-resolution analysis methods.
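    Code sketch: one plausible cycle-consistency term, assuming the super-resolved 20-m bands, degraded back to their native resolution, should match the observed input; the degradation operator and the paper's full-resolution loss differ from this simplification.

```python
import torch.nn.functional as F

def cycle_consistency_term(pred_hr20, input_lr20, scale=2):
    """Degrade the sharpened 20 m bands back to the 20 m grid and compare with
    the observed input (one possible consistency term, not the paper's loss)."""
    degraded = F.avg_pool2d(pred_hr20, kernel_size=scale)   # crude low-pass + decimate
    return F.l1_loss(degraded, input_lr20)
```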

IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer

  • paper_url: http://arxiv.org/abs/2307.14863
  • repo_url: https://github.com/sunnyhaze/iml-vit
  • paper_authors: Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y. Al Hammadi, Jizhe Zhou
  • for: IML-ViT is designed to capture artifacts in image manipulation localization, which is crucial for ensuring the trustworthiness of multimedia.
  • methods: IML-ViT utilizes a ViT architecture with high-resolution capacity, multi-scale feature extraction, and manipulation edge supervision to capture artifacts. The self-attention mechanism is employed to enhance the model's ability to extract non-semantic discrepancies between manipulated and authentic regions.
  • results: Extensive experiments on five benchmark datasets demonstrate that IML-ViT outperforms state-of-the-art manipulation localization methods, showcasing the effectiveness of the proposed approach.
    Abstract Advanced image tampering techniques are increasingly challenging the trustworthiness of multimedia, leading to the development of Image Manipulation Localization (IML). But what makes a good IML model? The answer lies in the way to capture artifacts. Exploiting artifacts requires the model to extract non-semantic discrepancies between manipulated and authentic regions, necessitating explicit comparisons between the two areas. With the self-attention mechanism, naturally, the Transformer should be a better candidate to capture artifacts. However, due to limited datasets, there is currently no pure ViT-based approach for IML to serve as a benchmark, and CNNs dominate the entire task. Nevertheless, CNNs suffer from weak long-range and non-semantic modeling. To bridge this gap, based on the fact that artifacts are sensitive to image resolution, amplified under multi-scale features, and massive at the manipulation border, we formulate the answer to the former question as building a ViT with high-resolution capacity, multi-scale feature extraction capability, and manipulation edge supervision that could converge with a small amount of data. We term this simple but effective ViT paradigm IML-ViT, which has significant potential to become a new benchmark for IML. Extensive experiments on five benchmark datasets verified our model outperforms the state-of-the-art manipulation localization methods.Code and models are available at \url{https://github.com/SunnyHaze/IML-ViT}.

Comparative Evaluation of Digital and Analog Chest Radiographs to Identify Tuberculosis using Deep Learning Model

  • paper_url: http://arxiv.org/abs/2307.14859
  • repo_url: None
  • paper_authors: Subhankar Chattoraj, Bhargava Reddy, Manoj Tadepalli, Preetham Putha
  • for: This study evaluates the performance of a deep-learning (DL) based device in identifying radiological signs of tuberculosis (TB) on chest X-rays (CXR), on both digital and analog films.
  • methods: A dataset of 10,000 CXR DICOMs (.dcm) and printed films photographed with three different phones (Samsung S8, iPhone 8, and iPhone XS), together with their radiological reports, was retrospectively collected from various sites across India between April 2020 and March 2021.
  • results: The DL-based device identified radiological signs of TB with an AUC of 0.928, a sensitivity of 0.841, and a specificity of 0.806 on the original DICOMs; the AUC differences of the three phones relative to the DICOMs were 2.55%, 5.10%, and 1.91%, demonstrating the robustness of the device on both digital and analog CXR.
    Abstract Purpose:Chest X-ray (CXR) is an essential tool and one of the most prescribed imaging to detect pulmonary abnormalities, with a yearly estimate of over 2 billion imaging performed worldwide. However, the accurate and timely diagnosis of TB remains an unmet goal. The prevalence of TB is highest in low-middle-income countries, and the requirement of a portable, automated, and reliable solution is required. In this study, we compared the performance of DL-based devices on digital and analog CXR. The evaluated DL-based device can be used in resource-constraint settings. Methods: A total of 10,000 CXR DICOMs(.dcm) and printed photos of the films acquired with three different cellular phones - Samsung S8, iPhone 8, and iPhone XS along with their radiological report were retrospectively collected from various sites across India from April 2020 to March 2021. Results: 10,000 chest X-rays were utilized to evaluate the DL-based device in identifying radiological signs of TB. The AUC of qXR for detecting signs of tuberculosis on the original DICOMs dataset was 0.928 with a sensitivity of 0.841 at a specificity of 0.806. At an optimal threshold, the difference in the AUC of three cellular smartphones with the original DICOMs is 0.024 (2.55%), 0.048 (5.10%), and 0.038 (1.91%). The minimum difference demonstrates the robustness of the DL-based device in identifying radiological signs of TB in both digital and analog CXR.

Simplified Concrete Dropout – Improving the Generation of Attribution Masks for Fine-grained Classification

  • paper_url: http://arxiv.org/abs/2307.14825
  • repo_url: None
  • paper_authors: Dimitri Korsch, Maha Shadaydeh, Joachim Denzler
  • for: This paper aims to improve the precision and reliability of attribution masks (visual explanations) for fine-grained classification models.
  • methods: It builds on a perturbation-based approach, specifically the fill-in of the dropout (FIDO) algorithm, and simplifies the concrete dropout sampling so that the sampling parameters can be estimated with smaller mini-batches (a minimal concrete-dropout mask sketch follows the abstract below).
  • results: The results show that the new method reduces computational effort while producing finer, more coherent attribution masks, which in turn improve the classification performance of a trained model without additional fine-tuning.
    Abstract Fine-grained classification is a particular case of a classification problem, aiming to classify objects that share the visual appearance and can only be distinguished by subtle differences. Fine-grained classification models are often deployed to determine animal species or individuals in automated animal monitoring systems. Precise visual explanations of the model's decision are crucial to analyze systematic errors. Attention- or gradient-based methods are commonly used to identify regions in the image that contribute the most to the classification decision. These methods deliver either too coarse or too noisy explanations, unsuitable for identifying subtle visual differences reliably. However, perturbation-based methods can precisely identify pixels causally responsible for the classification result. Fill-in of the dropout (FIDO) algorithm is one of those methods. It utilizes the concrete dropout (CD) to sample a set of attribution masks and updates the sampling parameters based on the output of the classification model. A known problem of the algorithm is a high variance in the gradient estimates, which the authors have mitigated until now by mini-batch updates of the sampling parameters. This paper presents a solution to circumvent these computational instabilities by simplifying the CD sampling and reducing reliance on large mini-batch sizes. First, it allows estimating the parameters with smaller mini-batch sizes without losing the quality of the estimates but with a reduced computational effort. Furthermore, our solution produces finer and more coherent attribution masks. Finally, we use the resulting attribution masks to improve the classification performance of a trained model without additional fine-tuning of the model.
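    Code sketch: standard concrete (relaxed Bernoulli) dropout mask sampling of the kind FIDO uses to make attribution masks differentiable; the paper's simplified sampling differs from this baseline formulation.

```python
import torch

def concrete_dropout_mask(logit_p, temperature=0.1):
    """Sample a continuous relaxation of a Bernoulli dropout mask.

    logit_p: per-pixel keep-probability logits being optimized.
    The relaxation is sigmoid((logit_p + logistic noise) / temperature);
    this is the standard concrete-dropout formulation, not the paper's
    simplified variant."""
    u = torch.rand_like(logit_p).clamp(1e-6, 1 - 1e-6)   # uniform noise
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logit_p + logistic_noise) / temperature)
```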

Building RadiologyNET: Unsupervised annotation of a large-scale multimodal medical database

  • paper_url: http://arxiv.org/abs/2308.08517
  • repo_url: None
  • paper_authors: Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar
  • for: The goal of this paper is to build a large-scale, automatically annotated collection of medical radiology images.
  • methods: An automated, unsupervised approach uses multimodal sources (images, DICOM metadata, and narrative diagnoses) and tests several appropriate feature extractors for each source to identify the best ones.
  • results: The selected feature extractors are integrated into a multimodal representation and evaluated with k-means and k-medoids clustering on a representative subset; fusing the embeddings of all data sources works best for the unsupervised clustering task and yields the most concise clusters.
    Abstract Background and objective: The usage of machine learning in medical diagnosis and treatment has witnessed significant growth in recent years through the development of computer-aided diagnosis systems that are often relying on annotated medical radiology images. However, the availability of large annotated image datasets remains a major obstacle since the process of annotation is time-consuming and costly. This paper explores how to automatically annotate a database of medical radiology images with regard to their semantic similarity. Material and methods: An automated, unsupervised approach is used to construct a large annotated dataset of medical radiology images originating from Clinical Hospital Centre Rijeka, Croatia, utilising multimodal sources, including images, DICOM metadata, and narrative diagnoses. Several appropriate feature extractors are tested for each of the data sources, and their utility is evaluated using k-means and k-medoids clustering on a representative data subset. Results: The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation. Conclusion: The results suggest that fusing the embeddings of all three data sources together works best for the task of unsupervised clustering of large-scale medical data, resulting in the most concise clusters. Hence, this work is the first step towards building a much larger and more fine-grained annotated dataset of medical radiology images.

Fading memory as inductive bias in residual recurrent networks

  • paper_url: http://arxiv.org/abs/2307.14823
  • repo_url: None
  • paper_authors: Igor Dubinin, Felix Effenberger
  • for: This paper studies how residual connections in RNNs influence network performance, dynamics, and fading-memory properties.
  • methods: The authors introduce weakly coupled residual recurrent networks (WCRNNs), whose residual connections yield well-defined Lyapunov exponents, and study how different forms of residual connections affect performance on benchmark tasks (a minimal residual Elman cell sketch follows the abstract below).
  • results: Distinct forms of residual connections provide effective inductive biases that increase expressivity: connections that place the network dynamics near the edge of chaos, that let networks exploit characteristic spectral properties of the data, and that yield heterogeneous memory properties. The results extend to non-linear residuals, and a weakly coupled residual initialization scheme is introduced for Elman RNNs.
    Abstract Residual connections have been proposed as architecture-based inductive bias to mitigate the problem of exploding and vanishing gradients and increase task performance in both feed-forward and recurrent networks (RNNs) when trained with the backpropagation algorithm. Yet, little is known about how residual connections in RNNs influence their dynamics and fading memory properties. Here, we introduce weakly coupled residual recurrent networks (WCRNNs) in which residual connections result in well-defined Lyapunov exponents and allow for studying properties of fading memory. We investigate how the residual connections of WCRNNs influence their performance, network dynamics, and memory properties on a set of benchmark tasks. We show that several distinct forms of residual connections yield effective inductive biases that result in increased network expressivity. In particular, residual connections that (i) result in network dynamics at the proximity of the edge of chaos, (ii) allow networks to capitalize on characteristic spectral properties of the data, and (iii) result in heterogeneous memory properties are shown to increase practical expressivity. In addition, we demonstrate how our results can be extended to non-linear residuals and introduce a weakly coupled residual initialization scheme that can be used for Elman RNNs
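    Code sketch: a minimal Elman-style recurrent cell with a weighted residual connection on the hidden state, illustrating the general idea of residual recurrence; the specific couplings and initialization studied in the paper are not reproduced.

```python
import torch
import torch.nn as nn

class ResidualElmanCell(nn.Module):
    """h_t = alpha * h_{t-1} + tanh(W_x x_t + W_h h_{t-1}); the residual
    strength alpha acts as a fading-memory knob (illustrative only)."""
    def __init__(self, input_size, hidden_size, alpha=0.9):
        super().__init__()
        self.in_proj = nn.Linear(input_size, hidden_size)
        self.rec_proj = nn.Linear(hidden_size, hidden_size)
        self.alpha = alpha

    def forward(self, x_t, h_prev):
        return self.alpha * h_prev + torch.tanh(self.in_proj(x_t) + self.rec_proj(h_prev))
```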

Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning

  • paper_url: http://arxiv.org/abs/2307.14786
  • repo_url: https://github.com/jwh97nn/DeepDPS
  • paper_authors: Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Yifeng Geng, Xuansong Xie
  • for: Depth-aware panoptic segmentation, improving the robustness of scene understanding.
  • methods: Joint segmentation and depth estimation performed in a per-segment manner with identical object queries; a geometric query enhancement integrates scene geometry into the object queries via latent representations, and bi-directional guidance learning exploits the mutual relations between the two tasks to facilitate cross-task feature learning.
  • results: Sets the new state of the art on the Cityscapes-DVPS and SemKITTI-DVPS datasets; the guidance learning approach delivers performance improvement even under incomplete supervision labels.
    Abstract Depth-aware panoptic segmentation is an emerging topic in computer vision which combines semantic and geometric understanding for more robust scene interpretation. Recent works pursue unified frameworks to tackle this challenge but mostly still treat it as two individual learning tasks, which limits their potential for exploring cross-domain information. We propose a deeply unified framework for depth-aware panoptic segmentation, which performs joint segmentation and depth estimation both in a per-segment manner with identical object queries. To narrow the gap between the two tasks, we further design a geometric query enhancement method, which is able to integrate scene geometry into object queries using latent representations. In addition, we propose a bi-directional guidance learning approach to facilitate cross-task feature learning by taking advantage of their mutual relations. Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets. Moreover, our guidance learning approach is shown to deliver performance improvement even under incomplete supervision labels.

Contrastive Knowledge Amalgamation for Unsupervised Image Classification

  • paper_url: http://arxiv.org/abs/2307.14781
  • repo_url: None
  • paper_authors: Shangde Gao, Yichao Fu, Ke Liu, Yuqiang Han
  • for: Learn a compact student model that handles the joint objective of multiple teacher models, each specialized for its own task.
  • methods: Contrastive losses and an alignment loss are introduced to achieve intra-class cohesion and inter-class separation, while the student learns heterogeneous unsupervised classification tasks efficiently and flexibly through soft targets in the task-level amalgamation.
  • results: Experiments on multiple benchmarks demonstrate the generalization capability of the approach, and comprehensive ablation studies provide further insight.
    Abstract Knowledge amalgamation (KA) aims to learn a compact student model to handle the joint objective from multiple teacher models that are are specialized for their own tasks respectively. Current methods focus on coarsely aligning teachers and students in the common representation space, making it difficult for the student to learn the proper decision boundaries from a set of heterogeneous teachers. Besides, the KL divergence in previous works only minimizes the probability distribution difference between teachers and the student, ignoring the intrinsic characteristics of teachers. Therefore, we propose a novel Contrastive Knowledge Amalgamation (CKA) framework, which introduces contrastive losses and an alignment loss to achieve intra-class cohesion and inter-class separation.Contrastive losses intra- and inter- models are designed to widen the distance between representations of different classes. The alignment loss is introduced to minimize the sample-level distribution differences of teacher-student models in the common representation space.Furthermore, the student learns heterogeneous unsupervised classification tasks through soft targets efficiently and flexibly in the task-level amalgamation. Extensive experiments on benchmarks demonstrate the generalization capability of CKA in the amalgamation of specific task as well as multiple tasks. Comprehensive ablation studies provide a further insight into our CKA.

pCTFusion: Point Convolution-Transformer Fusion with Semantic Aware Loss for Outdoor LiDAR Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2307.14777
  • repo_url: https://github.com/GeoAI-Research-Lab/PCTFusion
  • paper_authors: Abhishek Kuriyal, Vaibhav Kumar, Bharat Lohani
  • for: This work aims to improve LiDAR point cloud semantic segmentation, particularly the recognition of minor classes and points near class boundaries.
  • methods: A new architecture, pCTFusion, combines kernel-based convolutions with local and global self-attention to improve feature learning and capture local and global dependencies in point clouds; a novel attention-based loss, Pointwise Geometric Anisotropy (PGA), weights points according to the semantic distribution of their neighborhood.
  • results: Evaluated on the SemanticKITTI outdoor dataset, pCTFusion improves performance by 5-7% over state-of-the-art architectures, with especially pronounced gains on minor classes that are often misclassified due to class imbalance; the developed methods can be applied to complex point cloud datasets and drive real-world applications of LiDAR point clouds.
    Abstract LiDAR-generated point clouds are crucial for perceiving outdoor environments. The segmentation of point clouds is also essential for many applications. Previous research has focused on using self-attention and convolution (local attention) mechanisms individually in semantic segmentation architectures. However, there is limited work on combining the learned representations of these attention mechanisms to improve performance. Additionally, existing research that combines convolution with self-attention relies on global attention, which is not practical for processing large point clouds. To address these challenges, this study proposes a new architecture, pCTFusion, which combines kernel-based convolutions and self-attention mechanisms for better feature learning and capturing local and global dependencies in segmentation. The proposed architecture employs two types of self-attention mechanisms, local and global, based on the hierarchical positions of the encoder blocks. Furthermore, the existing loss functions do not consider the semantic and position-wise importance of the points, resulting in reduced accuracy, particularly at sharp class boundaries. To overcome this, the study models a novel attention-based loss function called Pointwise Geometric Anisotropy (PGA), which assigns weights based on the semantic distribution of points in a neighborhood. The proposed architecture is evaluated on SemanticKITTI outdoor dataset and showed a 5-7% improvement in performance compared to the state-of-the-art architectures. The results are particularly encouraging for minor classes, often misclassified due to class imbalance, lack of space, and neighbor-aware feature encoding. These developed methods can be leveraged for the segmentation of complex datasets and can drive real-world applications of LiDAR point cloud.

3DPortraitGAN: Learning One-Quarter Headshot 3D GANs from a Single-View Portrait Dataset with Diverse Body Poses

  • paper_url: http://arxiv.org/abs/2307.14770
  • repo_url: https://github.com/oneThousand1000/3DPortraitGAN
  • paper_authors: Yiqian Wu, Hao Xu, Xiangjun Tang, Hongbo Fu, Xiaogang Jin
  • for: Generate one-quarter headshot 3D portraits with complete head, neck, and shoulder geometry that are consistent across all camera angles.
  • methods: The authors build the 360°PHQ dataset of high-quality single-view real portraits annotated with camera parameters (yaw angles spanning the full 360° range) and body poses, and propose 3DPortraitGAN, a one-quarter headshot 3D portrait generator that learns a canonical 3D avatar distribution from this dataset with body pose self-learning.
  • results: Experiments show the framework accurately predicts portrait body poses and generates view-consistent, realistic portraits with complete head, neck, and shoulder geometry from all camera angles.
    Abstract 3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data, and as such, they are unable to construct one-quarter headshot 3D portraits with complete head, neck, and shoulder geometry. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360{\deg}-Portrait-HQ (360{\deg}PHQ for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360{\deg} range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360{\deg}PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.

Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining

  • paper_url: http://arxiv.org/abs/2307.14768
  • repo_url: https://github.com/zhoubenjia/gfslt-vlp
  • paper_authors: Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, Du Zhang
  • For: This paper aims to improve the task of sign language translation (SLT) by addressing the challenge of using an intermediate representation, such as gloss sequences, which can hinder the development of SLT.
  • Methods: The proposed method, Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), inherits language-oriented prior knowledge from pre-trained models without any gloss annotation assistance. It involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
  • Results: The proposed method achieves unprecedented improvements in BLEU-4 score on the PHOENIX14T dataset (>+5) and the CSL-Daily dataset (>+3) compared to state-of-the-art gloss-free SLT methods, and achieves competitive results on PHOENIX14T when compared with most of the gloss-based methods.
    Abstract Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language to text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign language translation (SLT). However, the scarcity of gloss-annotated sign language data, combined with the information bottleneck in the mid-level gloss representation, has hindered the further development of the SLT task. To address this challenge, we propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves SLT by inheriting language-oriented prior knowledge from pre-trained models, without any gloss annotation assistance. Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage. The seamless combination of these novel designs forms a robust sign language representation and significantly improves gloss-free sign language translation. In particular, we have achieved unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset (>+5) and the CSL-Daily dataset (>+3) compared to state-of-the-art gloss-free SLT methods. Furthermore, our approach also achieves competitive results on the PHOENIX14T dataset when compared with most of the gloss-based methods. Our code is available at https://github.com/zhoubenjia/GFSLT-VLP.

Semantic Image Completion and Enhancement using GANs

  • paper_url: http://arxiv.org/abs/2307.14748
  • repo_url: None
  • paper_authors: Priyansh Saxena, Raahat Gupta, Akshat Maheshwari, Saumil Maheshwari
  • for: This paper focuses on image completion (semantic inpainting) and enhancement.
  • methods: Generative adversarial networks (GANs) are used for the image completion task.
  • results: GANs help complete and enhance images, improving the quality of the output image.
    Abstract Semantic inpainting or image completion alludes to the task of inferring arbitrary large missing regions in images based on image semantics. Since the prediction of image pixels requires an indication of high-level context, this makes it significantly tougher than image completion, which is often more concerned with correcting data corruption and removing entire objects from the input image. On the other hand, image enhancement attempts to eliminate unwanted noise and blur from the image, along with sustaining most of the image details. Efficient image completion and enhancement model should be able to recover the corrupted and masked regions in images and then refine the image further to increase the quality of the output image. Generative Adversarial Networks (GAN), have turned out to be helpful in picture completion tasks. In this chapter, we will discuss the underlying GAN architecture and how they can be used used for image completion tasks.

Test Time Adaptation for Blind Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.14735
  • repo_url: https://github.com/shankhanil006/tta-iqa
  • paper_authors: Subhadeep Roy, Shankhanil Mitra, Soma Biswas, Rajiv Soundararajan
  • for: Improve the inference-time performance of blind image quality assessment (IQA) algorithms, which degrade under the distribution shift between training and testing scenarios.
  • methods: Two novel quality-relevant auxiliary tasks enable test-time adaptation: a group contrastive loss at the batch level and a relative rank loss at the sample level, making the model quality-aware and adapting it to the target data (a minimal batch-norm adaptation sketch follows the abstract below).
  • results: Even a small batch of test-distribution images yields significant improvement by updating the batch normalization statistics of the source model.
    Abstract While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.
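    Code sketch: the batch-normalization-statistics update mentioned in the results bullet, refreshed on a small batch from the test distribution; the quality-aware auxiliary losses (group contrastive and relative rank) are not shown here.

```python
import torch

def adapt_bn_statistics(model, test_batch):
    """Refresh BatchNorm running statistics on a small test-distribution batch
    (one ingredient of test-time adaptation; auxiliary losses omitted)."""
    model.train()                                    # BN layers use batch statistics
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()                  # forget source-domain statistics
            m.momentum = None                        # use a cumulative moving average
    with torch.no_grad():
        model(test_batch)                            # forward pass updates the stats
    model.eval()
    return model
```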

Understanding Silent Failures in Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.14729
  • repo_url: https://github.com/iml-dkfz/sf-visuals
  • paper_authors: Till J. Bungert, Levin Kobelke, Paul F. Jaeger
  • for: Prevent silent failures to ensure the reliable use of classification systems in medical applications.
  • methods: Either design classifiers robust enough to avoid failures in the first place, or detect remaining failures using confidence scoring functions (CSFs); various CSFs are benchmarked across four biomedical tasks and a diverse range of distribution shifts.
  • results: None of the benchmarked CSFs can reliably prevent silent failures, so a deeper understanding of the root causes of failures in the data is required; the authors introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures.
    Abstract To ensure the reliable use of classification systems in medical applications, it is crucial to prevent silent failures. This can be achieved by either designing classifiers that are robust enough to avoid failures in the first place, or by detecting remaining failures using confidence scoring functions (CSFs). A predominant source of failures in image classification is distribution shifts between training data and deployment data. To understand the current state of silent failure prevention in medical imaging, we conduct the first comprehensive analysis comparing various CSFs in four biomedical tasks and a diverse range of distribution shifts. Based on the result that none of the benchmarked CSFs can reliably prevent silent failures, we conclude that a deeper understanding of the root causes of failures in the data is required. To facilitate this, we introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures. On the basis of various examples, we demonstrate how this tool can help researchers gain insight into the requirements for safe application of classification systems in the medical domain. The open-source benchmark and tool are at: https://github.com/IML-DKFZ/sf-visuals.

P2C: Self-Supervised Point Cloud Completion from Single Partial Clouds

  • paper_url: http://arxiv.org/abs/2307.14726
  • repo_url: https://github.com/cuiruikai/partial2complete
  • paper_authors: Ruikai Cui, Shi Qiu, Saeed Anwar, Jiawei Liu, Chaoyue Xing, Jing Zhang, Nick Barnes
  • for: This work aims to complete object shapes from partial point cloud observations.
  • methods: The authors propose a self-supervised method, Partial2Complete (P2C), that completes object shapes using only a single incomplete point cloud per object for training: incomplete clouds are grouped into local patches, masked patches are predicted from prior information learned across different partial objects, and completion is regularized with a Region-Aware Chamfer Distance and a Normal Consistency Constraint (a vanilla Chamfer distance sketch follows the abstract below).
  • results: P2C produces results comparable to methods trained with complete shapes and outperforms methods learned with multiple partial observations, on both synthetic ShapeNet and real-world ScanNet data. Code is available at https://github.com/CuiRuikai/Partial2Complete.
    Abstract Point cloud completion aims to recover the complete shape based on a partial observation. Existing methods require either complete point clouds or multiple partial observations of the same object for learning. In contrast to previous approaches, we present Partial2Complete (P2C), the first self-supervised framework that completes point cloud objects using training samples consisting of only a single incomplete point cloud per object. Specifically, our framework groups incomplete point clouds into local patches as input and predicts masked patches by learning prior information from different partial objects. We also propose Region-Aware Chamfer Distance to regularize shape mismatch without limiting completion capability, and devise the Normal Consistency Constraint to incorporate a local planarity assumption, encouraging the recovered shape surface to be continuous and complete. In this way, P2C no longer needs multiple observations or complete point clouds as ground truth. Instead, structural cues are learned from a category-specific dataset to complete partial point clouds of objects. We demonstrate the effectiveness of our approach on both synthetic ShapeNet data and real-world ScanNet data, showing that P2C produces comparable results to methods trained with complete shapes, and outperforms methods learned with multiple partial observations. Code is available at https://github.com/CuiRuikai/Partial2Complete.
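    Code sketch: the vanilla symmetric Chamfer distance between two point sets; the paper's Region-Aware Chamfer Distance and Normal Consistency Constraint build on top of this and are not reproduced here.

```python
import torch

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    d = torch.cdist(p, q)                            # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```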

vox2vec: A Framework for Self-supervised Contrastive Learning of Voxel-level Representations in Medical Images

  • paper_url: http://arxiv.org/abs/2307.14725
  • repo_url: https://github.com/mishgon/vox2vec
  • paper_authors: Mikhail Goncharov, Vera Soboleva, Anvar Kurmukov, Maxim Pisov, Mikhail Belyaev
  • for: 本文提出了一种自监督学习(SSL)方法,用于学习体素级表示。
  • methods: 本文采用对比学习方法,由特征金字塔网络(FPN)建模体素级表示:体素表示由不同金字塔层级对应的特征向量拼接而成。FPN 经过预训练,使同一体素在不同增强上下文中的表示相似,而不同体素的表示相互区分。
  • results: 本文使用 vox2vec 在 6500 多张公开的计算机断层扫描(CT)图像上预训练 FPN,随后在其上附加简单的头部并针对 22 个分割任务训练。结果显示,vox2vec 在线性探测、非线性探测和端到端微调三种评估设置下均优于现有的医学影像 SSL 技术。此外,在冻结的 vox2vec 表示上训练非线性头部,可取得与从头训练 FPN 相当的性能,而可训练参数少 50 倍。代码可在 https://github.com/mishgon/vox2vec 获取。
    Abstract This paper introduces vox2vec - a contrastive method for self-supervised learning (SSL) of voxel-level representations. vox2vec representations are modeled by a Feature Pyramid Network (FPN): a voxel representation is a concatenation of the corresponding feature vectors from different pyramid levels. The FPN is pre-trained to produce similar representations for the same voxel in different augmented contexts and distinctive representations for different voxels. This results in unified multi-scale representations that capture both global semantics (e.g., body part) and local semantics (e.g., different small organs or healthy versus tumor tissue). We use vox2vec to pre-train a FPN on more than 6500 publicly available computed tomography images. We evaluate the pre-trained representations by attaching simple heads on top of them and training the resulting models for 22 segmentation tasks. We show that vox2vec outperforms existing medical imaging SSL techniques in three evaluation setups: linear and non-linear probing and end-to-end fine-tuning. Moreover, a non-linear head trained on top of the frozen vox2vec representations achieves competitive performance with the FPN trained from scratch while having 50 times fewer trainable parameters. The code is available at https://github.com/mishgon/vox2vec .
    摘要 本文提出 vox2vec——一种用于学习体素级表示的对比自监督学习(SSL)方法。体素表示由特征金字塔网络(FPN)不同层级的特征向量拼接而成;FPN 经预训练,使同一体素在不同增强上下文中的表示相似、不同体素的表示相互区分,从而获得同时捕捉全局语义(如身体部位)与局部语义(如不同小器官、健康组织与肿瘤组织)的统一多尺度表示。我们在 6500 多张公开的 CT 图像上用 vox2vec 预训练 FPN,并在其上附加简单的头部,针对 22 个分割任务训练并评估。结果表明,vox2vec 在线性探测、非线性探测和端到端微调三种评估设置下均优于现有医学影像 SSL 技术;在冻结的 vox2vec 表示上训练非线性头部,可取得与从头训练 FPN 相当的性能,而可训练参数少 50 倍。代码见 https://github.com/mishgon/vox2vec 。
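
A minimal PyTorch sketch of a voxel-level contrastive (InfoNCE) objective in the spirit of vox2vec: features of the same voxel under two augmented views are pulled together and features of different voxels pushed apart. The feature dimensionality, temperature, and the way voxel descriptors are obtained (in the paper, by concatenating FPN levels) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def voxel_info_nce(feats_a: torch.Tensor, feats_b: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """feats_a, feats_b: (N, D) features of the same N voxels under two
    augmented views (row i in both tensors is the same physical voxel)."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: each voxel must match itself across views.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for FPN-concatenated voxel descriptors.
feats_a = torch.randn(512, 128)
feats_b = feats_a + 0.05 * torch.randn(512, 128)
print(voxel_info_nce(feats_a, feats_b).item())
```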

EFLNet: Enhancing Feature Learning for Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2307.14723
  • repo_url: None
  • paper_authors: Bo Yang, Xinyu Zhang, Jiahao Zhu, Jian Zhang, Dongjian Tian, Jun Luo, Mingliang Zhou, Yangjun Pi
  • for: 这篇论文旨在解决单帧红外小目标检测中的挑战:目标与背景之间的极端不平衡、边界框回归对红外小目标的高度敏感,以及小目标信息在高层语义层中容易丢失。
  • methods: 论文提出了基于 YOLOv7 框架的增强特征学习网络(EFLNet)来解决这些问题。首先,我们提出了一种自适应阈值焦点损失函数,自动调整损失权重,促使模型将更多注意力分配给目标特征。其次,我们引入归一化高斯 Wasserstein 距离,以缓解 bounding box regression 对红外小目标的极高敏感性。最后,我们在网络中加入动态头机制,以自适应地学习每个语义层的相对重要性。
  • results: 实验结果表明,我们的方法在红外小目标检测上优于当前基于深度学习的方法。
    Abstract Single-frame infrared small target detection is considered to be a challenging task, due to the extreme imbalance between target and background, bounding box regression is extremely sensitive to infrared small targets, and small target information is easy to lose in the high-level semantic layer. In this paper, we propose an enhancing feature learning network (EFLNet) based on YOLOv7 framework to solve these problems. First, we notice that there is an extremely imbalance between the target and the background in the infrared image, which makes the model pay more attention to the background features, resulting in missed detection. To address this problem, we propose a new adaptive threshold focal loss function that adjusts the loss weight automatically, compelling the model to allocate greater attention to target features. Second, we introduce the normalized Gaussian Wasserstein distance to alleviate the difficulty of model convergence caused by the extreme sensitivity of the bounding box regression to infrared small targets. Finally, we incorporate a dynamic head mechanism into the network to enable adaptive learning of the relative importance of each semantic layer. Experimental results demonstrate our method can achieve better performance in the detection performance of infrared small targets compared to state-of-the-art deep-learning based methods.
    摘要 单帧红外小目标检测是一项具有挑战性的任务:目标和背景之间存在极端不平衡,bounding box regression 对红外小目标非常敏感,小目标信息也容易在高层 semantic layer 中丢失。在这篇文章中,我们提出了基于 YOLOv7 框架的增强特征学习网络(EFLNet)来解决这些问题。首先,我们注意到红外图像中目标和背景之间的极端不平衡使模型更倾向于关注背景特征,从而导致漏检;为此,我们提出了一种新的自适应阈值焦点损失函数,可以自动调整损失权重,促使模型将更多注意力分配给目标特征。其次,我们引入了归一化 Gaussian Wasserstein 距离,以缓解 bounding box regression 对红外小目标过于敏感而带来的收敛困难。最后,我们将动态头机制引入网络,使其能够自适应地学习各个 semantic layer 的相对重要性。实验结果显示,我们的方法在红外小目标检测上优于 state-of-the-art 的深度学习方法。
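
The normalized Gaussian Wasserstein distance used above treats each box as a 2D Gaussian; the sketch below follows the commonly used closed form for axis-aligned boxes, which may differ in detail from EFLNet's exact implementation, and the constant C is a dataset-dependent assumption.

```python
import numpy as np

def nwd(box1, box2, C: float = 12.8) -> float:
    """Normalized Gaussian Wasserstein distance between two axis-aligned
    boxes given as (cx, cy, w, h). Each box is modelled as a 2D Gaussian
    N([cx, cy], diag((w/2)^2, (h/2)^2)); for such Gaussians the squared
    2-Wasserstein distance has a simple closed form."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_dist = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2 +
               (w1 / 2 - w2 / 2) ** 2 + (h1 / 2 - h2 / 2) ** 2)
    return float(np.exp(-np.sqrt(w2_dist) / C))

# Toy usage: two slightly offset 6x6 "small target" boxes.
print(nwd((10.0, 10.0, 6.0, 6.0), (11.0, 10.0, 6.0, 6.0)))
```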

Robust vertebra identification using simultaneous node and edge predicting Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2308.02509
  • repo_url: https://github.com/imfusiongmbh/vid-vertebra-identification-dataset
  • paper_authors: Vincent Bürgin, Raphael Prevost, Marijn F. Stollenga
  • for: 这篇论文旨在自动定位和识别 CT 扫描中的椎骨,以辅助诊断和治疗。
  • methods: 论文采用一个简单的管线:先用标准的 U-Net 进行标志点预测,再用单个图神经网络对椎骨进行关联和分类,并给出完整的方向信息。
  • results: 结果显示,该方法能够正确地关联椎体与椎弓根标志点、忽略假阳性,并在标准 VerSe 挑战的椎体识别任务上取得有竞争力的表现。
    Abstract Automatic vertebra localization and identification in CT scans is important for numerous clinical applications. Much progress has been made on this topic, but it mostly targets positional localization of vertebrae, ignoring their orientation. Additionally, most methods employ heuristics in their pipeline that can be sensitive in real clinical images which tend to contain abnormalities. We introduce a simple pipeline that employs a standard prediction with a U-Net, followed by a single graph neural network to associate and classify vertebrae with full orientation. To test our method, we introduce a new vertebra dataset that also contains pedicle detections that are associated with vertebra bodies, creating a more challenging landmark prediction, association and classification task. Our method is able to accurately associate the correct body and pedicle landmarks, ignore false positives and classify vertebrae in a simple, fully trainable pipeline avoiding application-specific heuristics. We show our method outperforms traditional approaches such as Hungarian Matching and Hidden Markov Models. We also show competitive performance on the standard VerSe challenge body identification task.
    摘要 在 CT 扫描中自动定位和识别椎骨对许多临床应用至关重要。该领域已有很大进展,但大多数方法只关注椎骨的位置定位,忽略了其方向。此外,许多方法在流程中使用了启发式规则,而这些规则在往往包含异常的真实临床图像上可能很敏感。我们提出了一个简单的管线:先用标准的 U-Net 进行预测,再用单个图神经网络对椎骨进行关联和分类,并给出完整的方向信息。为了测试该方法,我们构建了一个新的椎骨数据集,其中还包含与椎体相关联的椎弓根检测,构成了更具挑战性的标志点预测、关联与分类任务。我们的方法能够准确关联正确的椎体与椎弓根标志点、忽略假阳性,并在一个简单、完全可训练的管线中完成椎骨分类,避免了特定应用的启发式规则。结果表明,我们的方法优于匈牙利匹配和隐马尔可夫模型等传统方法,并在标准 VerSe 挑战的椎体识别任务上取得了有竞争力的表现。
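
Hungarian matching is the traditional association baseline the paper compares against; a minimal sketch of that baseline with scipy's linear_sum_assignment is below, using plain Euclidean distance between candidate landmarks as the (assumed) cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_landmarks(bodies: np.ndarray, pedicles: np.ndarray):
    """Hungarian-matching baseline: assign each vertebra-body landmark
    (N,3) to a pedicle landmark (M,3) by minimising total Euclidean cost."""
    cost = np.linalg.norm(bodies[:, None, :] - pedicles[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)      # optimal 1-to-1 assignment
    return list(zip(rows.tolist(), cols.tolist())), cost[rows, cols].sum()

# Toy usage: 3 body centroids and 3 candidate pedicle centroids (mm).
bodies = np.array([[0., 0., 0.], [0., 0., 25.], [0., 0., 50.]])
pedicles = np.array([[5., 2., 52.], [5., 2., 1.], [5., 2., 26.]])
pairs, total = associate_landmarks(bodies, pedicles)
print(pairs, total)
```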

GaitMorph: Transforming Gait by Optimally Transporting Discrete Codes

  • paper_url: http://arxiv.org/abs/2307.14713
  • repo_url: None
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: 本研究旨在提出一种无需人工标注的步态识别系统训练方法,通过自监督学习来实现。
  • methods: 我们提出一种针对步态骨架序列的高压缩率模型,利用无标注数据构建离散且可解释、保留身份相关特征的潜在空间,并基于最优传输理论在离散码本上学习潜在传输映射,以在不同变化之间变换步态序列。
  • results: 我们进行了广泛的实验,证明了该方法能够为输入步态序列合成额外的视图与变化。
    Abstract Gait, the manner of walking, has been proven to be a reliable biometric with uses in surveillance, marketing and security. A promising new direction for the field is training gait recognition systems without explicit human annotations, through self-supervised learning approaches. Such methods are heavily reliant on strong augmentations for the same walking sequence to induce more data variability and to simulate additional walking variations. Current data augmentation schemes are heuristic and cannot provide the necessary data variation as they are only able to provide simple temporal and spatial distortions. In this work, we propose GaitMorph, a novel method to modify the walking variation for an input gait sequence. Our method entails the training of a high-compression model for gait skeleton sequences that leverages unlabelled data to construct a discrete and interpretable latent space, which preserves identity-related features. Furthermore, we propose a method based on optimal transport theory to learn latent transport maps on the discrete codebook that morph gait sequences between variations. We perform extensive experiments and show that our method is suitable to synthesize additional views for an input sequence.
    摘要 步态,即行走方式,已被证明是一种可靠的生物特征,可用于监控、营销和安防。该领域一个有前景的新方向是通过自监督学习来训练步态识别系统,而无需显式的人工标注。这类方法严重依赖对同一行走序列的强数据增强,以引入更多的数据多样性并模拟额外的行走变化。然而,现有的数据增强方案多为启发式,只能提供简单的时域和空间扰动,无法提供所需的数据多样性。在本工作中,我们提出 GaitMorph,一种为输入步态序列修改行走变化的新方法。该方法训练一个针对步态骨架序列的高压缩率模型,利用无标注数据构建一个离散且可解释、同时保留身份相关特征的潜在空间;并基于最优传输理论,在离散码本上学习潜在传输映射,实现步态序列在不同变化之间的变换。大量实验表明,我们的方法适合为输入序列合成额外的视图。
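
GaitMorph learns transport maps on a discrete codebook via optimal transport theory; a generic sketch of entropy-regularised optimal transport (Sinkhorn iterations) between two code-usage histograms is shown below. The cost matrix, histograms, and regularisation strength are illustrative assumptions, not the paper's training procedure.

```python
import numpy as np

def sinkhorn(a: np.ndarray, b: np.ndarray, cost: np.ndarray,
             eps: float = 0.05, n_iter: int = 200) -> np.ndarray:
    """Entropy-regularised OT plan between histograms a (n,) and b (m,)
    with cost matrix (n, m); returns the transport plan."""
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):                 # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy usage: code-usage histograms of two gait variations over 4 codes.
a = np.array([0.4, 0.3, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.3, 0.4])
cost = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)
plan = sinkhorn(a, b, cost)
print(plan.round(3), plan.sum())
```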

Pre-training Vision Transformers with Very Limited Synthesized Images

  • paper_url: http://arxiv.org/abs/2307.14710
  • repo_url: https://github.com/ryoo-nakamura/ofdb
  • paper_authors: Ryo Nakamura, Hirokatsu Kataoka, Sora Takashima, Edgar Josafat Martinez Noriega, Rio Yokota, Nakamasa Inoue
  • for: 这个论文主要是用于探讨Formula-driven supervised learning(FDSL)是一种预训练方法,它利用生成自数学方程的Synthetic图像进行预训练。
  • methods: 这个论文使用的方法是将视transformer预训练在基于数学方程生成的Synthetic图像集上。
  • results: 研究发现,通过将不同类别的Synthetic图像视为数据增强,可以使模型在下游任务中表现更好。此外, authors还证明了这种方法可以使用远少于ImageNet-21k的数据集进行预训练,并且可以与ImageNet-21k相当或者超过其在ImageNet-1k fine-tuning中的性能。
    Abstract Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters in the mathematical formula that generate them. In the present work, we hypothesize that the process for generating different instances for the same category in FDSL, can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we only need a single image per category. Our experiments shows that this one-instance fractal database (OFDB) performs better than the original dataset where instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is 21k, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.
    摘要 数学公式驱动的监督学习(FDSL)是一种预训练方法,利用由分形等数学公式生成的合成图像进行预训练。先前的研究表明,在这些合成图像上预训练视觉 transformer 可以在多种下游任务上取得有竞争力的精度。这些合成图像按照生成它们的数学公式参数进行分类。在本工作中,我们假设 FDSL 中为同一类别生成不同实例的过程可以视为一种数据增强。我们用数据增强替换这些实例来验证该假设,即每个类别仅需一张图像。实验表明,这种单实例分形数据库(OFDB)的效果优于显式生成多个实例的原始数据集。我们进一步将 OFDB 扩展到 21,000 个类别,其在 ImageNet-1k 微调上的性能可以匹敌甚至超过在 ImageNet-21k 上预训练的模型。OFDB 仅包含 2.1 万张图像,而 ImageNet-21k 有 1400 万张。这为使用小得多的数据集预训练视觉 transformer 开辟了新的可能。
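
FDSL renders its training images directly from mathematical formulae such as fractals; a minimal sketch of rendering one fractal "category" from an iterated function system with the chaos game is below. The affine maps here are the classic Barnsley-fern parameters and merely stand in for a randomly parameterised category; the image size and point count are arbitrary.

```python
import numpy as np

def render_ifs(maps, probs, n_points=50_000, size=256) -> np.ndarray:
    """Chaos game: repeatedly apply a randomly chosen affine map
    (A, t) to a point and rasterise the visited points into an image."""
    rng = np.random.default_rng(0)
    img = np.zeros((size, size), dtype=np.uint8)
    p = np.zeros(2)
    pts = []
    for _ in range(n_points):
        A, t = maps[rng.choice(len(maps), p=probs)]
        p = A @ p + t
        pts.append(p.copy())
    pts = np.array(pts)
    # Normalise coordinates into the image grid.
    lo, hi = pts.min(0), pts.max(0)
    xy = ((pts - lo) / (hi - lo + 1e-9) * (size - 1)).astype(int)
    img[size - 1 - xy[:, 1], xy[:, 0]] = 255
    return img

# Barnsley-fern IFS: one "category" defined purely by formula parameters.
maps = [(np.array([[0.00, 0.00], [0.00, 0.16]]), np.array([0.0, 0.00])),
        (np.array([[0.85, 0.04], [-0.04, 0.85]]), np.array([0.0, 1.60])),
        (np.array([[0.20, -0.26], [0.23, 0.22]]), np.array([0.0, 1.60])),
        (np.array([[-0.15, 0.28], [0.26, 0.24]]), np.array([0.0, 0.44]))]
probs = [0.01, 0.85, 0.07, 0.07]
print(render_ifs(maps, probs).sum())
```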

Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation

  • paper_url: http://arxiv.org/abs/2307.14709
  • repo_url: https://github.com/camwew/tada-mi
  • paper_authors: Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, Weidong Cai
  • for: 提升自动医疗图像分析的效果,同时减轻标注数据收集的负担。
  • methods: 提出了一种名为“Optimization Trajectory Distillation”(优化轨迹蒸馏)的新方法,利用梯度空间的低秩特性和双流蒸馏算法来规范网络训练的学习动态,并在多种预测任务上进行了广泛评估。
  • results: 与以往方法相比,该方法取得了更好的效果。
    Abstract The success of automated medical image analysis depends on large-scale and expert-annotated training sets. Unsupervised domain adaptation (UDA) has been raised as a promising approach to alleviate the burden of labeled data collection. However, they generally operate under the closed-set adaptation setting assuming an identical label set between the source and target domains, which is over-restrictive in clinical practice where new classes commonly exist across datasets due to taxonomic inconsistency. While several methods have been presented to tackle both domain shifts and incoherent label sets, none of them take into account the common characteristics of the two issues and consider the learning dynamics along network training. In this work, we propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective. It exploits the low-rank nature of gradient space and devises a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domain and classes with the external guidance obtained from reliable sources. Our approach resolves the issue of inadequate navigation along network optimization, which is the major obstacle in the taxonomy adaptive cross-domain adaptation scenario. We evaluate the proposed method extensively on several tasks towards various endpoints with clinical and open-world significance. The results demonstrate its effectiveness and improvements over previous methods.
    摘要 自动医疗图像分析的成功依赖于大规模且由专家标注的训练集。无监督域适应(UDA)被认为是减轻标注数据收集负担的一种有前景的方法。然而,这类方法通常在闭集适应设定下运行,即假设源域与目标域的标签集相同;而在临床实践中,由于分类体系不一致,数据集之间常常存在新的类别,这一假设过于严格。虽然已有一些方法同时处理域偏移与不一致的标签集,但它们都没有考虑这两个问题的共同特性,也没有关注网络训练过程中的学习动态。在本工作中,我们从一个新的视角提出优化轨迹蒸馏,以统一的方式应对这两个技术挑战。该方法利用梯度空间的低秩特性,设计双流蒸馏算法,借助来自可靠来源的外部指导,规范标注不足的域和类别的学习动态。我们的方法解决了网络优化过程中导航不足的问题,而这正是分类体系自适应的跨域适应场景中的主要障碍。我们在多个具有临床和开放世界意义的任务上进行了广泛评估,结果表明该方法有效且优于以往方法。

High Dynamic Range Imaging via Visual Attention Modules

  • paper_url: http://arxiv.org/abs/2307.14705
  • repo_url: https://github.com/alirezaomrani95/hdr-vam
  • paper_authors: Ali Reza Omrani, Davide Moroni
  • for: 这篇论文是为了提出一种新的高动态范围(HDR)成像方法,以增强图像的质量和细节。
  • methods: 该方法使用了深度学习架构,并通过视觉注意力模块(VAM)来提取图像中最重要的部分。
  • results: 实验结果表明,该方法优于大多数现有的 State-Of-The-Art 算法。
    Abstract Thanks to High Dynamic Range (HDR) imaging methods, the scope of photography has seen profound changes recently. To be more specific, such methods try to reconstruct the lost luminosity of the real world caused by the limitation of regular cameras from the Low Dynamic Range (LDR) images. Additionally, although the State-Of-The-Art methods in this topic perform well, they mainly concentrate on combining different exposures and have less attention to extracting the informative parts of the images. Thus, this paper aims to introduce a new model capable of incorporating information from the most visible areas of each image extracted by a visual attention module (VAM), which is a result of a segmentation strategy. In particular, the model, based on a deep learning architecture, utilizes the extracted areas to produce the final HDR image. The results demonstrate that our method outperformed most of the State-Of-The-Art algorithms.
    摘要 高动态范围(HDR)摄影技术已经对摄影领域带来深刻的变革。更具体地说,这些方法试图重建实际世界中丢失的亮度,即常规相机的低动态范围(LDR)图像所带来的限制。尽管现状领先的方法在这个领域具有良好的性能,但它们主要集中在不同曝光的结合上,对图像中有用信息的提取则得到了更少的关注。因此,本文旨在介绍一种新的模型,可以通过视觉注意力模块(VAM) segmentation策略提取图像中最可见的部分,并使用这些部分生成最终的HDR图像。结果表明,我们的方法在大多数状态前进的算法中表现出色。

MIM-OOD: Generative Masked Image Modelling for Out-of-Distribution Detection in Medical Images

  • paper_url: http://arxiv.org/abs/2307.14701
  • repo_url: None
  • paper_authors: Sergio Naval Marimont, Vasilis Siomos, Giacomo Tarroni
  • for: 这篇论文旨在提出一种仅利用健康解剖结构图像训练模型的无监督分布外(OOD)检测方法。
  • methods: 该方法先对图像进行 tokenization,再用两个任务专用网络取代传统的自回归(AR)模型:一个 transformer 用于标识异常 token,另一个 transformer 通过掩码图像建模(MIM)修补异常 token 的表示。
  • results: 在脑部磁共振成像(MRI)异常检测上的实验表明,MIM-OOD 显著优于 AR 模型(DICE 0.458 vs 0.301),同时实现了约 25 倍的加速(9.5s vs 244s)。
    Abstract Unsupervised Out-of-Distribution (OOD) detection consists in identifying anomalous regions in images leveraging only models trained on images of healthy anatomy. An established approach is to tokenize images and model the distribution of tokens with Auto-Regressive (AR) models. AR models are used to 1) identify anomalous tokens and 2) in-paint anomalous representations with in-distribution tokens. However, AR models are slow at inference time and prone to error accumulation issues which negatively affect OOD detection performance. Our novel method, MIM-OOD, overcomes both speed and error accumulation issues by replacing the AR model with two task-specific networks: 1) a transformer optimized to identify anomalous tokens and 2) a transformer optimized to in-paint anomalous tokens using masked image modelling (MIM). Our experiments with brain MRI anomalies show that MIM-OOD substantially outperforms AR models (DICE 0.458 vs 0.301) while achieving a nearly 25x speedup (9.5s vs 244s).
    摘要 无监督分布外(OOD)检测旨在仅利用在健康解剖结构图像上训练的模型来识别图像中的异常区域。一种常见的做法是先将图像转化为 tokens,再用自回归(AR)模型对 token 的分布进行建模:AR 模型可用于 1) 识别异常 tokens,2) 用分布内的 tokens 修补(in-paint)异常表示。然而,AR 模型推理速度慢,且容易出现误差累积问题,这会对 OOD 检测性能产生负面影响。我们的新方法 MIM-OOD 通过用两个任务专用网络取代 AR 模型解决了这两个问题:1)一个经过优化、用于识别异常 tokens 的 transformer,2)一个基于掩码图像建模(MIM)、用于修补异常 tokens 的 transformer。我们在脑部 MRI 异常上进行了实验,结果显示 MIM-OOD 显著优于 AR 模型(DICE 0.458 vs 0.301),同时实现了约 25 倍的加速(9.5s vs 244s)。

Unified Adversarial Patch for Visible-Infrared Cross-modal Attacks in the Physical World

  • paper_url: http://arxiv.org/abs/2307.14682
  • repo_url: https://github.com/aries-iai/cross-modal_patch_attack
  • paper_authors: Xingxing Wei, Yao Huang, Yitong Sun, Jie Yu
  • for: 防御深度学习目标检测器受到物理攻击的威胁;许多场景通过结合可见光与红外传感器来增强安全性。
  • methods: 我们提出了一种统一的跨模态对抗补丁,可以用单个补丁同时在可见光和红外两种模态下实现欺骗。我们还提出了一种边界受限的形状优化方法,以获得紧凑且平滑、便于物理实现的补丁形状,以及一种分数感知的迭代评估方法,用于平衡两种模态下的欺骗程度。
  • results: 我们的方法在多种场景下都达到了高于 80% 的攻击成功率,并在物理世界场景下进行了评估,包括不同的角度、距离、姿势和场景。
    Abstract Physical adversarial attacks have put a severe threat to DNN-based object detectors. To enhance security, a combination of visible and infrared sensors is deployed in various scenarios, which has proven effective in disabling existing single-modal physical attacks. To further demonstrate the potential risks in such cases, we design a unified adversarial patch that can perform cross-modal physical attacks, achieving evasion in both modalities simultaneously with a single patch. Given the different imaging mechanisms of visible and infrared sensors, our work manipulates patches' shape features, which can be captured in different modalities when they undergo changes. To deal with challenges, we propose a novel boundary-limited shape optimization approach that aims to achieve compact and smooth shapes for the adversarial patch, making it easy to implement in the physical world. And a score-aware iterative evaluation method is also introduced to balance the fooling degree between visible and infrared detectors during optimization, which guides the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. Furthermore, we propose an Affine-Transformation-based enhancement strategy that makes the learnable shape robust to various angles, thus mitigating the issue of shape deformation caused by different shooting angles in the real world. Our method is evaluated against several state-of-the-art object detectors, achieving an Attack Success Rate (ASR) of over 80%. We also demonstrate the effectiveness of our approach in physical-world scenarios under various settings, including different angles, distances, postures, and scenes for both visible and infrared sensors.
    摘要 物理对抗攻击对基于 DNN 的目标检测器构成了严重威胁。为增强安全性,许多场景中同时部署了可见光与红外传感器,这种组合已被证明能够使现有的单模态物理攻击失效。为了进一步揭示此类场景中仍然存在的风险,我们设计了一种统一的对抗补丁,可以实施跨模态物理攻击,用单个补丁同时在两种模态下实现躲避。鉴于可见光与红外传感器的成像机制不同,我们的工作对补丁的形状特征进行操控,因为形状变化能够同时被两种模态捕捉。为应对其中的挑战,我们提出一种边界受限的形状优化方法,以获得紧凑且平滑、便于在物理世界中实现的对抗补丁形状;并引入分数感知的迭代评估方法,在优化过程中平衡对可见光与红外检测器的欺骗程度,引导对抗补丁迭代地降低多模态传感器的预测分数。此外,我们还提出基于仿射变换的增强策略,使可学习的形状对不同拍摄角度保持鲁棒,从而缓解真实世界中不同拍摄角度造成的形状形变问题。我们的方法在多个最先进的目标检测器上进行了评估,攻击成功率(ASR)超过 80%;我们还在物理世界的多种设置下(包括不同角度、距离、姿态和场景)验证了该方法对可见光与红外传感器的有效性。

LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.14659
  • repo_url: https://github.com/taowangzj/lldiffusion
  • paper_authors: Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li
  • for: This paper proposes a degradation-aware learning scheme for low-light image enhancement (LLIE) using diffusion models, which effectively integrates degradation and image priors into the diffusion process to improve image enhancement.
  • methods: The proposed method includes a joint learning framework for both image generation and image enhancement to learn degradation representations, as well as a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module that takes into account both the color map and the latent degradation representations to guide the diffusion process.
  • results: Extensive experiments on public benchmarks demonstrate that the proposed LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively, with improved image enhancement and color fidelity.
    Abstract Current deep learning methods for low-light image enhancement (LLIE) typically rely on pixel-wise mapping learned from paired data. However, these methods often overlook the importance of considering degradation representations, which can lead to sub-optimal outcomes. In this paper, we address this limitation by proposing a degradation-aware learning scheme for LLIE using diffusion models, which effectively integrates degradation and image priors into the diffusion process, resulting in improved image enhancement. Our proposed degradation-aware learning scheme is based on the understanding that degradation representations play a crucial role in accurately modeling and capturing the specific degradation patterns present in low-light images. To this end, First, a joint learning framework for both image generation and image enhancement is presented to learn the degradation representations. Second, to leverage the learned degradation representations, we develop a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module. This module takes into account both the color map and the latent degradation representations to guide the diffusion process. By incorporating these conditioning factors, the proposed LLDiffusion can effectively enhance low-light images, considering both the inherent degradation patterns and the desired color fidelity. Finally, we evaluate our proposed method on several well-known benchmark datasets, including synthetic and real-world unpaired datasets. Extensive experiments on public benchmarks demonstrate that our LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively. The source code and pre-trained models are available at https://github.com/TaoWangzj/LLDiffusion.
    摘要 当前的深度学习方法 для低光照图像提升(LLIE)通常基于像素级映射学习从对应的数据集学习。然而,这些方法经常忽视了考虑降低表示,这可能导致优化结果不佳。在这篇论文中,我们解决这些限制,并提出了基于降低表示的学习方案,用于LLIE。我们的提议基于于降低表示在准确地模型和捕捉低光照图像特有的降低特征。为此,我们首先提出了一个共同学习框架,用于学习图像生成和图像提升。然后,我们开发了一种低光照扩散模型(LLDiffusion),其中包括一个有效地考虑降低表示和颜色映射的动态扩散模块。通过这种模块,我们可以更好地考虑低光照图像的降低特征,同时保持图像的颜色准确性。最后,我们对多个公共 benchmark 进行了广泛的测试,并证明了我们的 LLDiffusion 在量化和质量上都超过了当前的 LL IE 方法。我们的源代码和预训练模型可以在 GitHub 上获取。

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2307.14648
  • repo_url: None
  • paper_authors: Xin Yuan, Linjie Li, Jianfeng Wang, Zhengyuan Yang, Kevin Lin, Zicheng Liu, Lijuan Wang
  • for: 这篇论文探讨了在小波(wavelet)空间而非像素空间中使用去噪扩散概率模型(DDPM)进行视觉合成。
  • methods: 论文提出了一个新架构 SFUNet,以有效捕捉小波变换所同时表达的空间域与频率域之间的相关性。在标准的像素域去噪 U-Net 中,我们将 2D 卷积和仅含空间注意力的层,替换为我们设计的空间-频率感知卷积与注意力模块,以同时建模小波数据中空间域与频率域的互补信息。
  • results: 实验表明,使用 SFUNet 架构可以在 CIFAR-10、FFHQ、LSUN-Bedroom 和 LSUN-Church 数据集上生成比基于像素空间的网络更高质量的图像。
    Abstract In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis. Considering the wavelet transform represents the image in spatial and frequency domains, we carefully design a novel architecture SFUNet to effectively capture the correlation for both domains. Specifically, in the standard denoising U-Net for pixel data, we supplement the 2D convolutions and spatial-only attention layers with our spatial frequency-aware convolution and attention modules to jointly model the complementary information from spatial and frequency domains in wavelet data. Our new architecture can be used as a drop-in replacement to the pixel-based network and is compatible with the vanilla DDPM training process. By explicitly modeling the wavelet signals, we find our model is able to generate images with higher quality on CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church datasets, than the pixel-based counterpart.
    摘要 在这篇论文中,我们研究了在小波空间而非像素空间中使用去噪扩散概率模型(DDPM)进行视觉合成。小波变换可以同时在空间域和频率域表示图像;为了有效地捕捉这两个域之间的相关性,我们设计了一个新的架构 SFUNet。在标准的像素数据去噪 U-Net 中,我们将 2D 卷积和仅含空间注意力的层,替换为我们设计的空间-频率感知卷积与注意力模块,以同时建模小波数据中空间域与频率域的互补信息。我们的新架构可以直接替换基于像素的网络,并且与普通的 DDPM 训练过程兼容。通过显式地建模小波信号,我们发现模型在 CIFAR-10、FFHQ、LSUN-Bedroom 和 LSUN-Church 数据集上能够生成比基于像素的网络更高质量的图像。
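
The wavelet-space representation that SFUNet's diffusion operates on comes from a standard 2D discrete wavelet transform; a minimal PyWavelets sketch of decomposing an image into its approximation and detail subbands (and inverting the transform) is below. The Haar wavelet and single decomposition level are illustrative assumptions.

```python
import numpy as np
import pywt

# Toy "image": a smooth gradient plus a sharp edge, 64x64.
img = np.fromfunction(lambda y, x: x / 64.0, (64, 64))
img[:, 32:] += 1.0

# Single-level 2D DWT: LL carries coarse structure, (LH, HL, HH) carry
# horizontal / vertical / diagonal high-frequency detail.
LL, (LH, HL, HH) = pywt.dwt2(img, 'haar')
print(LL.shape, LH.shape)            # each subband is 32x32

# A wavelet-space generative model would denoise these four subbands
# jointly; the inverse transform maps them back to pixel space.
recon = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
print(np.allclose(recon, img))
```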

EqGAN: Feature Equalization Fusion for Few-shot Image Generation

  • paper_url: http://arxiv.org/abs/2307.14638
  • repo_url: None
  • paper_authors: Yingbo Zhou, Zhihao Yue, Yutong Ye, Pengyu Zhang, Xian Wei, Mingsong Chen
  • for: 提升少样本图像生成的质量与多样性,解决现有融合策略因缺乏精细结构和纹理信息而导致的问题。
  • methods: 提出了一种新的特征均衡融合生成对抗网络(EqGAN),通过将编码特征解耦为浅层与深层内容来分别融合结构与纹理,并在不同尺度上均衡融合后的结构与纹理语义。
  • results: 在三个公开数据集上的大量实验表明,EqGAN 显著提升了生成性能(FID 最高提升 32.7%,LPIPS 最高提升 4.19%),并在下游分类任务中将准确率最高提升 1.97%,优于现有最先进方法。
    Abstract Due to the absence of fine structure and texture information, existing fusion-based few-shot image generation methods suffer from unsatisfactory generation quality and diversity. To address this problem, we propose a novel feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot image generation. Unlike existing fusion strategies that rely on either deep features or local representations, we design two separate branches to fuse structures and textures by disentangling encoded features into shallow and deep contents. To refine image contents at all feature levels, we equalize the fused structure and texture semantics at different scales and supplement the decoder with richer information by skip connections. Since the fused structures and textures may be inconsistent with each other, we devise a consistent equalization loss between the equalized features and the intermediate output of the decoder to further align the semantics. Comprehensive experiments on three public datasets demonstrate that, EqGAN not only significantly improves generation performance with FID score (by up to 32.7%) and LPIPS score (by up to 4.19%), but also outperforms the state-of-the-arts in terms of accuracy (by up to 1.97%) for downstream classification tasks.
    摘要 由于缺乏精细结构和纹理信息,现有基于融合的少样本图像生成方法在生成质量和多样性上并不理想。为此,我们提出一种新的特征均衡融合生成对抗网络(EqGAN)用于少样本图像生成。与依赖深层特征或局部表示的现有融合策略不同,我们将编码特征解耦为浅层与深层内容,并设计两个独立分支分别融合结构与纹理。为了在所有特征层级上精化图像内容,我们在不同尺度上对融合后的结构与纹理语义进行均衡,并通过跳跃连接为解码器补充更丰富的信息。由于融合后的结构与纹理可能彼此不一致,我们在均衡特征与解码器中间输出之间设计了一致性均衡损失,以进一步对齐语义。在三个公开数据集上的大量实验表明,EqGAN 不仅显著提升了生成性能(FID 最高提升 32.7%,LPIPS 最高提升 4.19%),还在下游分类任务的准确率上(最高提升 1.97%)优于现有最先进方法。

HTNet for micro-expression recognition

  • paper_url: http://arxiv.org/abs/2307.14637
  • repo_url: https://github.com/wangzhifengharrison/htnet
  • paper_authors: Zhifeng Wang, Kaihao Zhang, Wenhan Luo, Ramesh Sankaranarayana
  • for: 本研究旨在提高微表情识别算法的性能,特别是识别微小的脸部表达。
  • methods: 本文提出了一种 Hierarchical Transformer Network (HTNet),包括两个主要组成部分:一个 transformer 层和一个汇集层。transformer 层使用本地时间特征来表征本地小肌动作,而汇集层则用于学习脸部的全局semantic特征和本地相互作用。
  • results: 实验结果显示,提出的方法在四个公共可用的微表情数据集上比先前方法有较大幅度的提升。codes和模型可以在以下链接中获取:\url{https://github.com/wangzhifengharrison/HTNet}
    Abstract Facial expression is related to facial muscle contractions and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks.Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages the local temporal features and an aggregation layer that extracts local and global semantical facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer is used to focus on representing local minor muscle movement with local self-attention in each area. The aggregation layer is used to learn the interactions between eye areas and lip areas. The experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: \url{https://github.com/wangzhifengharrison/HTNet}
    摘要 Facial expression和facial muscle contractions有着密切的关系,不同的muscle movement对应于不同的情感状态。在微表达识别 tasks中,muscle movements通常是微妙的,这会影响现有的面部情感识别算法的性能。大多数现有方法使用自注意机制来捕捉序列中的关系,但是它们不考虑面部特征的自然空间关系。这会导致微表达识别任务的性能下降。因此,Recognize facial muscle movements是微表达识别领域中的关键挑战。在这篇论文中,我们提出了层次 transformer network(HTNet)来标识面部muscle movement的关键区域。HTNet包括两个主要组成部分:transformer层和聚合层。transformer层通过利用面部的局部时间特征来强调本地少股运动的表示。聚合层则通过学习眼部和唇部之间的交互来学习面部的全局semantic特征。特别是,HTNet将面部分为四个不同的区域:左眼区、左唇区、右眼区和右唇区。transformer层在每个区域中进行本地自注意,以强调本地少股运动的表示。聚合层则通过学习眼部和唇部之间的交互来学习面部的全局semantic特征。我们在四个公共可用的微表达数据集上进行了实验,结果显示,我们的方法与之前的方法相比,性能有大幅提高。代码和模型可以在以下链接中找到:\url{https://github.com/wangzhifengharrison/HTNet}
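
HTNet applies self-attention locally within each of four facial regions before aggregating; a minimal PyTorch sketch of that per-region attention step is below. Token counts, embedding size, and how the regions are cropped are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RegionLocalAttention(nn.Module):
    """Apply self-attention independently inside each facial region
    (e.g., left/right eye and left/right lip areas)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_tokens):
        # region_tokens: list of (B, N_r, dim) tensors, one per region.
        out = []
        for tokens in region_tokens:
            x = self.norm(tokens)
            attended, _ = self.attn(x, x, x)   # local attention within region
            out.append(tokens + attended)      # residual connection
        return out

# Toy usage: 4 facial regions, 16 tokens each, batch of 2.
regions = [torch.randn(2, 16, 64) for _ in range(4)]
module = RegionLocalAttention()
print([r.shape for r in module(regions)])
```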

360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking

  • paper_url: http://arxiv.org/abs/2307.14630
  • repo_url: https://github.com/huajianup/360vot
  • paper_authors: Huajian Huang, Yinzhe Xu, Yingshu Chen, Sai-Kit Yeung
  • for: 该论文探讨了使用360度图像进行视觉对象跟踪,并描述了在360度图像中存在的新挑战,如大幅度扭曲和缝合 artifacts。
  • methods: 该论文提出了一种基于目标局部化的全面跟踪框架,并使用了新的表述方式,如 bounding field-of-view,以减少这些问题。
  • results: 该论文提供了一个大规模的全面跟踪 benchmark dataset, named 360VOT,包含120个序列和113K帧高分辨率图像,以及4种不偏的地面真实数据。此外,论文还提供了适用于360度图像的新的评价指标,并对20种现有的视觉跟踪器进行了广泛的评估。
    Abstract 360{\deg} images can provide an omnidirectional field of view which is important for stable and long-term scene perception. In this paper, we explore 360{\deg} images for visual object tracking and perceive new challenges caused by large distortion, stitching artifacts, and other unique attributes of 360{\deg} images. To alleviate these problems, we take advantage of novel representations of target localization, i.e., bounding field-of-view, and then introduce a general 360 tracking framework that can adopt typical trackers for omnidirectional tracking. More importantly, we propose a new large-scale omnidirectional tracking benchmark dataset, 360VOT, in order to facilitate future research. 360VOT contains 120 sequences with up to 113K high-resolution frames in equirectangular projection. The tracking targets cover 32 categories in diverse scenarios. Moreover, we provide 4 types of unbiased ground truth, including (rotated) bounding boxes and (rotated) bounding field-of-views, as well as new metrics tailored for 360{\deg} images which allow for the accurate evaluation of omnidirectional tracking performance. Finally, we extensively evaluated 20 state-of-the-art visual trackers and provided a new baseline for future comparisons. Homepage: https://360vot.hkustvgd.com
    摘要 三百六十度图像可提供全球视野,这对于稳定和长期场景识别非常重要。在这篇论文中,我们探讨了三百六十度图像在视觉对象跟踪中的挑战,包括大量扭曲、缝合 artifacts 和其他特有的三百六十度图像特征。为了解决这些问题,我们利用新的目标定位表示方法,即 bounding field-of-view,然后提出了一个通用的三百六十度跟踪框架,可以采用传统的全息跟踪器。更重要的是,我们提出了一个新的大规模全息跟踪数据集,360VOT,以便未来的研究。360VOT包含120个序列,最多113万高分辨率帧,使用 equirectangular projection。跟踪目标包括32个类别,在多样化的场景中。此外,我们提供了4种无偏见的地面真实,包括旋转的 bounding box 和旋转的 bounding field-of-view,以及适应三百六十度图像的新评价指标。最后,我们对20种当前最佳视觉跟踪器进行了广泛的评估,并提供了新的基线 для未来的比较。网站地址:https://360vot.hkustvgd.com
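
The bounding field-of-view used by 360VOT is defined on the sphere rather than the image plane; the sketch below shows the basic coordinate step such a representation rests on, mapping equirectangular pixel coordinates to longitude/latitude. The exact BFoV parameterisation in the benchmark may differ.

```python
import numpy as np

def pixel_to_lonlat(x, y, width, height):
    """Map equirectangular pixel coords (x right, y down) to spherical
    longitude in [-180, 180] and latitude in [-90, 90] degrees."""
    lon = (x / width - 0.5) * 360.0
    lat = (0.5 - y / height) * 180.0
    return lon, lat

# Toy usage on an 8K equirectangular frame (7680x3840):
W, H = 7680, 3840
print(pixel_to_lonlat(W / 2, H / 2, W, H))   # image centre -> (0, 0)
print(pixel_to_lonlat(0, 0, W, H))           # top-left -> (-180, 90)
```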

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

  • paper_url: http://arxiv.org/abs/2307.14624
  • repo_url: None
  • paper_authors: Chengrui Wei, Meng Yang, Lei He, Nanning Zheng
  • for: 本文是为了预测真实的indoor scene中单个图像的绝对深度图而写的。
  • methods: 本文提出了一种 focal-and-scale 深度估算模型,它将relative depth estimation网络和绝对深度估算网络结合在一起,并通过将单个 focal length value 映射到 focal length features 并与不同尺度的 intermediate features concatenate 而生成多个尺度的特征。
  • results: 本文发现,提出的模型可以对不同的indoor scene中单个图像进行绝对深度估算,并且可以提高generalization能力by 41%/13% (RMSE) compared with five recent SOTAs,同时也可以解决3D重建中的扭曲问题,而不会影响原始 NYUDv2 中的精度。
    Abstract It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes. We observe that it is essentially due to not only the scale-ambiguous problem but also the focal-ambiguous problem that decreases the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to well learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales/semantics. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, a new pipeline is developed to augment the diversity of focal lengths of public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. Our model considerably improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation compared with five recent SOTAs and well alleviates the deformation problem in 3D reconstruction. Notably, our model well maintains the accuracy of depth estimation on original NYUDv2.
    摘要 在真实(未见过的)室内场景中,从单张图像预测绝对深度图长期以来都是一个病态问题。我们观察到,这不仅源于尺度歧义问题,还源于焦距歧义问题,二者共同削弱了单目深度估计的泛化能力:图像可能由不同焦距的相机在不同尺度的场景中拍摄。本文提出一种焦距-尺度深度估计模型,以便在未见过的室内场景中从单张图像学习绝对深度图。首先,采用相对深度估计网络,从具有不同尺度/语义的单张图像中学习相对深度;其次,将单个焦距值映射为焦距特征,并与相对深度估计中不同尺度的中间特征拼接,生成多尺度特征;最后,将相对深度与多尺度特征共同输入绝对深度估计网络。此外,我们还构建了新的数据处理流程,以扩充公开数据集中焦距的多样性(这些数据集通常由焦距相同或相近的相机拍摄)。模型在增强后的 NYUDv2 上训练,并在三个未见过的数据集上测试。与五个最新的 SOTA 方法相比,在使用/不使用数据增强的情况下,我们的模型将深度估计的泛化能力(RMSE)分别提升了 41%/13%,并显著缓解了三维重建中的形变问题;同时,模型在原始 NYUDv2 上仍保持了深度估计精度。

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.14620
  • repo_url: https://github.com/facebookresearch/nerf-det
  • paper_authors: Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
  • for: 本研究旨在提出一种基于NeRF的indoor三维检测方法,使用posedRGB图像作为输入,并且能够提高三维检测性能。
  • methods: 本方法使用NeRF进行三维场景的直接估计,并且引入了足够的geometry约束来提高NeRF-MLP的通用性。此外,我们在检测和NeRF分支之间引入了共享的MLP层,使得NeRF能够快速适应检测任务,并生成具有geometry感知的体积表示。
  • results: 根据ScanNet和ARKITScenes测试集,我们的方法在对比之下,比前一代方法提高了3.9 mAP和3.1 mAP。我们还提供了广泛的分析,以解释NeRF-Det如何工作。由于我们的共同训练设计,NeRF-Det能够在未看过场景时进行准确的检测、视图合成和深度估计任务,无需每个场景进行优化。代码可以在\url{https://github.com/facebookresearch/NeRF-Det}上获取。
    Abstract We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-arts by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det is able to generalize well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at \url{https://github.com/facebookresearch/NeRF-Det}.
    摘要 我们提出了 NeRF-Det 方法,用于以 posed RGB 图像为输入的室内三维检测。与现有的室内三维检测方法不同,我们的方法以端到端的方式使用 NeRF 来显式地估计场景几何,从而改善三维检测性能。具体来说,为了避免逐场景优化 NeRF 带来的显著额外延迟,我们引入了充分的几何先验,以提高 NeRF-MLP 的泛化能力。此外,我们在检测和 NeRF 分支之间设置了共享的 MLP,使 NeRF 能够高效地适应检测任务,并生成具有几何感知能力的体积表示用于三维检测。我们的方法在 ScanNet 和 ARKITScenes 基准上分别比此前最佳方法高出 3.9 mAP 和 3.1 mAP。我们还提供了大量分析,以解释 NeRF-Det 的工作机制。得益于联合训练设计,NeRF-Det 无需逐场景优化即可很好地泛化到未见场景,完成目标检测、视图合成和深度估计任务。代码可以在 \url{https://github.com/facebookresearch/NeRF-Det} 上获取。

Multiscale Dynamic Graph Representation for Biometric Recognition with Occlusions

  • paper_url: http://arxiv.org/abs/2307.14617
  • repo_url: https://github.com/renmin1991/dyamic-graph-representation
  • paper_authors: Min Ren, Yunlong Wang, Yuhao Zhu, Kunbo Zhang, Zhenan Sun
  • for: 提高生物认证中的遮挡问题解决方法
  • methods: 提议一种 integrate CNN 和图模型的统一框架,通过动态图匹配和多尺度策略来抑制遮挡部分
  • results: 对比基eline方法,提出的方法在自然和遮挡 simulate 两种情况下都显示出了明显的提高,具体来说是boosting 识别率。
    Abstract Occlusion is a common problem with biometric recognition in the wild. The generalization ability of CNNs greatly decreases due to the adverse effects of various occlusions. To this end, we propose a novel unified framework integrating the merits of both CNNs and graph models to overcome occlusion problems in biometric recognition, called multiscale dynamic graph representation (MS-DGR). More specifically, a group of deep features reflected on certain subregions is recrafted into a feature graph (FG). Each node inside the FG is deemed to characterize a specific local region of the input sample, and the edges imply the co-occurrence of non-occluded regions. By analyzing the similarities of the node representations and measuring the topological structures stored in the adjacent matrix, the proposed framework leverages dynamic graph matching to judiciously discard the nodes corresponding to the occluded parts. The multiscale strategy is further incorporated to attain more diverse nodes representing regions of various sizes. Furthermore, the proposed framework exhibits a more illustrative and reasonable inference by showing the paired nodes. Extensive experiments demonstrate the superiority of the proposed framework, which boosts the accuracy in both natural and occlusion-simulated cases by a large margin compared with that of baseline methods.
    摘要 遮挡是真实场景下生物特征识别中的常见问题。在各种遮挡的不利影响下,CNN 的泛化能力会大幅下降。为解决这个问题,我们提出了一种融合 CNN 与图模型优点的统一框架,即多尺度动态图表示(MS-DGR)。具体来说,对应于若干子区域的一组深度特征被重构为一个特征图(FG):FG 中的每个节点刻画输入样本的一个特定局部区域,边则表示未被遮挡区域之间的共现关系。通过分析节点表示的相似性并度量邻接矩阵中存储的拓扑结构,该框架利用动态图匹配来审慎地舍弃与遮挡部分对应的节点。我们进一步引入多尺度策略,以获得表示不同大小区域的更多样的节点。此外,该框架通过展示成对匹配的节点,提供了更直观、合理的推理过程。大量实验表明,在自然遮挡和模拟遮挡两种情况下,所提方法的识别率都比基线方法有大幅提升。

GenCo: An Auxiliary Generator from Contrastive Learning for Enhanced Few-Shot Learning in Remote Sensing

  • paper_url: http://arxiv.org/abs/2307.14612
  • repo_url: None
  • paper_authors: Jing Wu, Naira Hovakimyan, Jennifer Hobbs
  • for: 提升遥感与对地观测中分类和 semantic segmentation 任务的少样本学习性能。
  • methods: 利用基于生成器的对比学习框架(GenCo)进行预训练,同时探索特征样本的变体;微调时用辅助生成器在特征空间中丰富有限的标注样本。
  • results: 在 Agriculture-Vision 和 EuroSAT 两个重要的遥感数据集上,我们的方法优于纯监督训练,并在分类与 semantic segmentation 任务上取得了更好的效果。
    Abstract Classifying and segmenting patterns from a limited number of examples is a significant challenge in remote sensing and earth observation due to the difficulty in acquiring accurately labeled data in large quantities. Previous studies have shown that meta-learning, which involves episodic training on query and support sets, is a promising approach. However, there has been little attention paid to direct fine-tuning techniques. This paper repurposes contrastive learning as a pre-training method for few-shot learning for classification and semantic segmentation tasks. Specifically, we introduce a generator-based contrastive learning framework (GenCo) that pre-trains backbones and simultaneously explores variants of feature samples. In fine-tuning, the auxiliary generator can be used to enrich limited labeled data samples in feature space. We demonstrate the effectiveness of our method in improving few-shot learning performance on two key remote sensing datasets: Agriculture-Vision and EuroSAT. Empirically, our approach outperforms purely supervised training on the nearly 95,000 images in Agriculture-Vision for both classification and semantic segmentation tasks. Similarly, the proposed few-shot method achieves better results on the land-cover classification task on EuroSAT compared to the results obtained from fully supervised model training on the dataset.
    摘要 仅凭有限数量的样本对模式进行分类和分割,是遥感与对地观测中的一大挑战,因为大规模获取精确标注数据十分困难。已有研究表明,基于查询集与支持集进行 episodic 训练的元学习是一种有前景的方法,但直接微调技术却很少受到关注。本文将对比学习重新用作预训练方法,以提升分类与 semantic segmentation 任务中的少样本学习性能。具体而言,我们引入基于生成器的对比学习框架(GenCo),在预训练骨干网络的同时探索特征样本的变体;在微调阶段,可利用辅助生成器在特征空间中丰富有限的标注样本。实验表明,我们的方法在 Agriculture-Vision 和 EuroSAT 两个重要遥感数据集上均提升了少样本学习性能:在 Agriculture-Vision 近 95,000 张图像上的分类与 semantic segmentation 任务上优于纯监督训练;在 EuroSAT 的地表覆盖分类任务上也优于在该数据集上完全监督训练得到的模型。

TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

  • paper_url: http://arxiv.org/abs/2307.14611
  • repo_url: None
  • paper_authors: Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh
  • for: 提高欠发达数据集中的泛化能力,即使面临着类别分布偏斜问题。
  • methods: 提出了一种基于文本的替换增强方法,通过使用可读性强的视觉感知词(attributes)来增强视觉特征空间的Semantic层次。
  • results: 实验表明,TextManiA可以在缺乏样本的情况下具有很高的泛化能力,并且可以与标准混合方法相结合使用,以提高泛化能力。
    Abstract Recent label mix-based augmentation methods have shown their effectiveness in generalization despite their simplicity, and their favorable effects are often attributed to semantic-level augmentation. However, we found that they are vulnerable to highly skewed class distribution, because scarce data classes are rarely sampled for inter-class perturbation. We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of data distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words, i.e., attributes. To this end, we bridge between the text representation and a target visual feature space, and propose an efficient vector augmentation. To empirically support the validity of our design, we devise two visualization-based analyses and show the plausibility of the bridge between two different modality spaces. Our experiments demonstrate that TextManiA is powerful in scarce samples with class imbalance as well as even distribution. We also show compatibility with the label mix-based approaches in evenly distributed scarce data.
    摘要 (Simplified Chinese translation)近期的标签混合基于扩展方法已经表现出了一致性,即使它们简单,它们的良好效果通常归结于semantic-level的扩展。然而,我们发现它们对高度不均衡的类分布非常敏感,因为罕见的数据类很少被采样为间类扰动。我们提出了TextManiA,一种基于文本的扩展方法,它可以在不同的数据分布下增强视觉特征空间的 semantic。TextManiA通过利用易于理解的视觉效果词,例如特征,进行内类semantic扰动。为了实现这一点,我们将文本表示与目标视觉特征空间之间建立了一个桥梁,并提出了一种效率的向量扩展。为了证明我们的设计的有效性,我们设计了两种视觉基于分析和展示了两个不同的模态空间之间的桥梁的可能性。我们的实验表明,TextManiA在罕见样本中的类均衡和均衡分布中都具有强大的泛化能力。我们还证明了TextManiA与标签混合基于方法在均衡分布中的罕见样本中兼容。
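
TextManiA perturbs visual features along directions derived from attribute text embeddings; a generic sketch of that vector augmentation is below, assuming the text embeddings have already been produced by some encoder and live in (or have been projected into) the same space as the visual feature. The encoder, projection, and mixing coefficient are all assumptions, not the paper's exact recipe.

```python
import numpy as np

def attribute_augment(visual_feat: np.ndarray,
                      base_text: np.ndarray,
                      attr_text: np.ndarray,
                      alpha: float = 0.3) -> np.ndarray:
    """Shift a visual feature along the difference between an attribute
    phrase embedding (e.g. 'a photo of a red car') and the plain class
    embedding ('a photo of a car'), both assumed to be in the same space
    as the visual feature."""
    direction = attr_text - base_text
    direction /= (np.linalg.norm(direction) + 1e-8)
    return visual_feat + alpha * np.linalg.norm(visual_feat) * direction

# Toy usage with random stand-ins for encoder outputs (dim 512).
rng = np.random.default_rng(0)
v = rng.normal(size=512)
t_base, t_attr = rng.normal(size=512), rng.normal(size=512)
print(np.linalg.norm(attribute_augment(v, t_base, t_attr) - v))
```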

A Weakly Supervised Segmentation Network Embedding Cross-scale Attention Guidance and Noise-sensitive Constraint for Detecting Tertiary Lymphoid Structures of Pancreatic Tumors

  • paper_url: http://arxiv.org/abs/2307.14603
  • repo_url: None
  • paper_authors: Bingxue Wang, Liwen Zou, Jun Chen, Yingying Cao, Zhenghua Cai, Yudong Qiu, Liang Mao, Zhongqiu Wang, Jingya Chen, Luying Gui, Xiaoping Yang
  • for: 检测胰腺病理图像中的三级淋巴结构(TLS),以辅助胰腺肿瘤患者的诊断和治疗。
  • methods: 我们提出一种少样本的弱监督分割方法,结合预训练的细胞核分割模型与域对抗网络来识别淋巴细胞核,得到淋巴细胞密度图;随后建立跨尺度注意力引导机制,联合学习原始组织病理图像的粗尺度特征与我们设计的淋巴细胞密度注意力,以提高分割精度。
  • results: 在两个自采数据集上的实验表明,与现有基于深度学习的分割算法相比,我们的方法具有更高的 TLS 检测精度。此外,我们还应用该方法研究了 TLS 密度与胰周血管侵犯之间的关系,得到了一些有临床意义的统计结果。
    Abstract The presence of tertiary lymphoid structures (TLSs) on pancreatic pathological images is an important prognostic indicator of pancreatic tumors. Therefore, TLSs detection on pancreatic pathological images plays a crucial role in diagnosis and treatment for patients with pancreatic tumors. However, fully supervised detection algorithms based on deep learning usually require a large number of manual annotations, which is time-consuming and labor-intensive. In this paper, we aim to detect the TLSs in a manner of few-shot learning by proposing a weakly supervised segmentation network. We firstly obtain the lymphocyte density maps by combining a pretrained model for nuclei segmentation and a domain adversarial network for lymphocyte nuclei recognition. Then, we establish a cross-scale attention guidance mechanism by jointly learning the coarse-scale features from the original histopathology images and fine-scale features from our designed lymphocyte density attention. A noise-sensitive constraint is introduced by an embedding signed distance function loss in the training procedure to reduce tiny prediction errors. Experimental results on two collected datasets demonstrate that our proposed method significantly outperforms the state-of-the-art segmentation-based algorithms in terms of TLSs detection accuracy. Additionally, we apply our method to study the congruent relationship between the density of TLSs and peripancreatic vascular invasion and obtain some clinically statistical results.
    摘要 胰腺病理图像中三级淋巴结构(TLS)的存在是胰腺肿瘤的重要预后指标,因此在病理图像上检测 TLS 对胰腺肿瘤患者的诊断和治疗具有关键作用。然而,基于深度学习的全监督检测算法通常需要大量人工标注,费时费力。本文提出一种弱监督分割网络,以少样本学习的方式检测 TLS。我们首先结合预训练的细胞核分割模型和域对抗网络识别淋巴细胞核,得到淋巴细胞密度图;随后建立跨尺度注意力引导机制,联合学习原始组织病理图像的粗尺度特征与我们设计的淋巴细胞密度注意力的细尺度特征;训练过程中还通过嵌入符号距离函数损失引入噪声敏感约束,以减少微小的预测误差。在两个自采数据集上的实验表明,所提方法的 TLS 检测精度显著优于现有基于深度学习的分割算法。此外,我们应用该方法研究了 TLS 密度与胰周血管侵犯之间的关系,得到了一些有临床意义的统计结果。
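
The noise-sensitive constraint above is built on a signed-distance-function loss; a minimal sketch of computing signed distance maps from binary masks with scipy and penalising their mismatch is below (the L1 penalty is an illustrative assumption, not necessarily the paper's exact formulation).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask: np.ndarray) -> np.ndarray:
    """Signed distance map of a binary mask: negative inside the object,
    positive outside, near zero on the boundary."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return outside - inside

def sdf_loss(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    # L1 mismatch between the two signed distance maps.
    return float(np.abs(signed_distance(pred_mask) -
                        signed_distance(gt_mask)).mean())

# Toy usage: ground-truth disc vs. a slightly shifted prediction.
yy, xx = np.mgrid[:64, :64]
gt = ((yy - 32) ** 2 + (xx - 32) ** 2 < 100).astype(np.uint8)
pred = ((yy - 30) ** 2 + (xx - 34) ** 2 < 100).astype(np.uint8)
print(sdf_loss(pred, gt))
```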

FakeTracer: Proactively Defending Against Face-swap DeepFakes via Implanting Traces in Training

  • paper_url: http://arxiv.org/abs/2307.14593
  • repo_url: None
  • paper_authors: Pu Sun, Honggang Qi, Yuezun Li, Siwei Lyu
  • for: 防止 DeepFake 技术的违用,保护个人隐私
  • methods: 植入训练过程中的特征 traces,使 DeepFake 模型学习含有可持续和可除去的特征
  • results: 对 Celeb-DF 数据集进行了广泛的实验,证明了我们的方法可以有效地防止 face-swap DeepFake
    Abstract Face-swap DeepFake is an emerging AI-based face forgery technique that can replace the original face in a video with a generated face of the target identity while retaining consistent facial attributes such as expression and orientation. Due to the high privacy of faces, the misuse of this technique can raise severe social concerns, drawing tremendous attention to defend against DeepFakes recently. In this paper, we describe a new proactive defense method called FakeTracer to expose face-swap DeepFakes via implanting traces in training. Compared to general face-synthesis DeepFake, the face-swap DeepFake is more complex as it involves identity change, is subjected to the encoding-decoding process, and is trained unsupervised, increasing the difficulty of implanting traces into the training phase. To effectively defend against face-swap DeepFake, we design two types of traces, sustainable trace (STrace) and erasable trace (ETrace), to be added to training faces. During the training, these manipulated faces affect the learning of the face-swap DeepFake model, enabling it to generate faces that only contain sustainable traces. In light of these two traces, our method can effectively expose DeepFakes by identifying them. Extensive experiments are conducted on the Celeb-DF dataset, compared with recent passive and proactive defense methods, and are studied thoroughly regarding various factors, corroborating the efficacy of our method on defending against face-swap DeepFake.
    摘要 “深圳技术:Face-swap DeepFake是一种新兴的人工智能技术,可以将原始影片中的面部替换为目标身份的生成面部,保留面部特征如表情和方向。由于面部隐私问题,这种技术可能会导致严重的社会影响,因此近期引起了广泛关注防范 DeepFakes。在这篇论文中,我们描述了一种新的积极防范方法,称为FakeTracer,可以透过将 traces 添加到训练过程中,将 face-swap DeepFake 曝光。相比于一般的面部合成 DeepFake,face-swap DeepFake 更加复杂,因为它涉及到身份变更、编码-解码过程和无监督训练,这使得在训练过程中将 traces 添加到面部上更加困难。为了有效防范 face-swap DeepFake,我们设计了两种 traces,可持续 trace (STrace) 和可清除 trace (ETrace),并将它们添加到训练面部。在训练过程中,这些改进的面部对 face-swap DeepFake 模型的学习有很大影响,使其能够生成包含可持续 traces 的面部。这些两种 traces 使我们的方法能够有效地曝光 DeepFakes,通过识别它们。我们在 Celeb-DF 数据集上进行了广泛的实验,与最近的预防和反击方法进行比较,并且对不同的因素进行了深入的研究,证明了我们的方法在防范 face-swap DeepFake 方面的效果。”

MCPA: Multi-scale Cross Perceptron Attention Network for 2D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.14588
  • repo_url: https://github.com/simonustc/mcpa-for-2d-medical-image-segmentation
  • paper_authors: Liang Xu, Mingxiao Chen, Yi Cheng, Pengfei Shao, Shuwei Shen, Peng Yao, Ronald X. Xu
  • for: 这个研究旨在提出一个基于Convolutional Neural Networks (CNN)的双重网络模型,以提高医疗影像分类 task的性能。
  • methods: 提出 Multi-scale Cross Perceptron Attention Network (MCPA),由编码器、解码器与 Cross Perceptron 组成;利用多尺度 Cross Perceptron 模块捕捉局部相关性并融合跨尺度特征,再经 Global Perceptron 模块建模全局依赖,并引入渐进式双分支结构以处理更精细的组织结构。
  • results: 实验结果显示,MCPA 模型在多个公开医疗影像数据集(Synapse、ACDC、DRIVE、CHASE_DB1、HRF、ROSE)上达到最先进的性能。代码可在 https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation 获取。
    Abstract The UNet architecture, based on Convolutional Neural Networks (CNN), has demonstrated its remarkable performance in medical image analysis. However, it faces challenges in capturing long-range dependencies due to the limited receptive fields and inherent bias of convolutional operations. Recently, numerous transformer-based techniques have been incorporated into the UNet architecture to overcome this limitation by effectively capturing global feature correlations. However, the integration of the Transformer modules may result in the loss of local contextual information during the global feature fusion process. To overcome these challenges, we propose a 2D medical image segmentation model called Multi-scale Cross Perceptron Attention Network (MCPA). The MCPA consists of three main components: an encoder, a decoder, and a Cross Perceptron. The Cross Perceptron first captures the local correlations using multiple Multi-scale Cross Perceptron modules, facilitating the fusion of features across scales. The resulting multi-scale feature vectors are then spatially unfolded, concatenated, and fed through a Global Perceptron module to model global dependencies. Furthermore, we introduce a Progressive Dual-branch Structure to address the semantic segmentation of the image involving finer tissue structures. This structure gradually shifts the segmentation focus of MCPA network training from large-scale structural features to more sophisticated pixel-level features. We evaluate our proposed MCPA model on several publicly available medical image datasets from different tasks and devices, including the open large-scale dataset of CT (Synapse), MRI (ACDC), fundus camera (DRIVE, CHASE_DB1, HRF), and OCTA (ROSE). The experimental results show that our MCPA model achieves state-of-the-art performance. The code is available at https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation.
    摘要 基于卷积神经网络(CNN)的 UNet 架构在医学图像分析中表现出色,但由于卷积操作的感受野有限且带有固有偏置,它在捕捉长距离依赖方面面临挑战。近来,许多基于 Transformer 的技术被引入 UNet 架构,以有效捕捉全局特征相关性,但在全局特征融合过程中,Transformer 模块的引入可能导致局部上下文信息的丢失。为解决这些问题,我们提出了一种二维医学图像分割模型 Multi-scale Cross Perceptron Attention Network(MCPA)。MCPA 由三个主要部分组成:编码器、解码器和 Cross Perceptron。Cross Perceptron 首先利用多个多尺度 Cross Perceptron 模块捕捉局部相关性,实现跨尺度特征融合;得到的多尺度特征向量随后在空间上展开、拼接,并送入 Global Perceptron 模块以建模全局依赖。此外,我们引入渐进式双分支结构,用于处理包含更精细组织结构的语义分割,使 MCPA 网络训练的分割重点逐步从大尺度结构特征转向更精细的像素级特征。我们在来自不同任务和设备的多个公开医学图像数据集上评估了 MCPA 模型,包括 CT(Synapse)、MRI(ACDC)、眼底相机(DRIVE、CHASE_DB1、HRF)和 OCTA(ROSE)。实验结果表明,MCPA 模型达到了最先进的性能。代码可在 https://github.com/simonustc/MCPA-for-2D-Medical-Image-Segmentation 获取。
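
A minimal sketch, with assumed shapes and module names, of how a fine-scale feature map can cross-attend to coarse-scale features; it illustrates the cross-scale fusion idea rather than the released MCPA implementation.

```python
import torch
import torch.nn as nn


class CrossScaleFusion(nn.Module):
    """Fine-resolution tokens (queries) attend to coarse-resolution tokens
    (keys/values), mixing local detail with global context."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, C, H, W)  high-resolution features
        # coarse: (B, C, h, w)  low-resolution features
        b, c, h, w = fine.shape
        q = fine.flatten(2).transpose(1, 2)       # (B, H*W, C)
        kv = coarse.flatten(2).transpose(1, 2)    # (B, h*w, C)
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)              # residual connection + norm
        return fused.transpose(1, 2).reshape(b, c, h, w)


# Usage: out = CrossScaleFusion(256)(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 8, 8))
```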

Neural Representation-Based Method for Metal-induced Artifact Reduction in Dental CBCT Imaging

  • paper_url: http://arxiv.org/abs/2307.14579
  • repo_url: None
  • paper_authors: Hyoung Suk Park, Kiwan Jeon, Jin Keun Seo
  • for: 这个研究旨在提出一种新的 dental cone-beam computed tomography (CBCT) 重建方法,以减少常见的金属引起的artefacts,特别是在存在多种金属设备的情况下。
  • methods: 该研究提议使用 implicit neural network 生成两个不同的 tomographic 图像,其中一个表示特定能量层的灰度分布,另一个表示各色 X-ray 束在不同能量层的非线性强化因素。与传统的 CT 重建技术不同,该方法仅基于 Beer–Lambert 定律,从而有效避免在传统方法中通常实现的背投过程中产生金属引起的artefacts。
  • results: 广泛的实验评估表明,提议的方法可以有效减少金属引起的artefacts,并提供高质量的图像重建,从而强调第二个图像在捕捉非线性强化因素方面的重要性。
    Abstract This study introduces a novel reconstruction method for dental cone-beam computed tomography (CBCT), focusing on effectively reducing metal-induced artifacts commonly encountered in the presence of prevalent metallic implants. Despite significant progress in metal artifact reduction techniques, challenges persist owing to the intricate physical interactions between polychromatic X-ray beams and metal objects, which are further compounded by the additional effects associated with metal-tooth interactions and factors specific to the dental CBCT data environment. To overcome these limitations, we propose an implicit neural network that generates two distinct and informative tomographic images. One image represents the monochromatic attenuation distribution at a specific energy level, whereas the other captures the nonlinear beam-hardening factor resulting from the polychromatic nature of X-ray beams. In contrast to existing CT reconstruction techniques, the proposed method relies exclusively on the Beer--Lambert law, effectively preventing the generation of metal-induced artifacts during the backprojection process commonly implemented in conventional methods. Extensive experimental evaluations demonstrate that the proposed method effectively reduces metal artifacts while providing high-quality image reconstructions, thus emphasizing the significance of the second image in capturing the nonlinear beam-hardening factor.
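
A short sketch of the Beer-Lambert physics the reconstruction described above builds on, under an assumed per-ray discretization: a polychromatic measurement mixes energy-dependent attenuation, which is why a single monochromatic image plus a beam-hardening term is used.

```python
import numpy as np


def polychromatic_projection(mu_E: np.ndarray, spectrum: np.ndarray, dl: float = 1.0) -> float:
    """mu_E: (n_energies, n_voxels) attenuation along one ray per energy bin;
    spectrum: (n_energies,) normalized source weights. Returns -log(I/I0)."""
    line_integrals = mu_E.sum(axis=1) * dl                  # one integral per energy bin
    intensity = np.sum(spectrum * np.exp(-line_integrals))  # Beer-Lambert per bin, mixed by the spectrum
    return -np.log(intensity + 1e-12)


# The measured projection then decomposes into a monochromatic term at a
# reference energy plus a nonlinear beam-hardening residual; the residual is
# what the second network output is meant to capture.
```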

GADER: GAit DEtection and Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2307.14578
  • repo_url: None
  • paper_authors: Yuxiang Guo, Cheng Peng, Ram Prabhakar, Chun Pong Lau, Rama Chellappa
  • for: 人体认证,特别是在开放场景下进行人体识别和验证
  • methods: 使用Double Helical Signature检测人体运动段落,并借鉴RGB认识模型进行学习表征,以实现更高的人体识别精度
  • results: 在室内和室外数据集上进行了广泛的实验,性能超过现有最先进方法,尤其在无约束、远距离场景下识别精度提升 20.6%
    Abstract Gait recognition holds the promise of robustly identifying subjects based on their walking patterns instead of color information. While previous approaches have performed well for curated indoor scenes, they have significantly impeded applicability in unconstrained situations, e.g. outdoor, long distance scenes. We propose an end-to-end GAit DEtection and Recognition (GADER) algorithm for human authentication in challenging outdoor scenarios. Specifically, GADER leverages a Double Helical Signature to detect the fragment of human movement and incorporates a novel gait recognition method, which learns representations by distilling from an auxiliary RGB recognition model. At inference time, GADER only uses the silhouette modality but benefits from a more robust representation. Extensive experiments on indoor and outdoor datasets demonstrate that the proposed method outperforms the State-of-The-Arts for gait recognition and verification, with a significant 20.6% improvement on unconstrained, long distance scenes.
    摘要 步态识别有望基于行走模式而非颜色信息来稳健地识别个体。以往方法在受控的室内场景中表现良好,但在无约束场景(如室外、远距离场景)中的适用性明显受限。我们提出了一种端到端的步态检测与识别算法 GADER(GAit DEtection and Recognition),用于具有挑战性的室外场景中的身份认证。具体而言,GADER 利用 Double Helical Signature 检测人体运动片段,并引入一种新的步态识别方法,通过从辅助 RGB 识别模型中蒸馏来学习表示。在推理阶段,GADER 仅使用轮廓(silhouette)模态,却能受益于更稳健的表示。在室内和室外数据集上的大量实验表明,该方法在步态识别与验证任务上优于现有最先进方法,并在无约束的远距离场景中取得了 20.6% 的显著提升。

Robust Detection, Association, and Localization of Vehicle Lights: A Context-Based Cascaded CNN Approach and Evaluations

  • paper_url: http://arxiv.org/abs/2307.14571
  • repo_url: None
  • paper_authors: Akshay Gopalkrishnan, Ross Greer, Maitrayee Keskar, Mohan Trivedi
  • for: 这篇论文是用于检测和识别汽车灯光的,以便实现安全自动驾驶任务。
  • methods: 本论文使用了一种基于CNN的方法,在已有车辆检测与可见灯光中心近似的基础上,预测每个车灯的四个近似角点,并结合数据增强与上下文预处理来减少周围车辆造成的混淆。
  • results: Experiments show that the proposed method achieves an average distance error of 4.77 pixels from the ground-truth corner, about 16.33% of the size of the vehicle light on average.
    Abstract Vehicle light detection, association, and localization are required for important downstream safe autonomous driving tasks, such as predicting a vehicle's light state to determine if the vehicle is making a lane change or turning. Currently, many vehicle light detectors use single-stage detectors which predict bounding boxes to identify a vehicle light, in a manner decoupled from vehicle instances. In this paper, we present a method for detecting a vehicle light given an upstream vehicle detection and approximation of a visible light's center. Our method predicts four approximate corners associated with each vehicle light. We experiment with CNN architectures, data augmentation, and contextual preprocessing methods designed to reduce surrounding-vehicle confusion. We achieve an average distance error from the ground truth corner of 4.77 pixels, about 16.33% of the size of the vehicle light on average. We train and evaluate our model on the LISA Lights Dataset, allowing us to thoroughly evaluate our vehicle light corner detection model on a large variety of vehicle light shapes and lighting conditions. We propose that this model can be integrated into a pipeline with vehicle detection and vehicle light center detection to make a fully-formed vehicle light detection network, valuable to identifying trajectory-informative signals in driving scenes.
    摘要 车灯检测、关联与定位是下游安全自动驾驶任务(例如通过预测车灯状态来判断车辆是否变道或转弯)所必需的。目前,许多车灯检测器采用单阶段检测器,以与车辆实例解耦的方式预测车灯的边界框。本文提出一种方法:在给定上游车辆检测结果和可见车灯中心近似的条件下,预测每个车灯对应的四个近似角点。我们对 CNN 架构、数据增强和旨在减少周围车辆混淆的上下文预处理方法进行了实验,取得了与真实角点平均 4.77 像素的距离误差,约为车灯平均尺寸的 16.33%。我们在 LISA Lights 数据集上训练和评估模型,从而能够在多种车灯形状和光照条件下全面评估车灯角点检测模型。我们提出可将该模型与车辆检测和车灯中心检测组成完整的车灯检测流水线,用于识别驾驶场景中与轨迹相关的信号。
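
A small sketch of the reported evaluation metric, assuming predicted and ground-truth corners are stored as (N, 4, 2) arrays; using the box diagonal as the "size" of a light is an assumption, not necessarily the authors' definition.

```python
import numpy as np


def corner_errors(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (N, 4, 2) arrays of four (x, y) corners per vehicle light.
    Returns the mean pixel error and the mean error relative to light size."""
    dists = np.linalg.norm(pred - gt, axis=-1)            # (N, 4) per-corner errors
    mins, maxs = gt.min(axis=1), gt.max(axis=1)           # (N, 2) bounding box extremes
    size = np.linalg.norm(maxs - mins, axis=-1, keepdims=True) + 1e-8
    return dists.mean(), (dists / size).mean()


# Example: mean_px, mean_rel = corner_errors(np.random.rand(10, 4, 2) * 64,
#                                            np.random.rand(10, 4, 2) * 64)
```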

Physically Plausible 3D Human-Scene Reconstruction from Monocular RGB Image using an Adversarial Learning Approach

  • paper_url: http://arxiv.org/abs/2307.14570
  • repo_url: None
  • paper_authors: Sandika Biswas, Kejie Li, Biplab Banerjee, Subhasis Chaudhuri, Hamid Rezatofighi
  • for: 这篇论文旨在提出一种基于学习的整体人-场景三维重建方法,以解决从单张RGB图像进行物理合理的三维重建的问题。
  • methods: 该方法使用基于图的整体表示并编码场景的物理信息,通过对抗训练从数据中学习人与物体、物体与物体之间的可行布局,而无需显式定义物理约束。
  • results: 该方法可以达到与基于优化的方法相当的三维重建质量,且无需推理时优化,因此更适合机器人导航等机器人应用。
    Abstract Holistic 3D human-scene reconstruction is a crucial and emerging research area in robot perception. A key challenge in holistic 3D human-scene reconstruction is to generate a physically plausible 3D scene from a single monocular RGB image. The existing research mainly proposes optimization-based approaches for reconstructing the scene from a sequence of RGB frames with explicitly defined physical laws and constraints between different scene elements (humans and objects). However, it is hard to explicitly define and model every physical law in every scenario. This paper proposes using an implicit feature representation of the scene elements to distinguish a physically plausible alignment of humans and objects from an implausible one. We propose using a graph-based holistic representation with an encoded physical representation of the scene to analyze the human-object and object-object interactions within the scene. Using this graphical representation, we adversarially train our model to learn the feasible alignments of the scene elements from the training data itself without explicitly defining the laws and constraints between them. Unlike the existing inference-time optimization-based approaches, we use this adversarially trained model to produce a per-frame 3D reconstruction of the scene that abides by the physical laws and constraints. Our learning-based method achieves comparable 3D reconstruction quality to existing optimization-based holistic human-scene reconstruction methods and does not need inference time optimization. This makes it better suited when compared to existing methods, for potential use in robotic applications, such as robot navigation, etc.
    摘要 “整体3D人景重建是Robot感知领域中的关键和emerging研究领域。一个关键挑战是将单一RGB影像中的3D场景转换为物理可能的3D场景。现有的研究主要提出了基于优化的方法来从RGB影像序列中重建场景,并且明确定义了场景元素之间的物理法则和约束。但是,实际上很难明确地定义和模型每个场景中的物理法则。本文提出使用各元素的偏项特征来区别物理可能的人物和物品的平行配置和不可能的配置。我们提出使用图形基于的整体表示,将场景元素之间的物理表示编码到图形中,并通过对训练数据自身进行对抗学习,以学习场景元素之间的可能的平行配置。与现有的推理时间优化方法不同,我们使用这个对抗学习的模型来生成每帧3D重建结果,并且跟随物理法则和约束。我们的学习型方法与现有的优化型方法相比,具有更好的3D重建质量,并且不需要推理时间优化。这使得它更适合在Robot应用中使用,如Robot导航等。”

Towards multi-modal anatomical landmark detection for ultrasound-guided brain tumor resection with contrastive learning

  • paper_url: http://arxiv.org/abs/2307.14523
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: 这项研究旨在提出一种新的对比学习框架,用于在神经外科的多模态医学影像配准中检测对应的解剖标志点。
  • methods: 联合训练两个卷积神经网络,分别编码 MRI 和 US 影像特征,以匹配包含 MRI 中对应标志点的 US 影像块。
  • results: 该方法的平均标志点检测误差为 5.88±4.79 mm,显著优于 SIFT 特征的 18.78±4.77 mm。
    Abstract Homologous anatomical landmarks between medical scans are instrumental in quantitative assessment of image registration quality in various clinical applications, such as MRI-ultrasound registration for tissue shift correction in ultrasound-guided brain tumor resection. While manually identified landmark pairs between MRI and ultrasound (US) have greatly facilitated the validation of different registration algorithms for the task, the procedure requires significant expertise, labor, and time, and can be prone to inter- and intra-rater inconsistency. So far, many traditional and machine learning approaches have been presented for anatomical landmark detection, but they primarily focus on mono-modal applications. Unfortunately, despite the clinical needs, inter-modal/contrast landmark detection has very rarely been attempted. Therefore, we propose a novel contrastive learning framework to detect corresponding landmarks between MRI and intra-operative US scans in neurosurgery. Specifically, two convolutional neural networks were trained jointly to encode image features in MRI and US scans to help match the US image patch that contain the corresponding landmarks in the MRI. We developed and validated the technique using the public RESECT database. With a mean landmark detection accuracy of 5.88+-4.79 mm against 18.78+-4.77 mm with SIFT features, the proposed method offers promising results for MRI-US landmark detection in neurosurgical applications for the first time.
    摘要 医学影像之间的同源解剖标志点对于定量评估图像配准质量至关重要,例如在超声引导的脑肿瘤切除术中,利用 MRI-超声配准来校正组织移位。人工标注 MRI 与超声(US)之间的标志点对虽然极大地方便了各类配准算法的验证,但该过程需要大量专业知识、人力和时间,且易受标注者之间和标注者自身不一致的影响。现有的解剖标志点检测方法大多面向单一模态,跨模态/跨对比度的标志点检测则极少被尝试。为此,我们提出一种新的对比学习框架,用于在神经外科中检测 MRI 与术中 US 扫描之间的对应标志点:联合训练两个卷积神经网络,对 MRI 和 US 影像特征进行编码,以匹配包含对应标志点的 US 影像块。该方法基于公开的 RESECT 数据库开发并验证,平均标志点检测误差为 5.88±4.79 mm,优于 SIFT 特征的 18.78±4.77 mm,首次为神经外科应用中的 MRI-US 标志点检测提供了有前景的结果。
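
A minimal sketch (architecture and loss details assumed) of the underlying idea: two small CNN encoders embed MRI and US patches and are trained with an InfoNCE-style objective so that patches containing the same landmark map close together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(dim: int = 128) -> nn.Module:
    """Tiny stand-in patch encoder; the paper's backbones are assumed to differ."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
    )


def info_nce(z_mri: torch.Tensor, z_us: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """z_mri, z_us: (B, D); row i of each modality forms the matching pair."""
    z_mri, z_us = F.normalize(z_mri, dim=1), F.normalize(z_us, dim=1)
    logits = z_mri @ z_us.t() / tau                       # (B, B) similarity matrix
    labels = torch.arange(z_mri.size(0))                  # positives on the diagonal
    return F.cross_entropy(logits, labels)


# enc_mri, enc_us = make_encoder(), make_encoder()
# loss = info_nce(enc_mri(torch.randn(8, 1, 64, 64)), enc_us(torch.randn(8, 1, 64, 64)))
```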

FocalErrorNet: Uncertainty-aware focal modulation network for inter-modal registration error estimation in ultrasound-guided neurosurgery

  • paper_url: http://arxiv.org/abs/2307.14520
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: 本研究旨在提出一种深度学习技术,用于在脑肿瘤手术中评估 MRI-iUS 配准结果的准确性。
  • methods: 该技术基于 3D focal modulation 的深度学习方法,并结合不确定性估计来评估 MRI-iUS 配准误差。
  • results: 在公开的 RESECT 临床数据库上验证,该算法的配准误差估计精度为 0.59±0.57 mm。
    Abstract In brain tumor resection, accurate removal of cancerous tissues while preserving eloquent regions is crucial to the safety and outcomes of the treatment. However, intra-operative tissue deformation (called brain shift) can move the surgical target and render the pre-surgical plan invalid. Intra-operative ultrasound (iUS) has been adopted to provide real-time images to track brain shift, and inter-modal (i.e., MRI-iUS) registration is often required to update the pre-surgical plan. Quality control for the registration results during surgery is important to avoid adverse outcomes, but manual verification faces great challenges due to difficult 3D visualization and the low contrast of iUS. Automatic algorithms are urgently needed to address this issue, but the problem was rarely attempted. Therefore, we propose a novel deep learning technique based on 3D focal modulation in conjunction with uncertainty estimation to accurately assess MRI-iUS registration errors for brain tumor surgery. Developed and validated with the public RESECT clinical database, the resulting algorithm can achieve an estimation error of 0.59+-0.57 mm.
    摘要 在脑肿瘤切除手术中,在保护功能区的同时准确切除癌变组织,对治疗的安全性和疗效至关重要。然而,术中的组织变形(即脑移位)会使手术目标发生移动,导致术前计划失效。术中超声(iUS)被用于提供实时图像以追踪脑移位,且通常需要跨模态(MRI-iUS)配准来更新术前计划。术中对配准结果进行质量控制对于避免不良后果十分重要,但由于三维可视化困难且 iUS 对比度低,人工核查面临巨大挑战,亟需自动算法来解决这一问题,而相关研究极少。为此,我们提出一种基于 3D focal modulation 并结合不确定性估计的新型深度学习技术,用于准确评估脑肿瘤手术中的 MRI-iUS 配准误差。该方法基于公开的 RESECT 临床数据库开发并验证,误差估计精度可达 0.59±0.57 mm。

SuperInpaint: Learning Detail-Enhanced Attentional Implicit Representation for Super-resolutional Image Inpainting

  • paper_url: http://arxiv.org/abs/2307.14489
  • repo_url: None
  • paper_authors: Canyu Zhang, Qing Guo, Xiaoguang Li, Renjie Wan, Hongkai Yu, Ivor Tsang, Song Wang
  • for: This paper aims to address the challenging task of SuperInpaint, which involves reconstructing missing regions in low-resolution images and generating completed images with higher resolutions.
  • methods: The proposed method, called DEAR, uses a deep convolutional network to extract the latent embedding of an input image, and then enhances the high-frequency components of the latent embedding via an adaptive high-pass filter. The method also uses an unmask-attentional module to suppress embeddings from ineffective masked pixels, and an implicit representation to generate the color of the specified pixel.
  • results: The proposed method outperforms all existing methods by a significant margin on four widely used metrics, demonstrating its effectiveness in addressing the SuperInpaint task.
    Abstract In this work, we introduce a challenging image restoration task, referred to as SuperInpaint, which aims to reconstruct missing regions in low-resolution images and generate completed images with arbitrarily higher resolutions. We have found that this task cannot be effectively addressed by stacking state-of-the-art super-resolution and image inpainting methods as they amplify each other's flaws, leading to noticeable artifacts. To overcome these limitations, we propose the detail-enhanced attentional implicit representation (DEAR) that can achieve SuperInpaint with a single model, resulting in high-quality completed images with arbitrary resolutions. Specifically, we use a deep convolutional network to extract the latent embedding of an input image and then enhance the high-frequency components of the latent embedding via an adaptive high-pass filter. This leads to detail-enhanced semantic embedding. We further feed the semantic embedding into an unmask-attentional module that suppresses embeddings from ineffective masked pixels. Additionally, we extract a pixel-wise importance map that indicates which pixels should be used for image reconstruction. Given the coordinates of a pixel we want to reconstruct, we first collect its neighboring pixels in the input image and extract their detail-enhanced semantic embeddings, unmask-attentional semantic embeddings, importance values, and spatial distances to the desired pixel. Then, we feed all the above terms into an implicit representation and generate the color of the specified pixel. To evaluate our method, we extend three existing datasets for this new task and build 18 meaningful baselines using SOTA inpainting and super-resolution methods. Extensive experimental results demonstrate that our method outperforms all existing methods by a significant margin on four widely used metrics.
    摘要 在本工作中,我们提出了一个具有挑战性的图像复原任务 SuperInpaint,其目标是在低分辨率图像中重建缺失区域,并生成任意更高分辨率的完整图像。我们发现,简单地串联最先进的超分辨率与图像修复方法无法有效解决该任务,因为二者会相互放大对方的缺陷,产生明显的伪影。为克服这些限制,我们提出了细节增强的注意力隐式表示(DEAR),仅用单一模型即可实现 SuperInpaint,生成任意分辨率的高质量完整图像。具体而言,我们使用深度卷积网络提取输入图像的潜在嵌入,并通过自适应高通滤波器增强其高频成分,得到细节增强的语义嵌入;随后将语义嵌入送入去掩码注意力模块,抑制来自无效掩码像素的嵌入;此外,我们还提取逐像素的重要性图,指示哪些像素应被用于图像重建。给定待重建像素的坐标,我们首先收集其在输入图像中的邻近像素,提取它们的细节增强语义嵌入、去掩码注意力语义嵌入、重要性值以及与目标像素的空间距离,再将上述各项送入隐式表示以生成该像素的颜色。为评估该方法,我们针对这一新任务扩展了三个现有数据集,并用最先进的修复与超分辨率方法构建了 18 个有意义的基线。大量实验结果表明,我们的方法在四个常用指标上显著优于所有现有方法。

Role of Image Acquisition and Patient Phenotype Variations in Automatic Segmentation Model Generalization

  • paper_url: http://arxiv.org/abs/2307.14482
  • repo_url: None
  • paper_authors: Timothy L. Kline, Sumana Ramanathan, Harrison C. Gottlich, Panagiotis Korfiatis, Adriana V. Gregory
  • for: This study aimed to evaluate the out-of-domain performance and generalization capabilities of automated medical image segmentation models.
  • methods: The study used datasets from non-contrast and contrast-enhanced abdominal CT scans of healthy patients and those with polycystic kidney disease (PKD) to train and validate the models.
  • results: The models trained on a diverse range of data showed no worse performance than models trained exclusively on in-domain data when tested on in-domain data, and the study found that broader training examples significantly enhance model generalization and out-of-domain performance.
    Abstract Purpose: This study evaluated the out-of-domain performance and generalization capabilities of automated medical image segmentation models, with a particular focus on adaptation to new image acquisitions and disease type. Materials: Datasets from both non-contrast and contrast-enhanced abdominal CT scans of healthy patients and those with polycystic kidney disease (PKD) were used. A total of 400 images (100 non-contrast controls, 100 contrast controls, 100 non-contrast PKD, 100 contrast PKD) were utilized for training/validation of models to segment kidneys, livers, and spleens, and the final models were then tested on 100 non-contrast CT images of patients affected by PKD. Performance was evaluated using Dice, Jaccard, TPR, and Precision. Results: Models trained on a diverse range of data showed no worse performance than models trained exclusively on in-domain data when tested on in-domain data. For instance, the Dice similarity of the model trained on 25% from each dataset was found to be non-inferior to the model trained purely on in-domain data. Conclusions: The results indicate that broader training examples significantly enhances model generalization and out-of-domain performance, thereby improving automated segmentation tools' applicability in clinical settings. The study's findings provide a roadmap for future research to adopt a data-centric approach in medical image AI model development.
    摘要 目的:本研究评估了自动医学图像分割模型的域外性能与泛化能力,重点关注其对新图像采集方式和疾病类型的适应性。材料:使用了健康受试者及多囊肾病(PKD)患者的非增强和增强腹部CT数据集。共使用 400 幅图像(非增强对照、增强对照、非增强 PKD、增强 PKD 各 100 幅)训练和验证用于分割肾脏、肝脏和脾脏的模型,最终模型在 100 幅 PKD 患者的非增强 CT 图像上测试,性能采用 Dice、Jaccard、TPR 和 Precision 评估。结果:在域内数据上测试时,基于多样化数据训练的模型表现不逊于仅用域内数据训练的模型;例如,每个数据集各取 25% 训练得到的模型,其 Dice 相似度不劣于纯域内训练的模型。结论:更广泛的训练样本能显著提升模型的泛化和域外性能,从而提高自动分割工具在临床环境中的适用性,并为今后以数据为中心的医学图像 AI 模型开发提供了路线图。
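
For reference, the overlap metrics reported above can be computed as follows for binary masks; this is a generic implementation, not the study's evaluation code.

```python
import numpy as np


def overlap_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Dice, Jaccard, TPR (recall), and precision for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    jaccard = tp / (tp + fp + fn + eps)
    tpr = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    return dice, jaccard, tpr, precision
```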

MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.14460
  • repo_url: https://github.com/isl-org/MiDaS
  • paper_authors: Reiner Birkl, Diana Wofk, Matthias Müller
  • for: 这 paper 是为了提高 monocular depth estimation 的质量和速度而实现的。
  • methods: 这 paper 使用了不同的 encoder backbones,包括 BEiT、Swin、SwinV2、Next-ViT 和 LeViT,以实现不同的 performance-runtime 贸易。
  • results: 这 paper 的结果表明,使用最有前途的视transformer 作为图像编码器可以提高 depth estimation 的质量,同时可以提高下游任务的高帧率。 Code 可以在 https://github.com/isl-org/MiDaS 找到。
    Abstract We release MiDaS v3.1 for monocular depth estimation, offering a variety of new models based on different encoder backbones. This release is motivated by the success of transformers in computer vision, with a large variety of pretrained vision transformers now available. We explore how using the most promising vision transformers as image encoders impacts depth estimation quality and runtime of the MiDaS architecture. Our investigation also includes recent convolutional approaches that achieve comparable quality to vision transformers in image classification tasks. While the previous release MiDaS v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models offer different performance-runtime tradeoffs. The best model improves the depth estimation quality by 28% while efficient models enable downstream tasks requiring high frame rates. We also describe the general process for integrating new backbones. A video summarizing the work can be found at https://youtu.be/UjaeNNFf9sE and the code is available at https://github.com/isl-org/MiDaS.
    摘要 我们发布了MiDaS v3.1,用于单目深度估计,提供了多种基于不同Encoder脊梁的新模型。这个发布是由计算机视觉中transformer的成功所 inspirited,现有大量预训练视觉transformer可用。我们研究了使用最有前途的视觉transformer作为图像编码器对深度估计质量和 runtime MiDaS架构的影响。我们的调查还包括最近的 convolutional方法,可以与视觉transformer相比肩。在上一个发布MiDaS v3.0中,我们只使用了vanilla vision transformer ViT,而MiDaS v3.1提供了基于BEiT、Swin、SwinV2、Next-ViT和LeViT的additional模型。这些模型提供了不同的性能-时间质量交易。最佳模型可以提高深度估计质量28%,而高效的模型可以支持需要高帧率的下游任务。我们还描述了将新的脊梁集成到MiDaS架构的一般过程。关于这个工作的视频可以在https://youtu.be/UjaeNNFf9sE找到,代码可以在https://github.com/isl-org/MiDaS中找到。
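
A usage sketch following the torch.hub pattern documented in the MiDaS repository; exact v3.1 backbone identifiers and their matching transforms vary by release, so the README should be checked first.

```python
import cv2
import torch

# "DPT_Large" is a long-standing hub entry point; MiDaS v3.1 adds further
# backbones (BEiT/Swin/LeViT variants) whose exact identifiers and matching
# transforms should be confirmed in the repository before use.
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()

midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform if "DPT" in model_type else midas_transforms.small_transform

img = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()  # relative inverse-depth map at the input resolution
```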

Self-supervised Few-shot Learning for Semantic Segmentation: An Annotation-free Approach

  • paper_url: http://arxiv.org/abs/2307.14446
  • repo_url: https://github.com/mindflow-institue/annotation_free_fewshot
  • paper_authors: Sanaz Karimijafarbigloo, Reza Azad, Dorit Merhof
  • for: The paper addresses few-shot semantic segmentation (FSS) in medical image analysis, where annotated data is scarce; the proposed method requires no annotated semantic classes, making it suitable for medical images with limited annotations.
  • methods: Image decomposition is reframed as a graph partitioning task using eigenvectors of the Laplacian of a self-supervised feature affinity matrix; a self-supervised FSS framework adaptively estimates the query mask from the support-image eigenvectors without manual annotation, and a multi-scale large kernel attention module selectively emphasizes relevant features and details to improve object delineation.
  • results: Evaluations on natural and medical image datasets demonstrate the efficiency and effectiveness of the approach, which is general, model-agnostic, and integrates with various deep architectures.
    Abstract Few-shot semantic segmentation (FSS) offers immense potential in the field of medical image analysis, enabling accurate object segmentation with limited training data. However, existing FSS techniques heavily rely on annotated semantic classes, rendering them unsuitable for medical images due to the scarcity of annotations. To address this challenge, multiple contributions are proposed: First, inspired by spectral decomposition methods, the problem of image decomposition is reframed as a graph partitioning task. The eigenvectors of the Laplacian matrix, derived from the feature affinity matrix of self-supervised networks, are analyzed to estimate the distribution of the objects of interest from the support images. Secondly, we propose a novel self-supervised FSS framework that does not rely on any annotation. Instead, it adaptively estimates the query mask by leveraging the eigenvectors obtained from the support images. This approach eliminates the need for manual annotation, making it particularly suitable for medical images with limited annotated data. Thirdly, to further enhance the decoding of the query image based on the information provided by the support image, we introduce a multi-scale large kernel attention module. By selectively emphasizing relevant features and details, this module improves the segmentation process and contributes to better object delineation. Evaluations on both natural and medical image datasets demonstrate the efficiency and effectiveness of our method. Moreover, the proposed approach is characterized by its generality and model-agnostic nature, allowing for seamless integration with various deep architectures. The code is publicly available at \href{https://github.com/mindflow-institue/annotation_free_fewshot}{\textcolor{magenta}{GitHub}.
    摘要 本文提出了若干方案,以实现无需标注的小样本语义分割(FSS)。首先,受谱分解方法启发,将图像分解问题转化为图划分任务,通过分析由自监督网络特征亲和矩阵导出的拉普拉斯矩阵特征向量,估计支持图像中目标对象的分布。其次,提出了一种不依赖任何标注的自监督 FSS 框架,利用支持图像得到的特征向量自适应地估计查询掩码。此外,引入多尺度大核注意力模块,有选择地强调相关特征与细节,进一步改进分割过程。在自然图像和医学图像数据集上的实验表明,该方法高效且有效,并且具有通用性和模型无关性,可与多种深度架构无缝集成。代码可在 https://github.com/mindflow-institue/annotation_free_fewshot 获取。
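
A minimal sketch of the spectral step described above, assuming patch embeddings from a self-supervised backbone are already available; thresholding the Fiedler vector at its median is an illustrative choice.

```python
import numpy as np


def spectral_foreground(features: np.ndarray) -> np.ndarray:
    """features: (N, D) patch embeddings; returns a boolean foreground
    assignment of length N based on the second-smallest Laplacian eigenvector."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    affinity = np.clip(f @ f.T, 0, None)           # non-negative cosine affinities
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                        # skip the trivial constant eigenvector
    return fiedler > np.median(fiedler)
```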

Phenotype-preserving metric design for high-content image reconstruction by generative inpainting

  • paper_url: http://arxiv.org/abs/2307.14436
  • repo_url: None
  • paper_authors: Vaibhav Sharma, Artur Yakimovich
  • for: 这篇论文研究在自动化高内容显微成像分析中,如何利用生成式修复技术处理和修复图像中的伪影(artefacts)。
  • methods: 使用最先进的图像修复(inpainting)方法,如 DeepFill V2 和 Edge Connect,对高内容荧光显微图像进行修复;这些方法经少量数据微调即可实现对显微图像的忠实复原。
  • results: 研究发现,Restoration的质量与修复区域的大小有关,而不是形状。此外,提出了一种新的phenotype-preserving metric设计策略,该策略可以控制修复的质量,并且可以扩展到其他应用。
    Abstract In the past decades, automated high-content microscopy demonstrated its ability to deliver large quantities of image-based data powering the versatility of phenotypic drug screening and systems biology applications. However, as the sizes of image-based datasets grew, it became infeasible for humans to control, avoid and overcome the presence of imaging and sample preparation artefacts in the images. While novel techniques like machine learning and deep learning may address these shortcomings through generative image inpainting, when applied to sensitive research data this may come at the cost of undesired image manipulation. Undesired manipulation may be caused by phenomena such as neural hallucinations, to which some artificial neural networks are prone. To address this, here we evaluate the state-of-the-art inpainting methods for image restoration in a high-content fluorescence microscopy dataset of cultured cells with labelled nuclei. We show that architectures like DeepFill V2 and Edge Connect can faithfully restore microscopy images upon fine-tuning with relatively little data. Our results demonstrate that the area of the region to be restored is of higher importance than shape. Furthermore, to control for the quality of restoration, we propose a novel phenotype-preserving metric design strategy. In this strategy, the size and count of the restored biological phenotypes like cell nuclei are quantified to penalise undesirable manipulation. We argue that the design principles of our approach may also generalise to other applications.
    摘要 过去几十年,自动化高内容显微成像证明了其能够提供大量基于图像的数据,支撑表型药物筛选和系统生物学应用的多样性。然而,随着图像数据规模的增长,人工已无法控制、避免和克服图像中的成像与样本制备伪影。机器学习和深度学习等新技术可以通过生成式图像修复来弥补这些不足,但应用于敏感的科研数据时,可能带来不期望的图像篡改,例如某些人工神经网络易产生的"神经网络幻觉"。为此,我们在带有细胞核标记的培养细胞高内容荧光显微数据集上评估了当前最先进的图像修复方法。结果表明,DeepFill V2 和 Edge Connect 等架构在仅用较少数据微调后即可忠实地复原显微图像,且待修复区域的面积比其形状更为重要。此外,为控制修复质量,我们提出了一种新的表型保持度量设计策略:量化修复后生物表型(如细胞核)的大小与数量,以惩罚不期望的篡改。我们认为该方法的设计原则也可推广到其他应用。
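
A small sketch of the phenotype-preserving idea: compare the number of nuclei before and after inpainting and penalize discrepancies. The threshold and the use of raw connected components are assumptions, not the paper's exact metric.

```python
import numpy as np
from scipy import ndimage


def phenotype_penalty(original: np.ndarray, restored: np.ndarray, thresh: float = 0.5) -> float:
    """original, restored: grayscale nuclei-channel images in [0, 1].
    Returns the relative change in nucleus count (0 = phenotype preserved)."""
    _, n_orig = ndimage.label(original > thresh)   # connected components as nuclei
    _, n_rest = ndimage.label(restored > thresh)
    return abs(n_rest - n_orig) / max(n_orig, 1)
```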

ProtoASNet: Dynamic Prototypes for Inherently Interpretable and Uncertainty-Aware Aortic Stenosis Classification in Echocardiography

  • paper_url: http://arxiv.org/abs/2307.14433
  • repo_url: https://github.com/hooman007/protoasnet
  • paper_authors: Hooman Vaseli, Ang Nan Gu, S. Neda Ahmadi Amiri, Michael Y. Tsang, Andrea Fung, Nima Kondori, Armin Saadat, Purang Abolmaesumi, Teresa S. M. Tsang
  • for: 本研究旨在提出一种可靠的自动评估主动脉瓣狭窄(AS)严重程度的方法,以便及时采取治疗。
  • methods: 该方法基于原型网络,直接从 B 型超声心动图视频中检测 AS,并通过与学习到的时空原型进行相似度比较,给出可解释的预测结果。
  • results: 在一个私有数据集和公开的 TMED-2 数据集上,ProtoASNet 分别以 80.0% 和 79.7% 的准确率优于现有最先进方法;此外,ProtoASNet 还为每次预测提供可解释性和不确定性度量,有助于深度网络辅助临床决策。
    Abstract Aortic stenosis (AS) is a common heart valve disease that requires accurate and timely diagnosis for appropriate treatment. Most current automatic AS severity detection methods rely on black-box models with a low level of trustworthiness, which hinders clinical adoption. To address this issue, we propose ProtoASNet, a prototypical network that directly detects AS from B-mode echocardiography videos, while making interpretable predictions based on the similarity between the input and learned spatio-temporal prototypes. This approach provides supporting evidence that is clinically relevant, as the prototypes typically highlight markers such as calcification and restricted movement of aortic valve leaflets. Moreover, ProtoASNet utilizes abstention loss to estimate aleatoric uncertainty by defining a set of prototypes that capture ambiguity and insufficient information in the observed data. This provides a reliable system that can detect and explain when it may fail. We evaluate ProtoASNet on a private dataset and the publicly available TMED-2 dataset, where it outperforms existing state-of-the-art methods with an accuracy of 80.0% and 79.7%, respectively. Furthermore, ProtoASNet provides interpretability and an uncertainty measure for each prediction, which can improve transparency and facilitate the interactive usage of deep networks to aid clinical decision-making. Our source code is available at: https://github.com/hooman007/ProtoASNet.
    摘要 主动脉瓣狭窄(AS)是一种常见的心脏瓣膜疾病,需要准确而及时的诊断以便进行恰当的治疗。现有的大多数自动 AS 严重程度检测方法依赖可信度较低的黑盒模型,阻碍了其临床应用。为解决这一问题,我们提出了 ProtoASNet,一种原型网络,可直接从 B 型超声心动图视频中检测 AS,并基于输入与学习到的时空原型之间的相似度给出可解释的预测。这种方式提供了与临床相关的支持证据,原型通常会突出主动脉瓣叶钙化和活动受限等标志。此外,ProtoASNet 利用弃权损失(abstention loss)来估计偶然不确定性,通过定义一组捕捉数据中模糊与信息不足情况的原型,构建一个能够检测并解释自身可能失效情形的可靠系统。我们在一个私有数据集和公开的 TMED-2 数据集上评估了 ProtoASNet,其准确率分别达到 80.0% 和 79.7%,优于现有最先进方法。此外,ProtoASNet 为每次预测提供可解释性和不确定性度量,可提升透明度,促进深度网络在辅助临床决策中的交互式使用。源代码见:https://github.com/hooman007/ProtoASNet。
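
A conceptual sketch (shapes and the similarity measure are assumptions, not the ProtoASNet code) of prototype-based classification with abstention prototypes, where similarity mass assigned to abstention prototypes is read out as uncertainty.

```python
import torch
import torch.nn.functional as F


def prototype_predict(feat: torch.Tensor, protos: torch.Tensor,
                      proto_class: torch.Tensor, n_classes: int):
    """feat: (B, D) case embedding; protos: (P, D) learned prototypes;
    proto_class: (P,) long tensor with values in [0, n_classes-1] for class
    prototypes and n_classes for abstention prototypes."""
    sims = F.softmax(F.normalize(feat, dim=1) @ F.normalize(protos, dim=1).t(), dim=1)  # (B, P)
    scores = torch.zeros(feat.size(0), n_classes + 1, device=feat.device)
    scores.index_add_(1, proto_class, sims)    # pool similarity per class / abstention bucket
    probs = scores[:, :n_classes]              # class evidence
    uncertainty = scores[:, n_classes]         # mass on abstention prototypes
    return probs, uncertainty
```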

Virtual Mirrors: Non-Line-of-Sight Imaging Beyond the Third Bounce

  • paper_url: http://arxiv.org/abs/2307.14341
  • repo_url: None
  • paper_authors: Diego Royo, Talha Sultan, Adolfo Muñoz, Khadijeh Masumnia-Bisheh, Eric Brandt, Diego Gutierrez, Andreas Velten, Julio Marco
  • for: 该研究旨在扩展非视距(NLOS)成像方法的能力,以重建观察者不可见的场景。
  • methods: 该研究基于计算波动 NLOS 成像,关键观察是平面漫反射表面在所用波长下表现出镜面反射特性(即"虚拟镜面"),并利用这一特性将成像扩展到三次反弹之外的光照。
  • results: 通过分析场景表面上已知照明点的反射来估计可见性受限物体的位置和朝向,并通过在其他表面上计算构建辅助孔径,该方法能够成像可见角度受限的单拐角物体以及隐藏在两个拐角之后的物体。
    Abstract Non-line-of-sight (NLOS) imaging methods are capable of reconstructing complex scenes that are not visible to an observer using indirect illumination. However, they assume only third-bounce illumination, so they are currently limited to single-corner configurations, and present limited visibility when imaging surfaces at certain orientations. To reason about and tackle these limitations, we make the key observation that planar diffuse surfaces behave specularly at wavelengths used in the computational wave-based NLOS imaging domain. We call such surfaces virtual mirrors. We leverage this observation to expand the capabilities of NLOS imaging using illumination beyond the third bounce, addressing two problems: imaging single-corner objects at limited visibility angles, and imaging objects hidden behind two corners. To image objects at limited visibility angles, we first analyze the reflections of the known illuminated point on surfaces of the scene as an estimator of the position and orientation of objects with limited visibility. We then image those limited visibility objects by computationally building secondary apertures at other surfaces that observe the target object from a direct visibility perspective. Beyond single-corner NLOS imaging, we exploit the specular behavior of virtual mirrors to image objects hidden behind a second corner by imaging the space behind such virtual mirrors, where the mirror image of objects hidden around two corners is formed. No specular surfaces were involved in the making of this paper.

MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.14336
  • repo_url: None
  • paper_authors: Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, Fatih Porikli
  • for: 提高单目视频深度估计精度
  • methods: 使用内存和注意力框架,将单目深度估计网络转化为视频深度估计模型,利用视频 temporal 信息提高深度估计精度
  • results: 在多个benchmark上实现了新的state-of-the-art(SOTA)精度,并且在响应时间方面与cost-volume-based视频深度模型相比,提供了更高的精度和更低的响应时间。
    Abstract We propose MAMo, a novel memory and attention frame-work for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation networks into video depth estimation models, enabling them to take advantage of the temporal information to predict more accurate depth. In MAMo, we augment model with memory which aids the depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens of the previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt attention-based approach to process memory features where we first learn the spatio-temporal relation among the resultant visual and displacement memory tokens using self-attention module. Further, the output features of self-attention are aggregated with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency, when omparing to SOTA cost-volume-based video depth models.
    摘要 我们提出了 MAMo,一种用于单目视频深度估计的新型记忆与注意力框架。MAMo 可以将任意单幅图像深度估计网络扩展为视频深度估计模型,使其能够利用时间信息预测更准确的深度。在 MAMo 中,我们为模型引入记忆,以在模型逐帧处理视频时辅助深度预测:记忆存储先前时刻学习到的视觉令牌和位移令牌,使深度网络在预测当前帧深度时能够参考过去的相关特征。我们提出了一种持续更新记忆的新方案,使其保留同时对应过去与当前视觉信息的令牌;并采用基于注意力的方式处理记忆特征,先用自注意力建模视觉与位移记忆令牌之间的时空关系,再通过交叉注意力将其与当前视觉特征聚合,最后由解码器预测当前帧深度。在 KITTI、NYU-Depth V2 和 DDAD 等多个基准上的大量实验表明,MAMo 能持续提升单目深度估计网络并取得新的最先进(SOTA)精度;与基于代价体的视频深度模型相比,MAMo 以更低的延迟提供了更高的精度。
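
A conceptual sketch of the memory-and-attention pattern described above; the FIFO memory update and the dimensions are assumptions, whereas MAMo optimizes which tokens to keep.

```python
import torch
import torch.nn as nn


class MemoryAttention(nn.Module):
    """Past tokens are related via self-attention, then the current frame's
    tokens cross-attend to them before depth decoding."""

    def __init__(self, dim: int = 256, heads: int = 8, mem_len: int = 64):
        super().__init__()
        self.mem_len = mem_len
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur: torch.Tensor, memory: torch.Tensor):
        # cur: (B, N, C) current-frame tokens; memory: (B, M, C) past tokens
        mem, _ = self.self_attn(memory, memory, memory)   # spatio-temporal relations in memory
        fused, _ = self.cross_attn(cur, mem, mem)         # current frame attends to the past
        # Simple FIFO update kept for illustration; the paper learns what to retain.
        new_memory = torch.cat([memory, cur], dim=1)[:, -self.mem_len:]
        return cur + fused, new_memory
```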

Visual Instruction Inversion: Image Editing via Visual Prompting

  • paper_url: http://arxiv.org/abs/2307.14331
  • repo_url: https://github.com/thaoshibe/visii
  • paper_authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee
  • for: 图像编辑
  • methods: 使用视觉提示学习文本编辑指令
  • results: 仅用一对示例图像,即可取得与最先进的文本条件图像编辑框架相竞争的结果
    Abstract Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given pairs of example that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.
    摘要 文本条件图像编辑已成为一种强大的图像编辑工具。然而在许多情况下,语言在描述特定的图像编辑时可能含糊且低效。面对这类挑战,视觉提示可以成为一种更直观、信息更丰富的表达方式。我们提出了一种基于视觉提示的图像编辑方法:给定代表编辑前后的"before"和"after"示例图像对,我们的目标是学习一个基于文本的编辑方向,用于在新图像上执行相同的编辑。我们利用预训练文本到图像扩散模型丰富的编辑能力,将视觉提示反演为编辑指令。结果表明,仅凭一对示例,我们即可取得与最先进的文本条件图像编辑框架相竞争的结果。

US & MR Image-Fusion Based on Skin Co-Registration

  • paper_url: http://arxiv.org/abs/2307.14288
  • repo_url: None
  • paper_authors: Martina Paccini, Giacomo Paschina, Stefano De Beni, Giuseppe Patanè
  • for: 该研究旨在开发一种便携式的医学影像融合系统,将 CT 和 MRI 影像与实时超声(US)采集进行配准融合,以便在介入手术中进行实时追踪与操作。
  • methods: 该研究利用 3D 摄像头传感器,将 CT 和 MRI 影像与实时 US 采集进行配准融合。
  • results: 这个研究的主要结果是一个可携性高的医学影像融合系统,可以在不同的生物学结构中进行实时追踪和操作。
    Abstract The study and development of innovative solutions for the advanced visualisation, representation and analysis of medical images offer different research directions. Current practice in medical imaging consists in combining real-time US with imaging modalities that allow internal anatomy acquisitions, such as CT, MRI, PET or similar. Application of image-fusion approaches can be found in tracking surgical tools and/or needles, in real-time during interventions. Thus, this work proposes a fusion imaging system for the registration of CT and MRI images with real-time US acquisition leveraging a 3D camera sensor. The main focus of the work is the portability of the system and its applicability to different anatomical districts.
    摘要 研究和开发创新解决方案 для高级医学图像可视化、表示和分析提供了不同的研究方向。现有医学成像做法是将实时US与可以获取内部解剖结构的成像方式结合,如CT、MRI、PET等。在实时手术过程中,应用图像融合方法可以跟踪手术工具和/或针刺的位置。这项工作提议了CT和MRI图像融合系统,通过3D摄像头感知器实现实时US获取。主要关注点是系统的可搬性和适用于不同的解剖区域。Note: Please note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need the translation in Traditional Chinese, please let me know.

Large-scale Fully-Unsupervised Re-Identification

  • paper_url: http://arxiv.org/abs/2307.14278
  • repo_url: None
  • paper_authors: Gabriel Bertocco, Fernanda Andaló, Terrance E. Boult, Anderson Rocha
  • for: This paper focuses on fully-unsupervised person and vehicle re-identification, with the goal of improving the robustness and efficiency of the re-identification process in large-scale scenarios.
  • methods: The proposed methodology consists of two strategies: local neighborhood sampling and a novel Re-Ranking technique. The first strategy reduces the dataset size in each iteration without violating neighborhood relationships, while the second strategy reduces the time and memory complexity of the Re-Ranking process. Additionally, the paper introduces a novel scheduling algorithm that adjusts the density parameter during training to leverage the diversity of samples and keep the learning robust to noisy labeling.
  • results: The proposed method outperforms state-of-the-art methods in well-known benchmarks and in the challenging large-scale Veri-Wild dataset. The method achieves a faster and memory-efficient Re-Ranking strategy, and a large-scale, noisy-robust, and ensemble-based learning approach.
    Abstract Fully-unsupervised Person and Vehicle Re-Identification have received increasing attention due to their broad applicability in surveillance, forensics, event understanding, and smart cities, without requiring any manual annotation. However, most of the prior art has been evaluated in datasets that have just a couple thousand samples. Such small-data setups often allow the use of costly techniques in time and memory footprints, such as Re-Ranking, to improve clustering results. Moreover, some previous work even pre-selects the best clustering hyper-parameters for each dataset, which is unrealistic in a large-scale fully-unsupervised scenario. In this context, this work tackles a more realistic scenario and proposes two strategies to learn from large-scale unlabeled data. The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships. A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n. To avoid the pre-selection of specific hyper-parameter values for the clustering algorithm, we also present a novel scheduling algorithm that adjusts the density parameter during training, to leverage the diversity of samples and keep the learning robust to noisy labeling. Finally, due to the complementary knowledge learned by different models, we also introduce a co-training strategy that relies upon the permutation of predicted pseudo-labels, among the backbones, with no need for any hyper-parameters or weighting optimization. The proposed methodology outperforms the state-of-the-art methods in well-known benchmarks and in the challenging large-scale Veri-Wild dataset, with a faster and memory-efficient Re-Ranking strategy, and a large-scale, noisy-robust, and ensemble-based learning approach.
    摘要 完全无监督的行人与车辆重识别由于无需任何人工标注,在监控、取证、事件理解和智慧城市等领域具有广泛的应用前景,因而受到越来越多的关注。然而,以往的大多数工作都是在仅有几千个样本的数据集上评估的。这种小数据设定往往允许使用时间和内存开销高昂的技术(如重排序 Re-Ranking)来改进聚类结果;一些工作甚至为每个数据集预先挑选最优的聚类超参数,这在大规模完全无监督场景下并不现实。在此背景下,本工作面向更现实的场景,提出了两种从大规模无标注数据中学习的策略。第一种策略进行局部邻域采样,在不破坏邻域关系的前提下减小每轮迭代的数据规模;第二种策略利用一种新的重排序技术,其时间复杂度上界更低,并将内存复杂度从 O(n^2) 降至 O(kn)(其中 k<<n)。为避免为聚类算法预先选择特定的超参数,我们还提出了一种新的调度算法,在训练过程中动态调整密度参数,以利用样本的多样性并使学习对噪声标签保持鲁棒。最后,由于不同模型学习到的知识互补,我们引入了一种协同训练策略,通过在不同骨干网络之间置换预测的伪标签来实现,无需任何超参数或权重优化。所提方法在多个知名基准和具有挑战性的大规模 Veri-Wild 数据集上超越了现有最先进方法,具有更快且更省内存的重排序策略,以及大规模、抗噪声、基于集成的学习方法。
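
An illustrative sketch of why keeping only top-k neighbours avoids the O(n^2) memory footprint of conventional re-ranking; the Jaccard-style score below is a simplification, not the paper's Re-Ranking rule.

```python
import numpy as np


def topk_neighbours(feats: np.ndarray, k: int = 30, chunk: int = 1024) -> np.ndarray:
    """feats: (n, d) L2-normalized embeddings; returns (n, k) neighbour indices
    without ever materializing the full (n, n) similarity matrix."""
    n = feats.shape[0]
    out = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):
        sims = feats[start:start + chunk] @ feats.T          # (chunk, n) block at a time
        out[start:start + chunk] = np.argsort(-sims, axis=1)[:, 1:k + 1]  # drop self-match
    return out


def reciprocal_score(i: int, j: int, nbrs: np.ndarray) -> float:
    """Jaccard overlap of two k-neighbourhoods, usable to refine an initial
    ranking while storing only O(kn) neighbour indices."""
    a, b = set(nbrs[i]), set(nbrs[j])
    return len(a & b) / len(a | b)
```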

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

  • paper_url: http://arxiv.org/abs/2307.14277
  • repo_url: None
  • paper_authors: Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou
  • for: 这篇论文旨在解决视频时刻定位(video grounding)问题,针对直接将朴素对比学习用于该任务的不足。
  • methods: 提出了基于测地距离与博弈论的语义对齐且均匀的视频定位框架 G2L,利用测地距离度量时刻之间的语义相似性,并从博弈论的角度提出基于测地距离采样的语义 Shapley 交互,以学习细粒度的语义对齐。
  • results: 实验结果表明,G2L方法可以有效地解决视频定位问题,并且在三个基准数据集上达到了最高的性能。
    Abstract The recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) \emph{alignment} of features of similar samples, and (2) \emph{uniformity} of the induced distribution of the normalized features on the hypersphere. Due to two annoying issues in video grounding: (1) the co-existence of some visual entities in both ground truth and other moments, \ie semantic overlapping; (2) only a few moments in the video are annotated, \ie sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learned inconsistent video representations. Both characteristics lead to vanilla contrastive learning being unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments leveraging the geodesic distance that guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment in similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
    摘要 最近的视频定位工作尝试将纯然对偶学习引入到视频定位中。然而,我们认为这种简单的解决方案是不合适的。对偶学习需要两个关键性质:(1)对类似样本的特征进行Alignment,以及(2)在归一化后的特征分布在圆柱体上具有 uniformity。由于视频定位中存在两个烦人的问题:(1)视频中存在一些视觉实体同时出现在真实标注和其他时刻中,即semantic overlapping;(2)只有一些时刻在视频中被标注,即稀疏标注困难,纯然对偶学习无法模型视频中的时间距离相关性和学习不一致的视频表示。这两个特征导致纯然对偶学习不适用于视频定位。在这篇论文中,我们提出了Geodesic and Game Localization(G2L)方法,该方法通过几何学和游戏理论来实现semantically aligned和uniform的视频定位框架。我们利用几何学距离来衡量时间距离相关性,使模型学习正确的交叉模式表示。此外,从游戏理论的新角度出发,我们提出了semantic Shapley交互,基于几何学距离采样来学习细致的semantic alignment。实验结果表明,我们的方法有效地解决了视频定位中的问题。

Deepfake Image Generation for Improved Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.14273
  • repo_url: None
  • paper_authors: Roa’a Al-Emaryeen, Sara Al-Nahhas, Fatima Himour, Waleed Mahafza, Omar Al-Kadi
  • for: 针对脑肿瘤分割任务,利用深度伪造图像生成提升分割效果
  • methods: 使用生成对抗网络进行图像到图像翻译,然后使用U-Net适应性 convolutional neural network进行图像分割
  • results: 比较四个公共数据集的基准值,显示提高图像分割质量指标的性能
    Abstract As the world progresses in technology and health, awareness of disease by revealing asymptomatic signs improves. It is important to detect and treat tumors in early stage as it can be life-threatening. Computer-aided technologies are used to overcome lingering limitations facing disease diagnosis, while brain tumor segmentation remains a difficult process, especially when multi-modality data is involved. This is mainly attributed to ineffective training due to lack of data and corresponding labelling. This work investigates the feasibility of employing deep-fake image generation for effective brain tumor segmentation. To this end, a Generative Adversarial Network was used for image-to-image translation for increasing dataset size, followed by image segmentation using a U-Net-based convolutional neural network trained with deepfake images. Performance of the proposed approach is compared with ground truth of four publicly available datasets. Results show improved performance in terms of image segmentation quality metrics, and could potentially assist when training with limited data.
    摘要 随着技术与医疗的进步,通过发现无症状体征来提高疾病的早期察觉变得愈发重要;肿瘤可能危及生命,因此早期发现和治疗至关重要。计算机辅助技术被用于克服疾病诊断中长期存在的局限,而脑肿瘤分割仍然是一个困难的过程,尤其是在涉及多模态数据时,这主要归因于数据及相应标注的缺乏导致训练效果不佳。本工作研究了利用深度伪造(deepfake)图像生成来实现有效脑肿瘤分割的可行性:先使用生成对抗网络进行图像到图像的转换以扩充数据集规模,再使用以深度伪造图像训练的基于 U-Net 的卷积神经网络进行图像分割。我们将所提方法的性能与四个公开数据集的真实标注进行了比较,结果显示其在图像分割质量指标上有所提升,有望在训练数据有限时提供帮助。