cs.CV - 2023-08-12

Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh Reconstruction

  • paper_url: http://arxiv.org/abs/2308.06554
  • repo_url: https://github.com/hygenie1228/cycleadapt_release
  • paper_authors: Hyeongjin Nam, Daniel Sungho Jung, Yeonguk Oh, Kyoung Mu Lee
  • for: addresses the domain gap problem in 3D human mesh reconstruction by proposing a cyclic adaptation method that leverages both 2D and 3D evidence.
  • methods: the proposed method consists of two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), which are cyclically adapted given a test video. The 3D supervision targets generated by MDNet are used to fully supervise HMRNet, reducing the reliance on 2D evidence.
  • results: the proposed method achieves state-of-the-art performance compared to previous test-time adaptation methods, demonstrating the effectiveness of the cyclic adaptation scheme in addressing the domain gap problem.
    Abstract Despite recent advances in 3D human mesh reconstruction, domain gap between training and test data is still a major challenge. Several prior works tackle the domain gap problem via test-time adaptation that fine-tunes a network relying on 2D evidence (e.g., 2D human keypoints) from test images. However, the high reliance on 2D evidence during adaptation causes two major issues. First, 2D evidence induces depth ambiguity, preventing the learning of accurate 3D human geometry. Second, 2D evidence is noisy or partially non-existent during test time, and such imperfect 2D evidence leads to erroneous adaptation. To overcome the above issues, we introduce CycleAdapt, which cyclically adapts two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), given a test video. In our framework, to alleviate high reliance on 2D evidence, we fully supervise HMRNet with generated 3D supervision targets by MDNet. Our cyclic adaptation scheme progressively elaborates the 3D supervision targets, which compensate for imperfect 2D evidence. As a result, our CycleAdapt achieves state-of-the-art performance compared to previous test-time adaptation methods. The codes are available at https://github.com/hygenie1228/CycleAdapt_RELEASE.
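
To make the cyclic scheme concrete, here is a minimal sketch of how such an adaptation loop could be organized. HMRNet and MDNet below are stand-in modules, the data is random, and the loss forms and iteration counts are assumptions; only the alternating update structure follows the abstract.

```python
# Hypothetical sketch of CycleAdapt-style cyclic test-time adaptation (not the authors' code).
import torch
import torch.nn as nn

class HMRNet(nn.Module):                      # per-frame image features -> 3D pose params
    def __init__(self, feat_dim=512, pose_dim=72):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))
    def forward(self, x):
        return self.head(x)

class MDNet(nn.Module):                       # denoises a whole pose sequence
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)
    def forward(self, seq):                   # seq: (1, T, pose_dim)
        h, _ = self.rnn(seq)
        return self.out(h)

video_feats = torch.randn(1, 90, 512)         # T=90 pre-extracted frame features (dummy)
hmr, mdn = HMRNet(), MDNet()
opt_hmr = torch.optim.Adam(hmr.parameters(), lr=1e-4)
opt_mdn = torch.optim.Adam(mdn.parameters(), lr=1e-4)

for cycle in range(3):                         # cyclic adaptation on the test video
    with torch.no_grad():                      # 1) HMRNet predicts noisy per-frame poses
        noisy_poses = hmr(video_feats)
    for _ in range(20):                        # 2) adapt MDNet to denoise this sequence
        denoised = mdn(noisy_poses)
        loss_md = (denoised - noisy_poses).pow(2).mean()   # placeholder reconstruction loss
        opt_mdn.zero_grad(); loss_md.backward(); opt_mdn.step()
    with torch.no_grad():                      # 3) denoised poses become 3D supervision targets
        targets = mdn(noisy_poses)
    for _ in range(20):                        # 4) fully supervise HMRNet with the 3D targets
        pred = hmr(video_feats)
        loss_hmr = (pred - targets).pow(2).mean()
        opt_hmr.zero_grad(); loss_hmr.backward(); opt_hmr.step()
    print(f"cycle {cycle}: md={loss_md.item():.4f} hmr={loss_hmr.item():.4f}")
```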

Revisiting Vision Transformer from the View of Path Ensemble

  • paper_url: http://arxiv.org/abs/2308.06548
  • repo_url: None
  • paper_authors: Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou
  • for: proposes a new perspective in which a stack of transformer layers can be viewed as an ensemble network of multiple parallel paths with different lengths.
  • methods: equivalently transforms the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths per transformer layer, and uses the identity connection to turn the ViT into an explicit multi-path ensemble network.
  • results: investigating the influence of each path on the final prediction reveals that some paths even degrade performance; path pruning and EnsembleScale are therefore proposed to optimize the path combination and let short paths focus on providing high-quality representations, and self-distillation further strengthens the representations that short paths provide to subsequent paths.
    Abstract Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull down the performance. Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths. We also demonstrate that our path combination strategies can help ViTs go deeper and act as high-pass filters to filter out partial low-frequency signals. To further enhance the representation of paths served for subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for more future research to explain and design ViTs from new perspectives.
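
The re-weighting step can be illustrated with a small sketch: given the class logits produced by each unrolled path, a learnable per-path scale weights the ensemble before summation, and pruning a path simply drops it from the sum. The module name, dummy logits, and pruned indices below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EnsembleScale(nn.Module):
    def __init__(self, num_paths):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_paths, 1, 1))  # one learnable weight per path

    def forward(self, path_logits):            # path_logits: (num_paths, B, num_classes)
        return (self.scale * path_logits).sum(dim=0)

num_paths, batch, classes = 7, 4, 1000
path_logits = torch.randn(num_paths, batch, classes)     # dummy per-path classifier outputs
head = EnsembleScale(num_paths)
final_logits = head(path_logits)                          # re-weighted ensemble prediction

# "Path pruning": drop under-performing paths (indices chosen for illustration only)
keep = [0, 2, 3, 5, 6]
pruned_logits = (head.scale[keep] * path_logits[keep]).sum(dim=0)
print(final_logits.shape, pruned_logits.shape)
```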

SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.06531
  • repo_url: https://github.com/aim-uofa/segprompt
  • paper_authors: Muzhi Zhu, Hengtao Li, Hao Chen, Chengxiang Fan, Weian Mao, Chenchen Jing, Yifan Liu, Chunhua Shen
  • for: improving the ability of closed-set instance segmentation models to detect objects of unknown categories.
  • methods: a training mechanism, SegPrompt, that uses category information to improve the model's class-agnostic segmentation of both known and unknown categories.
  • results: on the new open-world benchmark, SegPrompt improves overall and unseen detection performance by 5.6% and 6.1% in AR without affecting inference efficiency; it also yields 5.5% and 12.3% relative improvements in the existing cross-dataset transfer and strongly supervised settings.
    Abstract Current closed-set instance segmentation models rely on pre-defined class labels for each mask during training and evaluation, largely limiting their ability to detect novel objects. Open-world instance segmentation (OWIS) models address this challenge by detecting unknown objects in a class-agnostic manner. However, previous OWIS approaches completely erase category information during training to keep the model's ability to generalize to unknown objects. In this work, we propose a novel training mechanism termed SegPrompt that uses category information to improve the model's class-agnostic segmentation ability for both known and unknown categories. In addition, the previous OWIS training setting exposes the unknown classes to the training set and brings information leakage, which is unreasonable in the real world. Therefore, we provide a new open-world benchmark closer to a real-world scenario by dividing the dataset classes into known-seen-unseen parts. For the first time, we focus on the model's ability to discover objects that never appear in the training set images. Experiments show that SegPrompt can improve the overall and unseen detection performance by 5.6% and 6.1% in AR on our new benchmark without affecting the inference efficiency. We further demonstrate the effectiveness of our method on existing cross-dataset transfer and strongly supervised settings, leading to 5.5% and 12.3% relative improvement.

BEV-DG: Cross-Modal Learning under Bird’s-Eye View for Domain Generalization of 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.06530
  • repo_url: None
  • paper_authors: Miaoyu Li, Yachao Zhang, Xu MA, Yanyun Qu, Yun Fu
  • for: domain generalization of 3D semantic segmentation, i.e., making predictions in new domains that cannot be accessed during training.
  • methods: a cross-modal learning architecture under bird's-eye view with higher fault tolerance to point-level misalignment and greater robustness, enabling domain-generalized prediction.
  • results: evaluated on three domain generalization settings built from three different 3D datasets, BEV-DG shows a clear performance advantage in all settings, outperforming existing competitors by roughly 10%.
    Abstract Cross-modal Unsupervised Domain Adaptation (UDA) aims to exploit the complementarity of 2D-3D data to overcome the lack of annotation in a new domain. However, UDA methods rely on access to the target domain during training, meaning the trained model only works in a specific target domain. In light of this, we propose cross-modal learning under bird's-eye view for Domain Generalization (DG) of 3D semantic segmentation, called BEV-DG. DG is more challenging because the model cannot access the target domain during training, meaning it needs to rely on cross-modal learning to alleviate the domain gap. Since 3D semantic segmentation requires the classification of each point, existing cross-modal learning is directly conducted point-to-point, which is sensitive to the misalignment in projections between pixels and points. To this end, our approach aims to optimize domain-irrelevant representation modeling with the aid of cross-modal learning under bird's-eye view. We propose BEV-based Area-to-area Fusion (BAF) to conduct cross-modal learning under bird's-eye view, which has a higher fault tolerance for point-level misalignment. Furthermore, to model domain-irrelevant representations, we propose BEV-driven Domain Contrastive Learning (BDCL) with the help of cross-modal learning under bird's-eye view. We design three domain generalization settings based on three 3D datasets, and BEV-DG significantly outperforms state-of-the-art competitors with tremendous margins in all settings.

Seed Feature Maps-based CNN Models for LEO Satellite Remote Sensing Services

  • paper_url: http://arxiv.org/abs/2308.06515
  • repo_url: None
  • paper_authors: Zhichao Lu, Chuntao Ding, Shangguang Wang, Ran Cheng, Felix Juefei-Xu, Vishnu Naresh Boddeti
  • for: a ground-station server-assisted framework that enables high-performance convolutional neural network models for rapid remote sensing image processing on low-earth orbit (LEO) satellites.
  • methods: each layer of the CNN contains only one learnable feature map (the seed feature map), from which the other feature maps are generated by specific rules; the rule hyperparameters are randomly generated rather than trained and can be saved as a few random seeds, making it practical for the ground-station server to update the CNN deployed on the LEO satellite.
  • results: on the ISPRS Vaihingen, ISPRS Potsdam, UAVid, and LoveDA datasets, the framework outperforms existing state-of-the-art methods; in particular, the SineFM-based model achieves a higher mIoU than UNetFormer on UAVid with 3.3x fewer parameters and 2.2x fewer FLOPs.
    Abstract Deploying high-performance convolutional neural network (CNN) models on low-earth orbit (LEO) satellites for rapid remote sensing image processing has attracted significant interest from industry and academia. However, the limited resources available on LEO satellites contrast with the demands of resource-intensive CNN models, necessitating the adoption of ground-station server assistance for training and updating these models. Existing approaches often require large floating-point operations (FLOPs) and substantial model parameter transmissions, presenting considerable challenges. To address these issues, this paper introduces a ground-station server-assisted framework. With the proposed framework, each layer of the CNN model contains only one learnable feature map (called the seed feature map) from which other feature maps are generated based on specific rules. The hyperparameters of these rules are randomly generated instead of being trained, thus enabling the generation of multiple feature maps from the seed feature map and significantly reducing FLOPs. Furthermore, since the random hyperparameters can be saved using a few random seeds, the ground station server assistance can be facilitated in updating the CNN model deployed on the LEO satellite. Experimental results on the ISPRS Vaihingen, ISPRS Potsdam, UAVid, and LoveDA datasets for semantic segmentation services demonstrate that the proposed framework outperforms existing state-of-the-art approaches. In particular, the SineFM-based model achieves a higher mIoU than the UNetFormer on the UAVid dataset, with 3.3x fewer parameters and 2.2x fewer FLOPs.
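
A hedged sketch of the seed-feature-map idea: only one channel per layer is learnable, while the remaining channels come from fixed rules whose random hyperparameters are reproducible from a single integer seed, so only the learnable weights plus the seed need to be transmitted. The sinusoidal rule below is an assumption suggested by the "SineFM" name, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SeedFeatureConv(nn.Module):
    def __init__(self, in_ch, out_ch, rng_seed=0):
        super().__init__()
        self.seed_conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)   # learnable seed feature map
        g = torch.Generator().manual_seed(rng_seed)                       # reproducible hyperparameters
        self.register_buffer("freq", torch.rand(out_ch - 1, 1, 1, generator=g) * 3.0 + 0.5)
        self.register_buffer("phase", torch.rand(out_ch - 1, 1, 1, generator=g) * 6.28)

    def forward(self, x):
        seed = self.seed_conv(x)                                   # (B, 1, H, W)
        derived = torch.sin(self.freq * seed + self.phase)         # (B, out_ch-1, H, W), no extra params
        return torch.cat([seed, derived], dim=1)

layer = SeedFeatureConv(in_ch=3, out_ch=64, rng_seed=42)
y = layer(torch.randn(2, 3, 128, 128))
print(y.shape, sum(p.numel() for p in layer.parameters()))         # only the seed conv is learnable
```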

Out-of-distribution multi-view auto-encoders for prostate cancer lesion detection

  • paper_url: http://arxiv.org/abs/2308.06481
  • repo_url: None
  • paper_authors: Alvaro Fernandez-Quilez, Linas Vidziunas, Ørjan Kløvfjell Thoresen, Ketil Oppedal, Svein Reidar Kjosavik, Trygve Eftestøl
  • for: proposes an out-of-distribution (OOD) detection approach for medical imaging that exploits multiple streams of different T2w directions to improve the accuracy of prostate cancer lesion detection.
  • methods: an unsupervised OOD detection framework with a multi-stream design that accommodates different T2w MRI directions for prostate cancer lesion detection.
  • results: on a publicly available dataset, the multi-stream approach improves detection over a single-direction approach, raising the AUC from 73.1 to 82.3.
    Abstract Traditional deep learning (DL) approaches based on supervised learning paradigms require large amounts of annotated data that are rarely available in the medical domain. Unsupervised Out-of-distribution (OOD) detection is an alternative that requires less annotated data. Further, OOD applications exploit the class skewness commonly present in medical data. Magnetic resonance imaging (MRI) has proven to be useful for prostate cancer (PCa) diagnosis and management, but current DL approaches rely on T2w axial MRI, which suffers from low out-of-plane resolution. We propose a multi-stream approach to accommodate different T2w directions to improve the performance of PCa lesion detection in an OOD approach. We evaluate our approach on a publicly available data-set, obtaining better detection results in terms of AUC when compared to a single direction approach (73.1 vs 82.3). Our results show the potential of OOD approaches for PCa lesion detection based on MRI.
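
A minimal sketch of OOD-style lesion scoring with an auto-encoder, under the assumptions that a slice auto-encoder is trained only on lesion-free data and lesions are flagged by high reconstruction error; a multi-stream variant would average the scores of per-direction T2w models. The data here is random and stands in for MRI slices.

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

class SliceAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 1, 2, stride=2))
    def forward(self, x):
        return self.dec(self.enc(x))

ae = SliceAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
healthy = torch.rand(64, 1, 64, 64)                       # in-distribution (lesion-free) slices, dummy
for _ in range(5):                                        # short reconstruction training
    recon = ae(healthy)
    loss = (recon - healthy).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

test = torch.rand(32, 1, 64, 64)                          # mixed test slices, dummy
labels = torch.randint(0, 2, (32,))                       # 1 = lesion slice (placeholder labels)
with torch.no_grad():
    err = (ae(test) - test).pow(2).mean(dim=(1, 2, 3))    # per-slice OOD score
print("AUC:", roc_auc_score(labels.numpy(), err.numpy()))
```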

Leveraging multi-view data without annotations for prostate MRI segmentation: A contrastive approach

  • paper_url: http://arxiv.org/abs/2308.06477
  • repo_url: None
  • paper_authors: Tim Nikolass Lindeijer, Tord Martin Ytredal, Trygve Eftestøl, Tobias Nordström, Fredrik Jäderling, Martin Eklund, Alvaro Fernandez-Quilez
  • for: improving the accuracy and robustness of automatic prostate segmentation by leveraging multi-view MRI data without annotations through contrastive learning.
  • methods: a triplet encoder and single decoder network based on U-Net, tU-Net (triplet U-Net), which exploits non-annotated sagittal and coronal views via a multi-view contrastive loss to improve segmentation from a volumetric perspective.
  • results: tU-Net shows a statistically significant improvement in dice score coefficient over the axial-only baseline (91.25+-0.52% vs 86.40+-1.50%, P<.001) and good external volumetric generalization when tested with multi-view data.
    Abstract An accurate prostate delineation and volume characterization can support the clinical assessment of prostate cancer. A large amount of automatic prostate segmentation tools consider exclusively the axial MRI direction in spite of the availability as per acquisition protocols of multi-view data. Further, when multi-view data is exploited, manual annotations and availability at test time for all the views is commonly assumed. In this work, we explore a contrastive approach at training time to leverage multi-view data without annotations and provide flexibility at deployment time in the event of missing views. We propose a triplet encoder and single decoder network based on U-Net, tU-Net (triplet U-Net). Our proposed architecture is able to exploit non-annotated sagittal and coronal views via contrastive learning to improve the segmentation from a volumetric perspective. For that purpose, we introduce the concept of inter-view similarity in the latent space. To guide the training, we combine a dice score loss calculated with respect to the axial view and its manual annotations together with a multi-view contrastive loss. tU-Net shows statistical improvement in dice score coefficient (DSC) with respect to only axial view (91.25+-0.52% compared to 86.40+-1.50%,P<.001). Sensitivity analysis reveals the volumetric positive impact of the contrastive loss when paired with tU-Net (2.85+-1.34% compared to 3.81+-1.88%,P<.001). Further, our approach shows good external volumetric generalization in an in-house dataset when tested with multi-view data (2.76+-1.89% compared to 3.92+-3.31%,P=.002), showing the feasibility of exploiting non-annotated multi-view data through contrastive learning whilst providing flexibility at deployment in the event of missing views.
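
A hedged sketch of how the combined objective could be assembled: a soft dice loss on the annotated axial view plus an inter-view similarity term that pulls the latent codes of the axial, sagittal, and coronal views of the same volume together. The exact loss forms and the 0.1 weighting are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred, target, eps=1e-6):
    # pred: (B, 1, D, H, W) probabilities, target: same shape, binary
    inter = (pred * target).sum(dim=(1, 2, 3, 4))
    denom = pred.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def inter_view_loss(z_axial, z_sagittal, z_coronal):
    # z_*: (B, C) latent codes of the three views of the same volume
    za, zs, zc = (F.normalize(z, dim=1) for z in (z_axial, z_sagittal, z_coronal))
    return 1.0 - 0.5 * ((za * zs).sum(1).mean() + (za * zc).sum(1).mean())

pred = torch.rand(2, 1, 16, 64, 64)                       # dummy axial segmentation probabilities
target = (torch.rand(2, 1, 16, 64, 64) > 0.5).float()     # dummy axial annotations
z_ax, z_sag, z_cor = (torch.randn(2, 128) for _ in range(3))

loss = soft_dice_loss(pred, target) + 0.1 * inter_view_loss(z_ax, z_sag, z_cor)  # weight assumed
print(loss.item())
```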

Tiny and Efficient Model for the Edge Detection Generalization

  • paper_url: http://arxiv.org/abs/2308.06468
  • repo_url: https://github.com/xavysp/teed
  • paper_authors: Xavier Soria, Yachuan Li, Mohammad Rouhani, Angel D. Sappa
  • for: addresses edge detection in computer vision with three objectives: simplicity, efficiency, and generalization.
  • methods: Tiny and Efficient Edge Detector (TEED), a light convolutional neural network with only 58K parameters, far fewer than state-of-the-art models; training on the BIPED dataset takes less than 30 minutes, with each epoch requiring less than 5 minutes.
  • results: the model is easy to train, converges within the first few epochs, and predicts crisp, high-quality edge maps; a new dataset is also proposed for testing the generalization of edge detection models.
    Abstract Most high-level computer vision tasks rely on low-level image operations as their initial processes. Operations such as edge detection, image enhancement, and super-resolution, provide the foundations for higher level image analysis. In this work we address the edge detection considering three main objectives: simplicity, efficiency, and generalization since current state-of-the-art (SOTA) edge detection models are increased in complexity for better accuracy. To achieve this, we present Tiny and Efficient Edge Detector (TEED), a light convolutional neural network with only $58K$ parameters, less than $0.2$% of the state-of-the-art models. Training on the BIPED dataset takes $less than 30 minutes$, with each epoch requiring $less than 5 minutes$. Our proposed model is easy to train and it quickly converges within very first few epochs, while the predicted edge-maps are crisp and of high quality. Additionally, we propose a new dataset to test the generalization of edge detection, which comprises samples from popular images used in edge detection and image segmentation. The source code is available in https://github.com/xavysp/TEED.

Improved YOLOv8 Detection Algorithm in Security Inspection Image

  • paper_url: http://arxiv.org/abs/2308.06452
  • repo_url: None
  • paper_authors: Liyao Lu
  • for: addresses overlapping detection objects, false detection of contraband, and missed detections in X-ray security inspection images.
  • methods: CSS-YOLO, an improved X-ray contraband detection algorithm based on YOLOv8s.
  • results: experiments show that CSS-YOLO improves detection accuracy and reduces false-positive and missed-detection rates, improving security inspection performance.
    Abstract Security inspection is the first line of defense to ensure the safety of people's lives and property, and intelligent security inspection is an inevitable trend in the future development of the security inspection industry. Aiming at the problems of overlapping detection objects, false detection of contraband, and missed detection in the process of X-ray image detection, an improved X-ray contraband detection algorithm CSS-YOLO based on YOLOv8s is proposed.

TongueSAM: An Universal Tongue Segmentation Model Based on SAM with Zero-Shot

  • paper_url: http://arxiv.org/abs/2308.06444
  • repo_url: https://github.com/cshan-github/tonguesam
  • paper_authors: Shan Cao, Qunsheng Ruan, Qingfeng Wu
  • for: a universal tongue segmentation model that addresses the mediocre performance of existing methods on tongue images that differ from the training set.
  • methods: applies SAM (Segment Anything Model), a large-scale pretrained interactive segmentation model with strong zero-shot generalization, to tongue segmentation, and integrates an object-detection-based Prompt Generator to form an end-to-end automated segmentation pipeline.
  • results: TongueSAM achieves exceptional performance across various tongue segmentation datasets, particularly under zero-shot, and can be applied directly to other datasets without fine-tuning; to the authors' knowledge, this is the first application of a large-scale pretrained model to tongue segmentation. The project and pretrained model are released at: https://github.com/cshan-github/TongueSAM.
    Abstract Tongue segmentation serves as the primary step in automated TCM tongue diagnosis, which plays a significant role in the diagnostic results. Currently, numerous deep learning based methods have achieved promising results. However, most of these methods exhibit mediocre performance on tongues different from the training set. To address this issue, this paper proposes a universal tongue segmentation model named TongueSAM based on SAM (Segment Anything Model). SAM is a large-scale pretrained interactive segmentation model known for its powerful zero-shot generalization capability. Applying SAM to tongue segmentation enables the segmentation of various types of tongue images with zero-shot. In this study, a Prompt Generator based on object detection is integrated into SAM to enable an end-to-end automated tongue segmentation method. Experiments demonstrate that TongueSAM achieves exceptional performance across various of tongue segmentation datasets, particularly under zero-shot. TongueSAM can be directly applied to other datasets without fine-tuning. As far as we know, this is the first application of large-scale pretrained model for tongue segmentation. The project and pretrained model of TongueSAM be publiced in :https://github.com/cshan-github/TongueSAM.
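
A sketch of the box-prompted SAM step, assuming the public `segment_anything` package is installed and a ViT-H checkpoint has been downloaded; the checkpoint path, image file, and bounding box are placeholders, and in the actual pipeline the box would come from the detection-based Prompt Generator rather than being hard-coded.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")   # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("tongue.jpg"), cv2.COLOR_BGR2RGB)      # placeholder tongue image
predictor.set_image(image)

tongue_box = np.array([120, 200, 520, 640])          # xyxy box from the Prompt Generator (dummy values)
masks, scores, _ = predictor.predict(box=tongue_box, multimask_output=False)
print(masks.shape, scores)                            # (1, H, W) binary tongue mask and its score
```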

Distributionally Robust Optimization and Invariant Representation Learning for Addressing Subgroup Underrepresentation: Mechanisms and Limitations

  • paper_url: http://arxiv.org/abs/2308.06434
  • repo_url: None
  • paper_authors: Nilesh Kumar, Ruby Shrestha, Zhiyuan Li, Linwei Wang
  • for: addresses spurious correlation caused by subgroup underrepresentation in medical image classification by exploring the use of robust optimization to learn invariant representations.
  • methods: a novel approach that leverages robust optimization to facilitate the learning of invariant representations, evaluated through a comprehensive study.
  • results: the proposed approach improves classifier performance on underrepresented subgroups while maintaining high average and worst-group performance, compared to existing methods such as generalized reweighting and naive invariant representation learning.
    Abstract Spurious correlation caused by subgroup underrepresentation has received increasing attention as a source of bias that can be perpetuated by deep neural networks (DNNs). Distributionally robust optimization has shown success in addressing this bias, although the underlying working mechanism mostly relies on upweighting under-performing samples as surrogates for those underrepresented in data. At the same time, while invariant representation learning has been a powerful choice for removing nuisance-sensitive features, it has been little considered in settings where spurious correlations are caused by significant underrepresentation of subgroups. In this paper, we take the first step to better understand and improve the mechanisms for debiasing spurious correlation due to subgroup underrepresentation in medical image classification. Through a comprehensive evaluation study, we first show that 1) generalized reweighting of under-performing samples can be problematic when bias is not the only cause for poor performance, while 2) naive invariant representation learning suffers from spurious correlations itself. We then present a novel approach that leverages robust optimization to facilitate the learning of invariant representations at the presence of spurious correlations. Finetuned classifiers utilizing such representation demonstrated improved abilities to reduce subgroup performance disparity, while maintaining high average and worst-group performance.
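
A rough sketch of distributionally robust optimization over subgroups in the spirit of group DRO: per-group losses are tracked and groups with higher loss receive exponentially larger weight. The model, data, group labels, and step size below are toy placeholders; the paper combines this kind of robust optimization with invariant representation learning, which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(32, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
num_groups, eta = 4, 0.1
group_weights = torch.ones(num_groups) / num_groups

for step in range(100):
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    g = torch.randint(0, num_groups, (64,))                  # subgroup labels (e.g. scanner, demographic)
    losses = F.cross_entropy(model(x), y, reduction="none")
    group_losses = torch.stack([losses[g == k].mean() if (g == k).any()
                                else torch.tensor(0.0) for k in range(num_groups)])
    with torch.no_grad():                                    # upweight the worst-performing groups
        group_weights = group_weights * torch.exp(eta * group_losses)
        group_weights = group_weights / group_weights.sum()
    robust_loss = (group_weights * group_losses).sum()
    opt.zero_grad(); robust_loss.backward(); opt.step()
print(group_weights)
```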

Learn Single-horizon Disease Evolution for Predictive Generation of Post-therapeutic Neovascular Age-related Macular Degeneration

  • paper_url: http://arxiv.org/abs/2308.06432
  • repo_url: None
  • paper_authors: Yuhan Zhang, Kun Huang, Mingchao Li, Songtao Yuan, Qiang Chen
  • for: predicting the disease evolution and post-therapeutic outcome of neovascular age-related macular degeneration (nAMD).
  • methods: a single-horizon disease evolution network (SHENet) composed of a feature encoder, a graph evolution module, and a feature decoder, with adversarial training to ensure effective disease-evolution learning.
  • results: compared with other generative methods, SHENet generates SD-OCT images of the highest quality while achieving the best structure preservation and content prediction; qualitative evaluations also show a better visual effect.
    Abstract Most of the existing disease prediction methods in the field of medical image processing fall into two classes, namely image-to-category predictions and image-to-parameter predictions. Few works have focused on image-to-image predictions. Different from multi-horizon predictions in other fields, ophthalmologists prefer to show more confidence in single-horizon predictions due to the low tolerance of predictive risk. We propose a single-horizon disease evolution network (SHENet) to predictively generate post-therapeutic SD-OCT images by inputting pre-therapeutic SD-OCT images with neovascular age-related macular degeneration (nAMD). In SHENet, a feature encoder converts the input SD-OCT images to deep features, then a graph evolution module predicts the process of disease evolution in high-dimensional latent space and outputs the predicted deep features, and lastly, feature decoder recovers the predicted deep features to SD-OCT images. We further propose an evolution reinforcement module to ensure the effectiveness of disease evolution learning and obtain realistic SD-OCT images by adversarial training. SHENet is validated on 383 SD-OCT cubes of 22 nAMD patients based on three well-designed schemes based on the quantitative and qualitative evaluations. Compared with other generative methods, the generative SD-OCT images of SHENet have the highest image quality. Besides, SHENet achieves the best structure protection and content prediction. Qualitative evaluations also demonstrate that SHENet has a better visual effect than other methods. SHENet can generate post-therapeutic SD-OCT images with both high prediction performance and good image quality, which has great potential to help ophthalmologists forecast the therapeutic effect of nAMD.

M&M: Tackling False Positives in Mammography with a Multi-view and Multi-instance Learning Sparse Detector

  • paper_url: http://arxiv.org/abs/2308.06420
  • repo_url: None
  • paper_authors: Yen Nhi Truong Vu, Dan Guo, Ahmed Taha, Jason Su, Thomas Paul Matthews
  • for: improving detection performance in screening mammography while reducing false positives.
  • methods: a Sparse R-CNN-based detector with a multi-view cross-attention module and multi-instance learning (MIL) for breast-level classification.
  • results: improved detection and classification performance, with comprehensive ablation studies demonstrating the effectiveness of each proposed component.
    Abstract Deep-learning-based object detection methods show promise for improving screening mammography, but high rates of false positives can hinder their effectiveness in clinical practice. To reduce false positives, we identify three challenges: (1) unlike natural images, a malignant mammogram typically contains only one malignant finding; (2) mammography exams contain two views of each breast, and both views ought to be considered to make a correct assessment; (3) most mammograms are negative and do not contain any findings. In this work, we tackle the three aforementioned challenges by: (1) leveraging Sparse R-CNN and showing that sparse detectors are more appropriate than dense detectors for mammography; (2) including a multi-view cross-attention module to synthesize information from different views; (3) incorporating multi-instance learning (MIL) to train with unannotated images and perform breast-level classification. The resulting model, M&M, is a Multi-view and Multi-instance learning system that can both localize malignant findings and provide breast-level predictions. We validate M&M's detection and classification performance using five mammography datasets. In addition, we demonstrate the effectiveness of each proposed component through comprehensive ablation studies.
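
A hedged sketch of the multi-instance-learning idea: a breast-level score is obtained by max-pooling the per-box malignancy scores from both views, so the model can be trained on exams that only carry a breast-level label. The scores below are dummy detector outputs; the real system uses Sparse R-CNN proposals and a multi-view cross-attention module that is not shown.

```python
import torch
import torch.nn.functional as F

def breast_level_loss(cc_scores, mlo_scores, breast_label):
    # cc_scores / mlo_scores: (N_boxes,) per-proposal malignancy logits for the two views
    all_scores = torch.cat([cc_scores, mlo_scores])
    breast_logit = all_scores.max()                      # MIL: a breast is positive if any box is
    return F.binary_cross_entropy_with_logits(breast_logit, breast_label)

cc = torch.randn(300, requires_grad=True)                # dummy proposal logits, CC view
mlo = torch.randn(300, requires_grad=True)               # dummy proposal logits, MLO view
label = torch.tensor(1.0)                                # breast-level ground truth (malignant)
loss = breast_level_loss(cc, mlo, label)
loss.backward()
print(loss.item())
```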

Improving Pseudo Labels for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2308.06412
  • repo_url: None
  • paper_authors: Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas
  • for: improving the quality of pseudo labels (PLs) generated by pretrained vision-language models (VLMs) for open-vocabulary object detection (OVD).
  • methods: online Self-training And a Split-and-fusion head for OVD (SAS-Det): self-training fine-tunes the VLM to generate high-quality PLs without forgetting pretrained knowledge, while the split-and-fusion (SAF) head removes localization noise in the PLs and fuses complementary knowledge learned from precise ground truth and noisy pseudo labels.
  • results: achieves 37.4 AP$_{50}$ and 27.3 AP$_r$ on novel categories of the COCO and LVIS benchmarks, outperforming prior state-of-the-art models of the same scale, with pseudo labeling that is 3 times faster than prior methods.
    Abstract Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances on PLs. In this paper, we aim to reduce the noise in PLs and propose a method called online Self-training And a Split-and-fusion head for OVD (SAS-Det). First, the self-training finetunes VLMs to generate high quality PLs while prevents forgetting the knowledge learned in the pretraining. Second, a split-and-fusion (SAF) head is designed to remove the noise in localization of PLs, which is usually ignored in existing methods. It also fuses complementary knowledge learned from both precise ground truth and noisy pseudo labels to boost the performance. Extensive experiments demonstrate SAS-Det is both efficient and effective. Our pseudo labeling is 3 times faster than prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin and achieves 37.4 AP$_{50}$ and 27.3 AP$_r$ on novel categories of the COCO and LVIS benchmarks, respectively.

Detecting and Preventing Hallucinations in Large Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.06394
  • repo_url: None
  • paper_authors: Anisha Gunjal, Jihan Yin, Erhan Bas
  • for: addresses hallucinations in instruction-tuned large vision language models (LVLMs) for visual question answering (VQA).
  • methods: introduces M-HalDetect, a dataset of 16,000 fine-grained annotations on VQA examples for training and benchmarking hallucination detection and prevention, and proposes a novel optimization method, Fine-grained Direct Preference Optimization (FDPO), to reduce hallucinations in LVLMs.
  • results: human evaluation shows that FDPO and reward-model rejection sampling reduce hallucination rates in InstructBLIP by 41% and 55%, respectively; the reward model also generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57%, and correlates strongly with human-evaluated accuracy scores.
    Abstract Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still contain a staggering 30 percent of the hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a (M)ultimodal (Hal)lucination (Detect)ion Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only consider object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has strong correlation with human evaluated accuracy scores.
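
The best-of-n rejection sampling mentioned above can be illustrated with a short sketch; `generate_response` stands in for sampling from an LVLM such as InstructBLIP and `reward_model` stands in for the fine-grained hallucination reward model, both implemented here as dummy functions.

```python
import random

def generate_response(image, question, temperature=0.7):
    # Placeholder: a real system would sample a caption/answer from the LVLM.
    return f"candidate answer {random.randint(0, 9999)}"

def reward_model(image, question, response):
    # Placeholder: a real reward model scores how well-grounded the response is.
    return random.random()

def best_of_n(image, question, n=8):
    candidates = [generate_response(image, question) for _ in range(n)]
    scored = [(reward_model(image, question, c), c) for c in candidates]
    return max(scored, key=lambda t: t[0])[1]   # keep the candidate the reward model trusts most

print(best_of_n(image="img.jpg", question="What objects are on the table?", n=8))
```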

R2S100K: Road-Region Segmentation Dataset For Semi-Supervised Autonomous Driving in the Wild

  • paper_url: http://arxiv.org/abs/2308.06393
  • repo_url: None
  • paper_authors: Muhammad Atif Butt, Hassan Ali, Adnan Qayyum, Waqas Sultani, Ala Al-Fuqaha, Junaid Qadir
  • for: providing a large-scale, diverse road-region segmentation dataset covering challenging unstructured roadways to better support the development of autonomous driving.
  • methods: an Efficient Data Sampling (EDS) based self-training framework that improves learning by leveraging unlabeled data, combined with semi-supervised learning methods.
  • results: experiments show that the proposed method significantly improves the generalizability of semantic segmentation methods while reducing labeling cost.
    Abstract Semantic understanding of roadways is a key enabling factor for safe autonomous driving. However, existing autonomous driving datasets provide well-structured urban roads while ignoring unstructured roadways containing distress, potholes, water puddles, and various kinds of road patches i.e., earthen, gravel etc. To this end, we introduce Road Region Segmentation dataset (R2S100K) -- a large-scale dataset and benchmark for training and evaluation of road segmentation in aforementioned challenging unstructured roadways. R2S100K comprises 100K images extracted from a large and diverse set of video sequences covering more than 1000 KM of roadways. Out of these 100K privacy respecting images, 14,000 images have fine pixel-labeling of road regions, with 86,000 unlabeled images that can be leveraged through semi-supervised learning methods. Alongside, we present an Efficient Data Sampling (EDS) based self-training framework to improve learning by leveraging unlabeled data. Our experimental results demonstrate that the proposed method significantly improves learning methods in generalizability and reduces the labeling cost for semantic segmentation tasks. Our benchmark will be publicly available to facilitate future research at https://r2s100k.github.io/.

U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds

  • paper_url: http://arxiv.org/abs/2308.06383
  • repo_url: https://github.com/zhangcyg/u-red
  • paper_authors: Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, Federico Tombari
  • for: an unsupervised shape retrieval and deformation pipeline that retrieves geometrically similar CAD models from a pre-established database and deforms them to tightly match a target object observation.
  • methods: a novel point-wise residual-guided metric for noise-robust shape comparison, and projection of all possible full shapes of a partial target onto the surface of a unit sphere to handle the one-to-many relationship of partial observations.
  • results: on the synthetic PartNet and ComplementMe datasets and the real-world Scan2CAD dataset, U-RED surpasses prior state-of-the-art methods by 47.3%, 16.7%, and 31.6%, respectively, under Chamfer Distance.
    Abstract In this paper, we propose U-RED, an Unsupervised shape REtrieval and Deformation pipeline that takes an arbitrary object observation as input, typically captured by RGB images or scans, and jointly retrieves and deforms the geometrically similar CAD models from a pre-established database to tightly match the target. Considering existing methods typically fail to handle noisy partial observations, U-RED is designed to address this issue from two aspects. First, since one partial shape may correspond to multiple potential full shapes, the retrieval method must allow such an ambiguous one-to-many relationship. Thereby U-RED learns to project all possible full shapes of a partial target onto the surface of a unit sphere. Then during inference, each sampling on the sphere will yield a feasible retrieval. Second, since real-world partial observations usually contain noticeable noise, a reliable learned metric that measures the similarity between shapes is necessary for stable retrieval. In U-RED, we design a novel point-wise residual-guided metric that allows noise-robust comparison. Extensive experiments on the synthetic datasets PartNet, ComplementMe and the real-world dataset Scan2CAD demonstrate that U-RED surpasses existing state-of-the-art approaches by 47.3%, 16.7% and 31.6% respectively under Chamfer Distance.
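
For reference, the Chamfer Distance used as the evaluation metric above can be written out as a small brute-force implementation (O(N*M) for clarity; real pipelines typically use KD-trees or CUDA kernels, and the point clouds below are random stand-ins).

```python
import torch

def chamfer_distance(p1, p2):
    # p1: (N, 3), p2: (M, 3) point clouds
    d = torch.cdist(p1, p2)                       # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

retrieved = torch.rand(2048, 3)                   # deformed CAD model, sampled to points (dummy)
target = torch.rand(1024, 3)                      # (partial) target observation (dummy)
print(chamfer_distance(retrieved, target).item())
```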

CATS v2: Hybrid encoders for robust medical segmentation

  • paper_url: http://arxiv.org/abs/2308.06377
  • repo_url: https://github.com/haoli12345/cats
  • paper_authors: Hao Li, Han Liu, Dewei Hu, Xing Yao, Jiacheng Wang, Ipek Oguz
  • for: extends the CATS U-shaped segmentation network to improve the precision and semantic accuracy of medical image segmentation.
  • methods: CATS v2 with hybrid encoders, consisting of a CNN-based encoder path in parallel with a shifted-window transformer path, fused at skip connections of different resolutions to leverage both local and global information.
  • results: evaluated on two public challenge datasets (CrossMoDA and MSD-5) for segmenting vestibular schwannoma (VS) and the prostate, CATS v2 achieves higher Dice scores than previous methods.
    Abstract Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks by capturing high-level (local) information, such as edges and textures. However, due to the limited field of view of convolution kernel, it is hard for CNNs to fully represent global information. Recently, transformers have shown good performance for medical image segmentation due to their ability to better model long-range dependencies. Nevertheless, transformers struggle to capture high-level spatial features as effectively as CNNs. A good segmentation model should learn a better representation from local and global features to be both precise and semantically accurate. In our previous work, we proposed CATS, which is a U-shaped segmentation network augmented with transformer encoder. In this work, we further extend this model and propose CATS v2 with hybrid encoders. Specifically, hybrid encoders consist of a CNN-based encoder path paralleled to a transformer path with a shifted window, which better leverage both local and global information to produce robust 3D medical image segmentation. We fuse the information from the convolutional encoder and the transformer at the skip connections of different resolutions to form the final segmentation. The proposed method is evaluated on two public challenge datasets: Cross-Modality Domain Adaptation (CrossMoDA) and task 5 of Medical Segmentation Decathlon (MSD-5), to segment vestibular schwannoma (VS) and prostate, respectively. Compared with the state-of-the-art methods, our approach demonstrates superior performance in terms of higher Dice scores.
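
A rough sketch of the hybrid-encoder idea: features from a convolutional stage and from a parallel transformer stage at the same resolution are fused (here by concatenation plus a 1x1 convolution) before feeding the decoder skip connection. The 2D shapes and the use of a plain TransformerEncoderLayer are simplifications; the paper uses a shifted-window transformer on 3D volumes.

```python
import torch
import torch.nn as nn

class HybridStage(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.transformer = nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, C, H, W)
        local_feat = self.cnn(x)                       # CNN path: local detail
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C) for the transformer path
        global_feat = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))  # fused skip feature

stage = HybridStage(channels=64)
skip = stage(torch.randn(2, 64, 16, 16))
print(skip.shape)                                       # (2, 64, 16, 16), passed to the decoder
```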

Surrogate Model for Geological CO2 Storage and Its Use in MCMC-based History Matching

  • paper_url: http://arxiv.org/abs/2308.06341
  • repo_url: None
  • paper_authors: Yifu Han, Francois P. Hamon, Su Jiang, Louis J. Durlofsky
  • for: targets an important application in geological carbon storage operations: history matching of storage systems characterized by a high degree of prior geological uncertainty.
  • methods: The authors extend a recently introduced recurrent R-U-Net surrogate model to treat geomodel realizations drawn from a wide range of geological scenarios, using flow simulation results and a Markov chain Monte Carlo history matching workflow.
  • results: The surrogate model provides accurate predictions for new realizations over the full range of geological scenarios, with median relative error of 1.3% in pressure and 4.5% in saturation. The incorporation of the surrogate model into the history matching workflow reduces geological uncertainty and leads to posterior 3D pressure and saturation fields that display much closer agreement with the true-model responses than prior predictions.
    Abstract Deep-learning-based surrogate models show great promise for use in geological carbon storage operations. In this work we target an important application - the history matching of storage systems characterized by a high degree of (prior) geological uncertainty. Toward this goal, we extend the recently introduced recurrent R-U-Net surrogate model to treat geomodel realizations drawn from a wide range of geological scenarios. These scenarios are defined by a set of metaparameters, which include the mean and standard deviation of log-permeability, permeability anisotropy ratio, horizontal correlation length, etc. An infinite number of realizations can be generated for each set of metaparameters, so the range of prior uncertainty is large. The surrogate model is trained with flow simulation results, generated using the open-source simulator GEOS, for 2000 random realizations. The flow problems involve four wells, each injecting 1 Mt CO2/year, for 30 years. The trained surrogate model is shown to provide accurate predictions for new realizations over the full range of geological scenarios, with median relative error of 1.3% in pressure and 4.5% in saturation. The surrogate model is incorporated into a Markov chain Monte Carlo history matching workflow, where the goal is to generate history matched realizations and posterior estimates of the metaparameters. We show that, using observed data from monitoring wells in synthetic `true' models, geological uncertainty is reduced substantially. This leads to posterior 3D pressure and saturation fields that display much closer agreement with the true-model responses than do prior predictions.
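
A minimal sketch of the history-matching idea: a random-walk Metropolis sampler over the geological metaparameters, with a surrogate supplying predicted monitoring-well data for the likelihood. The surrogate, the observed data, the prior box, and the noise level are all stand-ins, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_predict(theta):
    # Placeholder for the recurrent R-U-Net surrogate: metaparameters -> predicted
    # pressure/saturation observations at the monitoring wells.
    return np.array([theta[0] + 0.5 * theta[1], theta[0] * theta[1]])

d_obs = np.array([1.2, 0.3])                      # "observed" monitoring data (synthetic stand-in)
sigma = 0.05                                      # assumed observation noise std

def log_posterior(theta):
    if np.any(theta < 0) or np.any(theta > 2):    # uniform prior box on metaparameters (assumed)
        return -np.inf
    misfit = surrogate_predict(theta) - d_obs
    return -0.5 * np.sum((misfit / sigma) ** 2)

theta = np.array([1.0, 1.0])
logp = log_posterior(theta)
samples = []
for it in range(5000):
    prop = theta + 0.05 * rng.standard_normal(2)  # random-walk proposal
    logp_prop = log_posterior(prop)
    if np.log(rng.random()) < logp_prop - logp:   # Metropolis accept/reject
        theta, logp = prop, logp_prop
    samples.append(theta.copy())
print("posterior mean metaparameters:", np.mean(samples[1000:], axis=0))
```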

Deep Learning-Based Open Source Toolkit for Eosinophil Detection in Pediatric Eosinophilic Esophagitis

  • paper_url: http://arxiv.org/abs/2308.06333
  • repo_url: https://github.com/hrlblab/open-eoe
  • paper_authors: Juming Xiong, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Yuankai Huo
  • for: developing an open-source toolkit for automated detection of eosinophils in whole slide images for the diagnosis of eosinophilic esophagitis.
  • methods: the toolkit uses deep learning-based object detection models and ensemble learning to improve the accuracy and reliability of eosinophil detection.
  • results: tested on a set of 289 whole slide images, the toolkit achieved an accuracy of 91% in detecting eosinophils at the widely accepted threshold of >= 15 per high power field for diagnosing eosinophilic esophagitis.
    Abstract Eosinophilic Esophagitis (EoE) is a chronic, immune/antigen-mediated esophageal disease, characterized by symptoms related to esophageal dysfunction and histological evidence of eosinophil-dominant inflammation. Owing to the intricate microscopic representation of EoE in imaging, current methodologies which depend on manual identification are not only labor-intensive but also prone to inaccuracies. In this study, we develop an open-source toolkit, named Open-EoE, to perform end-to-end whole slide image (WSI) level eosinophil (Eos) detection using one line of command via Docker. Specifically, the toolkit supports three state-of-the-art deep learning-based object detection models. Furthermore, Open-EoE further optimizes the performance by implementing an ensemble learning strategy, and enhancing the precision and reliability of our results. The experimental results demonstrated that the Open-EoE toolkit can efficiently detect Eos on a testing set with 289 WSIs. At the widely accepted threshold of >= 15 Eos per high power field (HPF) for diagnosing EoE, the Open-EoE achieved an accuracy of 91%, showing decent consistency with pathologist evaluations. This suggests a promising avenue for integrating machine learning methodologies into the diagnostic process for EoE. The docker and source code has been made publicly available at https://github.com/hrlblab/Open-EoE.
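
The diagnostic rule referenced above can be written out explicitly: after the detector ensemble has produced eosinophil boxes for each high-power field (HPF), a slide meets the EoE criterion when the peak count reaches the widely used threshold of >= 15 eosinophils per HPF. The counts below are dummy values, not toolkit output.

```python
def eoe_positive(eos_counts_per_hpf, threshold=15):
    """eos_counts_per_hpf: list of detected eosinophil counts, one per HPF."""
    return max(eos_counts_per_hpf) >= threshold

counts = [3, 7, 18, 11, 2]                    # dummy per-HPF detections from the ensemble
print("EoE criteria met:", eoe_positive(counts))   # True (peak count 18 >= 15)
```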

Revolutionizing Space Health (Swin-FSR): Advancing Super-Resolution of Fundus Images for SANS Visual Assessment Technology

  • paper_url: http://arxiv.org/abs/2308.06332
  • repo_url: https://github.com/FarihaHossain/SwinFSR
  • paper_authors: Khondker Fariha Hossain, Sharif Amit Kamran, Joshua Ong, Andrew G. Lee, Alireza Tavakkoli
  • for: a Swin Transformer-based fundus image super-resolution model that addresses the need to compress imaging data for transfer in remote and spaceflight settings, including SANS assessment.
  • methods: Swin-FSR, which combines the Swin Transformer with spatial and depth-wise attention for fundus image super-resolution.
  • results: achieves peak signal-to-noise ratio (PSNR) of 47.89, 49.00, and 45.32 on three public datasets (iChallenge-AMD, iChallenge-PM, and G1020), and comparable results on a privately held SANS dataset provided by NASA.
    Abstract The rapid accessibility of portable and affordable retinal imaging devices has made early differential diagnosis easier. For example, color funduscopy imaging is readily available in remote villages, which can help to identify diseases like age-related macular degeneration (AMD), glaucoma, or pathological myopia (PM). On the other hand, astronauts at the International Space Station utilize this camera for identifying spaceflight-associated neuro-ocular syndrome (SANS). However, due to the unavailability of experts in these locations, the data has to be transferred to an urban healthcare facility (AMD and glaucoma) or a terrestrial station (e.g, SANS) for more precise disease identification. Moreover, due to low bandwidth limits, the imaging data has to be compressed for transfer between these two places. Different super-resolution algorithms have been proposed throughout the years to address this. Furthermore, with the advent of deep learning, the field has advanced so much that x2 and x4 compressed images can be decompressed to their original form without losing spatial information. In this paper, we introduce a novel model called Swin-FSR that utilizes Swin Transformer with spatial and depth-wise attention for fundus image super-resolution. Our architecture achieves Peak signal-to-noise-ratio (PSNR) of 47.89, 49.00 and 45.32 on three public datasets, namely iChallenge-AMD, iChallenge-PM, and G1020. Additionally, we tested the model's effectiveness on a privately held dataset for SANS provided by NASA and achieved comparable results against previous architectures.
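
For reference, the PSNR figure of merit quoted above can be computed as follows (assuming images scaled to [0, 1]; the inputs here are random stand-ins for a super-resolved fundus image and its ground truth).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

sr = np.clip(np.random.rand(512, 512, 3), 0, 1)                   # super-resolved image (dummy)
hr = np.clip(sr + 0.01 * np.random.randn(512, 512, 3), 0, 1)      # ground-truth high-res image (dummy)
print(f"PSNR: {psnr(sr, hr):.2f} dB")
```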

A Hierarchical Descriptor Framework for On-the-Fly Anatomical Location Matching between Longitudinal Studies

  • paper_url: http://arxiv.org/abs/2308.07337
  • repo_url: None
  • paper_authors: Halid Ziya Yerebakan, Yoshihisa Shinagawa, Mahesh Ranganath, Simon Allen-Raffl, Gerardo Hermosillo Valadez
  • for: matching anatomical locations between pairs of medical images in longitudinal comparisons.
  • methods: computes a descriptor of the query point via hierarchical sparse sampling of image intensities in the source image, then uses a hierarchical search to find the point with the most similar descriptor in the target image.
  • results: reduces the computation time for mapping points to the millisecond scale on a single CPU, allowing radiologists to compare corresponding anatomical locations in near real-time without extra architectural cost for precomputing or storing deformation fields from registrations.
    Abstract We propose a method to match anatomical locations between pairs of medical images in longitudinal comparisons. The matching is made possible by computing a descriptor of the query point in a source image based on a hierarchical sparse sampling of image intensities that encode the location information. Then, a hierarchical search operation finds the corresponding point with the most similar descriptor in the target image. This simple yet powerful strategy reduces the computational time of mapping points to a millisecond scale on a single CPU. Thus, radiologists can compare similar anatomical locations in near real-time without requiring extra architectural costs for precomputing or storing deformation fields from registrations. Our algorithm does not require prior training, resampling, segmentation, or affine transformation steps. We have tested our algorithm on the recently published Deep Lesion Tracking dataset annotations. We observed more accurate matching compared to Deep Lesion Tracker while being 24 times faster than the most precise algorithm reported therein. We also investigated the matching accuracy on CT and MR modalities and compared the proposed algorithm's accuracy against ground truth consolidated from multiple radiologists.
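A minimal NumPy sketch of the matching idea, assuming the descriptor is built by sparsely sampling intensities at several radii around a point and that matching proceeds coarse-to-fine by greedy descent over descriptor distance; the sampling pattern, radii, and search schedule are illustrative choices, not the paper's exact design.

```python
import numpy as np

OFFSETS = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])  # sparse sampling directions (assumed)
RADII = (1, 2, 4, 8, 16)                                 # hierarchical scales (assumed)


def descriptor(img: np.ndarray, pt: np.ndarray) -> np.ndarray:
    """Hierarchical sparse descriptor: intensities sampled around `pt` at several radii."""
    h, w = img.shape
    samples = [img[pt[0], pt[1]]]
    for r in RADII:
        for off in OFFSETS:
            y, x = np.clip(pt + r * off, [0, 0], [h - 1, w - 1])
            samples.append(img[y, x])
    return np.asarray(samples, dtype=np.float32)


def match(target: np.ndarray, query_desc: np.ndarray, start: np.ndarray,
          steps=(16, 8, 4, 2, 1)) -> np.ndarray:
    """Greedy coarse-to-fine search for the point whose descriptor best matches `query_desc`."""
    h, w = target.shape
    best = start.copy()
    best_dist = np.linalg.norm(descriptor(target, best) - query_desc)
    for step in steps:
        improved = True
        while improved:
            improved = False
            for off in OFFSETS:
                cand = np.clip(best + step * off, [0, 0], [h - 1, w - 1])
                dist = np.linalg.norm(descriptor(target, cand) - query_desc)
                if dist < best_dist:
                    best, best_dist, improved = cand, dist, True
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.random((128, 128)).astype(np.float32)
    target = np.roll(source, shift=(3, -2), axis=(0, 1))  # simulated follow-up study
    query = np.array([64, 64])
    print("matched point:", match(target, descriptor(source, query), start=query))
```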

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

  • paper_url: http://arxiv.org/abs/2308.06248
  • repo_url: https://github.com/visinf/funnybirds
  • paper_authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth
  • for: explaining the inner workings of complex deep neural network models in the field of explainable AI (XAI).
  • methods: introduces a novel synthetic vision dataset, FunnyBirds, together with automatic evaluation protocols, to address the lack of ground-truth explanations in XAI.
  • results: using the FunnyBirds dataset and the automatic evaluation protocols, reports results for 24 combinations of neural models and XAI methods, demonstrating their strengths and weaknesses in a fully automatic and systematic manner.
    Abstract The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner.
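The core evaluation idea, estimating ground-truth part importances by removing individual parts and projecting pixel-level attributions into the same part space, can be sketched as follows in PyTorch; the `part_masks` dictionary, the zero-filling used to "remove" a part, and the toy classifier are assumptions for illustration, not the benchmark's actual protocol.

```python
import torch
import torch.nn as nn


def part_deletion_importance(model: nn.Module, image: torch.Tensor,
                             part_masks: dict, target_class: int) -> dict:
    """Estimated part importance: drop in the target-class probability when each
    part is removed (here approximated by zero-filling) from the input image."""
    model.eval()
    with torch.no_grad():
        base = model(image.unsqueeze(0)).softmax(dim=-1)[0, target_class]
        importances = {}
        for name, mask in part_masks.items():   # mask: 1 inside the part, 0 elsewhere
            occluded = image * (1.0 - mask)
            prob = model(occluded.unsqueeze(0)).softmax(dim=-1)[0, target_class]
            importances[name] = (base - prob).item()
    return importances


def attribution_to_part_importance(attribution: torch.Tensor, part_masks: dict) -> dict:
    """Map a pixel-level explanation into the common space of part importances
    by summing the attribution mass inside each part mask."""
    return {name: float((attribution * mask).sum()) for name, mask in part_masks.items()}


if __name__ == "__main__":
    # Tiny stand-in classifier and a random image with two hypothetical parts.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    image = torch.rand(3, 32, 32)
    masks = {"beak": torch.zeros(1, 32, 32), "wing": torch.zeros(1, 32, 32)}
    masks["beak"][:, :8, :8] = 1.0
    masks["wing"][:, 16:, 16:] = 1.0
    print(part_deletion_importance(model, image, masks, target_class=3))
    print(attribution_to_part_importance(torch.rand(1, 32, 32), masks))
```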

Continual Face Forgery Detection via Historical Distribution Preserving

  • paper_url: http://arxiv.org/abs/2308.06217
  • repo_url: None
  • paper_authors: Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
  • for: defending against the security threats posed by face forgery attacks
  • methods: uses universal adversarial perturbation, knowledge distillation, and historical distribution preservation
  • results: detects new forgery attacks more effectively than previous methods while maintaining stability over earlier face forgery distributions
    Abstract Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors.
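To illustrate the two ingredients named in the abstract, a universal adversarial perturbation (UAP) that turns stored real faces into stand-ins for a historical forgery distribution, and knowledge distillation that keeps the adapted model close to the previous one, here is a hedged PyTorch sketch; the attack direction toward the "fake" label, the step sizes, the temperature, and the toy classifier are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def universal_forgery_perturbation(model: nn.Module, real_faces: torch.Tensor,
                                   eps: float = 8 / 255, steps: int = 10,
                                   lr: float = 1 / 255) -> torch.Tensor:
    """One shared perturbation pushing real faces toward the 'fake' class (label 1),
    acting as a cheap surrogate for previously seen forgeries."""
    model.eval()
    uap = torch.zeros_like(real_faces[0], requires_grad=True)
    fake_labels = torch.ones(real_faces.size(0), dtype=torch.long)
    for _ in range(steps):
        logits = model((real_faces + uap).clamp(0.0, 1.0))
        loss = F.cross_entropy(logits, fake_labels)
        loss.backward()
        with torch.no_grad():
            uap -= lr * uap.grad.sign()   # descend so the batch looks 'fake'
            uap.clamp_(-eps, eps)
        uap.grad.zero_()
    return uap.detach()


def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL distillation keeping the adapted model's predictions close to the old model's."""
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=1),
                    F.softmax(teacher_logits / t, dim=1),
                    reduction="batchmean") * t * t


if __name__ == "__main__":
    net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))  # toy real/fake classifier
    faces = torch.rand(8, 3, 64, 64)
    uap = universal_forgery_perturbation(net, faces)
    print("max |uap|:", uap.abs().max().item())
    print("distill loss:", distillation_loss(net(faces), net(faces).detach()).item())
```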