cs.CV - 2023-10-21

Zero-shot Learning of Individualized Task Contrast Prediction from Resting-state Functional Connectomes

  • paper_url: http://arxiv.org/abs/2310.14105
  • repo_url: None
  • paper_authors: Minh Nguyen, Gia H. Ngo, Mert R. Sabuncu
  • for: Using resting-state functional MRI (rsfMRI) scans to predict subject-specific task-evoked activity
  • methods: A machine learning (ML) model trained on paired resting-state and task-evoked fMRI scans, with group-average contrasts as an additional input
  • results: Predicts activity for novel tasks unseen during training, and is competitive with a state-of-the-art model's in-domain predictions
    Abstract Given sufficient pairs of resting-state and task-evoked fMRI scans from subjects, it is possible to train ML models to predict subject-specific task-evoked activity using resting-state functional MRI (rsfMRI) scans. However, while rsfMRI scans are relatively easy to collect, obtaining sufficient task fMRI scans is much harder as it involves more complex experimental designs and procedures. Thus, the reliance on scarce paired data limits the application of current techniques to only tasks seen during training. We show that this reliance can be reduced by leveraging group-average contrasts, enabling zero-shot predictions for novel tasks. Our approach, named OPIC (short for Omni-Task Prediction of Individual Contrasts), takes as input a subject's rsfMRI-derived connectome and a group-average contrast, to produce a prediction of the subject-specific contrast. Similar to zero-shot learning in large language models using special inputs to obtain answers for novel natural language processing tasks, inputting group-average contrasts guides the OPIC model to generalize to novel tasks unseen in training. Experimental results show that OPIC's predictions for novel tasks are not only better than simple group-averages, but are also competitive with a state-of-the-art model's in-domain predictions that was trained using in-domain tasks' data.
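The interface OPIC describes is simple: a connectome plus a group-average contrast map in, a subject-specific contrast map out. Below is a minimal sketch of that interface; the parcel count, layer sizes, and MLP architecture are illustrative placeholders, not the authors' published model.

```python
import torch
import torch.nn as nn

class OPICSketch(nn.Module):
    """Toy stand-in for OPIC's interface: rsfMRI connectome + group-average
    contrast in, subject-specific contrast out. Dimensions and layers are
    placeholders, not the published architecture."""
    def __init__(self, n_parcels=400, hidden=256):
        super().__init__()
        # Per parcel: its connectome row (n_parcels) plus the group contrast value.
        self.mlp = nn.Sequential(
            nn.Linear(n_parcels + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, connectome, group_contrast):
        # connectome: (B, P, P); group_contrast: (B, P) -> prediction (B, P)
        feats = torch.cat([connectome, group_contrast.unsqueeze(-1)], dim=-1)
        return self.mlp(feats).squeeze(-1)

model = OPICSketch()
pred = model(torch.randn(2, 400, 400), torch.randn(2, 400))  # (2, 400)
```

Zero-shot use follows directly: at inference, swap in the group-average contrast of a task never seen in training.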

Unleashing Modified Deep Learning Models in Efficient COVID19 Detection

  • paper_url: http://arxiv.org/abs/2310.14081
  • repo_url: None
  • paper_authors: Md Aminul Islam, Shabbir Ahmed Shuvo, Mohammad Abu Tareq Rony, M Raihan, Md Abu Sufian
  • for: Improving COVID-19 prediction and detection accuracy, helping healthcare systems, policymakers, and researchers make better-informed decisions to reduce the impact of COVID-19 and other contagious diseases
  • methods: Deep learning models, specifically MobileNet V3, DenseNet201, and GoogleNet Inception V1, combined with loss optimization and scalable batch normalization to improve model performance and robustness
  • results: MobileNet V3 (97.872%), DenseNet201 (97.567%), and GoogleNet Inception V1 (97.643%) achieve high predictive accuracy, which is further improved by combining loss optimization with scalable batch normalization
    Abstract The COVID19 pandemic, a unique and devastating respiratory disease outbreak, has affected global populations as the disease spreads rapidly. Recent Deep Learning breakthroughs may improve COVID19 prediction and forecasting as a tool of precise and fast detection, however, current methods are still being examined to achieve higher accuracy and precision. This study analyzed the collection contained 8055 CT image samples, 5427 of which were COVID cases and 2628 non COVID. The 9544 Xray samples included 4044 COVID patients and 5500 non COVID cases. The most accurate models are MobileNet V3 (97.872 percent), DenseNet201 (97.567 percent), and GoogleNet Inception V1 (97.643 percent). High accuracy indicates that these models can make many accurate predictions, as well as others, are also high for MobileNetV3 and DenseNet201. An extensive evaluation using accuracy, precision, and recall allows a comprehensive comparison to improve predictive models by combining loss optimization with scalable batch normalization in this study. Our analysis shows that these tactics improve model performance and resilience for advancing COVID19 prediction and detection and shows how Deep Learning can improve disease handling. The methods we suggest would strengthen healthcare systems, policymakers, and researchers to make educated decisions to reduce COVID19 and other contagious diseases. CCS CONCEPTS Covid,Deep Learning, Image Processing KEYWORDS Covid, Deep Learning, DenseNet201, MobileNet, ResNet, DenseNet, GoogleNet, Image Processing, Disease Detection.
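The backbones named in the study are standard torchvision models; a common fine-tuning setup is to swap the classifier head for a 2-class (COVID vs. non-COVID) output. A minimal sketch, assuming a recent torchvision version (the head layout is torchvision-specific, so verify the indexing against your installed version):

```python
import torch.nn as nn
from torchvision import models

# MobileNet V3-Large with a 2-class head (COVID vs. non-COVID).
mobilenet = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)
in_feats = mobilenet.classifier[-1].in_features       # final Linear layer
mobilenet.classifier[-1] = nn.Linear(in_feats, 2)

# Same pattern for DenseNet201, whose head is a single Linear layer.
densenet = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
densenet.classifier = nn.Linear(densenet.classifier.in_features, 2)
```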

Concept-based Anomaly Detection in Retail Stores for Automatic Correction using Mobile Robots

  • paper_url: http://arxiv.org/abs/2310.14063
  • repo_url: None
  • paper_authors: Aditya Kapoor, Vartika Sengar, Nijil George, Vighnesh Vatsal, Jayavardhana Gubbi, Balamuralidhar P, Arpan Pal
  • for: Detecting misplaced and missing items on retail shelves with a concept-based anomaly detection method built on a Vision Transformer (ViT), without a prior knowledge base such as a planogram
  • methods: An auto-encoder architecture followed by outlier detection in the latent space
  • results: A peak success rate of 89.90% on retail-object anomaly detection sets drawn from the RP2K dataset, versus 80.81% for the best-performing baseline, a standard ViT auto-encoder
    Abstract Tracking of inventory and rearrangement of misplaced items are some of the most labor-intensive tasks in a retail environment. While there have been attempts at using vision-based techniques for these tasks, they mostly use planogram compliance for detection of any anomalies, a technique that has been found lacking in robustness and scalability. Moreover, existing systems rely on human intervention to perform corrective actions after detection. In this paper, we present Co-AD, a Concept-based Anomaly Detection approach using a Vision Transformer (ViT) that is able to flag misplaced objects without using a prior knowledge base such as a planogram. It uses an auto-encoder architecture followed by outlier detection in the latent space. Co-AD has a peak success rate of 89.90% on anomaly detection image sets of retail objects drawn from the RP2K dataset, compared to 80.81% on the best-performing baseline of a standard ViT auto-encoder. To demonstrate its utility, we describe a robotic mobile manipulation pipeline to autonomously correct the anomalies flagged by Co-AD. This work is ultimately aimed towards developing autonomous mobile robot solutions that reduce the need for human intervention in retail store management.
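Co-AD's second stage is outlier detection in the auto-encoder's latent space. The paper's exact detector isn't spelled out in the abstract, so the sketch below uses Mahalanobis distance from the training-latent distribution as a common stand-in; the encoder and the threshold `tau` are assumed to come from elsewhere in the pipeline.

```python
import numpy as np

def fit_latent_stats(train_latents):
    # train_latents: (N, D) encoder outputs for normal (correctly placed) items.
    mu = train_latents.mean(axis=0)
    cov = np.cov(train_latents, rowvar=False) + 1e-6 * np.eye(train_latents.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_scores(latents, mu, cov_inv):
    # Mahalanobis distance of each latent from the normal-data distribution.
    d = latents - mu
    return np.sqrt(np.einsum('nd,dk,nk->n', d, cov_inv, d))

# flagged = anomaly_scores(z, mu, cov_inv) > tau   # tau tuned on validation data
```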

Training Image Derivatives: Increased Accuracy and Universal Robustness

  • paper_url: http://arxiv.org/abs/2310.14045
  • repo_url: None
  • paper_authors: Vsevolod I. Avrutskiy
  • for: The image-analysis task of reconstructing the vertices of a cube from its image
  • methods: Derivative training: the forward pass computes not only the output values but also their derivatives, and the deviations from the target derivatives are added to the cost function, which a gradient-based algorithm minimizes with respect to the weights
  • results: Training the derivatives with respect to the cube's 6 degrees of freedom yields 25 times more accurate results for noiseless inputs; the derivatives also enable a first-order robustness analysis that unifies two types of network vulnerability and removes the usual accuracy-robustness trade-off
    Abstract Derivative training is a well-known method to improve the accuracy of neural networks. In the forward pass, not only the output values are computed, but also their derivatives, and their deviations from the target derivatives are included in the cost function, which is minimized with respect to the weights by a gradient-based algorithm. So far, this method has been implemented for relatively low-dimensional tasks. In this study, we apply the approach to the problem of image analysis. We consider the task of reconstructing the vertices of a cube based on its image. By training the derivatives with respect to the 6 degrees of freedom of the cube, we obtain 25 times more accurate results for noiseless inputs. The derivatives also provide important insights into the robustness problem, which is currently understood in terms of two types of network vulnerabilities. The first type is small perturbations that dramatically change the output, and the second type is substantial image changes that the network erroneously ignores. They are currently considered as conflicting goals, since conventional training methods produce a trade-off. The first type can be analyzed via the gradient of the network, but the second type requires human evaluation of the inputs, which is an oracle substitute. For the task at hand, the nearest neighbor oracle can be defined, and the knowledge of derivatives allows it to be expanded into Taylor series. This allows to perform the first-order robustness analysis that unifies both types of vulnerabilities, and to implement robust training that eliminates any trade-offs, so that accuracy and robustness are limited only by network capacity.
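The core mechanism is adding an output-derivative error term to the loss. A minimal PyTorch sketch follows, with a toy network mapping the cube's 6 pose DoF directly to its 8 x 3 vertex coordinates (the paper's network takes an image, with derivatives taken with respect to the pose parameters that generate it); the Jacobian is built with `create_graph=True` so the derivative mismatch itself backpropagates into the weights.

```python
import torch
import torch.nn as nn

# Toy stand-in: map 6 pose DoF straight to 24 vertex coordinates (8 corners x 3).
net = nn.Sequential(nn.Linear(6, 64), nn.Tanh(), nn.Linear(64, 24))

def derivative_training_loss(p, y_true, J_true):
    p = p.clone().requires_grad_(True)            # (B, 6) pose parameters
    y = net(p)                                    # (B, 24) predicted vertices
    value_loss = ((y - y_true) ** 2).mean()
    rows = []
    for k in range(y.shape[1]):
        # dy_k/dp per batch element; create_graph keeps it differentiable
        # w.r.t. the weights so the derivative loss can be trained.
        g, = torch.autograd.grad(y[:, k].sum(), p, create_graph=True)
        rows.append(g)                            # (B, 6) each
    J = torch.stack(rows, dim=1)                  # (B, 24, 6)
    return value_loss + ((J - J_true) ** 2).mean()
```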

You Only Condense Once: Two Rules for Pruning Condensed Datasets

  • paper_url: http://arxiv.org/abs/2310.14019
  • repo_url: None
  • paper_authors: Yang He, Lingao Xiao, Joey Tianyi Zhou
  • for: Improving training efficiency under on-device resource constraints by flexibly resizing condensed datasets
  • methods: Two embarrassingly simple dataset pruning rules applied on top of one condensed dataset: Low LBPE Score and Balanced Construction
  • results: On ConvNet, ResNet, and DenseNet with CIFAR-10, CIFAR-100, and ImageNet, accuracy gains of 6.98-8.89% over dataset condensation methods and 6.31-23.92% over dataset pruning methods (CIFAR-10, ten Images Per Class)
    Abstract Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size of the training dataset, particularly in on-device scenarios. However, these scenarios have two significant challenges: 1) the varying computational resources available on the devices require a dataset size different from the pre-defined condensed dataset, and 2) the limited computational resources often preclude the possibility of conducting additional condensation processes. We introduce You Only Condense Once (YOCO) to overcome these limitations. On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules: Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10, CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class (IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The code is available at: https://github.com/he-y/you-only-condense-once.
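The two rules reduce to a short selection routine: within each class, keep the lowest-scoring examples. The LBPE score itself is defined in the paper, so the per-sample `scores` array below is a stand-in for it; everything else is a direct reading of the two rules.

```python
import numpy as np

def yoco_prune(scores, labels, keep_per_class):
    """Shrink a condensed set with the two YOCO rules: within each class
    (Balanced Construction), keep the examples with the lowest scores
    (Low LBPE Score). `scores` stands in for the paper's LBPE metric."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order = idx[np.argsort(scores[idx])]      # ascending: lowest score first
        keep.extend(order[:keep_per_class])
    return np.array(sorted(keep))

# Example: shrink a 10-IPC condensed CIFAR-10 set down to 5 IPC,
# with no further condensation run needed.
# subset = yoco_prune(lbpe_scores, labels, keep_per_class=5)
```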

Ophthalmic Biomarker Detection Using Ensembled Vision Transformers – Winning Solution to IEEE SPS VIP Cup 2023

  • paper_url: http://arxiv.org/abs/2310.14005
  • repo_url: None
  • paper_authors: H. A. Z. Sameen Shahgir, Khondker Salman Sayeed, Tanjeem Azwad Zaman, Md. Asif Haider, Sheikh Saifur Rahman Jony, M. Sohel Rahman
  • for: The IEEE SPS VIP Cup 2023: Ophthalmic Biomarker Detection competition, whose primary objective was identifying biomarkers from Optical Coherence Tomography (OCT) images obtained from a diverse range of patients
  • methods: Two vision-transformer-based models, MaxViT and EVA-02, trained with robust augmentations and 5-fold cross-validation and ensembled at inference time; MaxViT's convolution layers followed by strided attention suit local features, while EVA-02's normal attention mechanism and knowledge distillation suit global features
  • results: A patient-wise F1 score of 0.814 in the first phase and 0.8527 in the second and final phase of VIP Cup 2023, 3.8% higher than the next-best solution
    Abstract This report outlines our approach in the IEEE SPS VIP Cup 2023: Ophthalmic Biomarker Detection competition. Our primary objective in this competition was to identify biomarkers from Optical Coherence Tomography (OCT) images obtained from a diverse range of patients. Using robust augmentations and 5-fold cross-validation, we trained two vision transformer-based models: MaxViT and EVA-02, and ensembled them at inference time. We find MaxViT's use of convolution layers followed by strided attention to be better suited for the detection of local features while EVA-02's use of normal attention mechanism and knowledge distillation is better for detecting global features. Ours was the best-performing solution in the competition, achieving a patient-wise F1 score of 0.814 in the first phase and 0.8527 in the second and final phase of VIP Cup 2023, scoring 3.8% higher than the next-best solution.
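The inference-time ensemble amounts to averaging the two backbones' per-biomarker probabilities. A hedged sketch using timm: the model identifiers and the 6-biomarker multi-label head are illustrative, not the exact checkpoints or head used by the winning entry.

```python
import timm
import torch

# Two pretrained backbones with a multi-label head (6 biomarkers, illustrative).
max_vit = timm.create_model('maxvit_base_tf_224', pretrained=True, num_classes=6)
eva02 = timm.create_model('eva02_base_patch14_224', pretrained=True, num_classes=6)
max_vit.eval()
eva02.eval()

@torch.no_grad()
def ensemble_predict(x):
    p1 = torch.sigmoid(max_vit(x))     # per-biomarker probabilities
    p2 = torch.sigmoid(eva02(x))
    return (p1 + p2) / 2 > 0.5         # averaged, then thresholded
```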

Bi-discriminator Domain Adversarial Neural Networks with Class-Level Gradient Alignment

  • paper_url: http://arxiv.org/abs/2310.13959
  • repo_url: None
  • paper_authors: Chuang Zhao, Hongke Zhao, Hengshu Zhu, Zhenya Huang, Nan Feng, Enhong Chen, Hui Xiong
  • for: Unsupervised domain adaptation: transferring rich knowledge from an annotated source domain to an unlabeled target domain that shares the same label space
  • methods: A bi-discriminator domain adversarial network with class-level gradient alignment (BACG), using gradient signals and second-order probability estimation; an optimizable nearest-neighbor algorithm produces target-domain pseudo-labels, and a Multinomial Dirichlet hierarchical model infers class probabilities and sample uncertainty
  • results: Extensive experiments and theoretical analysis on four benchmark datasets validate the method's effectiveness and robustness; a memory-bank-based variant, Fast-BACG, greatly shortens training at the cost of a minor accuracy decrease
    Abstract Unsupervised domain adaptation aims to transfer rich knowledge from the annotated source domain to the unlabeled target domain with the same label space. One prevalent solution is the bi-discriminator domain adversarial network, which strives to identify target domain samples outside the support of the source domain distribution and enforces their classification to be consistent on both discriminators. Despite being effective, agnostic accuracy and overconfident estimation for out-of-distribution samples hinder its further performance improvement. To address the above challenges, we propose a novel bi-discriminator domain adversarial neural network with class-level gradient alignment, i.e. BACG. BACG resorts to gradient signals and second-order probability estimation for better alignment of domain distributions. Specifically, for accuracy-awareness, we first design an optimizable nearest neighbor algorithm to obtain pseudo-labels of samples in the target domain, and then enforce the backward gradient approximation of the two discriminators at the class level. Furthermore, following evidential learning theory, we transform the traditional softmax-based optimization method into a Multinomial Dirichlet hierarchical model to infer the class probability distribution as well as samples uncertainty, thereby alleviating misestimation of out-of-distribution samples and guaranteeing high-quality classes alignment. In addition, inspired by contrastive learning, we develop a memory bank-based variant, i.e. Fast-BACG, which can greatly shorten the training process at the cost of a minor decrease in accuracy. Extensive experiments and detailed theoretical analysis on four benchmark data sets validate the effectiveness and robustness of our algorithm.
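The Dirichlet piece of BACG follows the standard evidential-learning construction: non-negative evidence parameterizes a Dirichlet over class probabilities, and the concentration mass doubles as an uncertainty signal for out-of-distribution samples. A minimal sketch of that output layer, assuming the usual softplus-evidence formulation (the paper's loss terms and hierarchy are omitted):

```python
import torch
import torch.nn.functional as F

def dirichlet_head(logits):
    """Evidential output layer in the style BACG builds on: evidence -> Dirichlet
    concentration -> expected class probabilities plus per-sample uncertainty,
    which helps flag out-of-distribution inputs instead of overconfidently
    misclassifying them."""
    evidence = F.softplus(logits)             # (B, K) non-negative evidence
    alpha = evidence + 1.0                    # Dirichlet concentration
    strength = alpha.sum(dim=1, keepdim=True)
    probs = alpha / strength                  # expected class probabilities
    uncertainty = logits.shape[1] / strength  # K / sum(alpha), in (0, 1]
    return probs, uncertainty.squeeze(1)
```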

Competitive Ensembling Teacher-Student Framework for Semi-Supervised Left Atrium MRI Segmentation

  • paper_url: http://arxiv.org/abs/2310.13955
  • repo_url: None
  • paper_authors: Yuyan Shi, Yichi Zhang, Shasha Wang
  • for: Applying semi-supervised learning to medical image segmentation, specifically segmentation of the Left Atrium (LA)
  • methods: A simple yet efficient teacher-student framework in which two student models with different task-level disturbances learn mutually under the teacher's guidance, with a competitive ensembling strategy feeding the more reliable information into the teacher model
  • results: On the public LA dataset, the method exploits unlabeled data effectively and outperforms several existing semi-supervised methods
    Abstract Semi-supervised learning has greatly advanced medical image segmentation since it effectively alleviates the need of acquiring abundant annotations from experts and utilizes unlabeled data which is much easier to acquire. Among existing perturbed consistency learning methods, mean-teacher model serves as a standard baseline for semi-supervised medical image segmentation. In this paper, we present a simple yet efficient competitive ensembling teacher student framework for semi-supervised for left atrium segmentation from 3D MR images, in which two student models with different task-level disturbances are introduced to learn mutually, while a competitive ensembling strategy is performed to ensemble more reliable information to teacher model. Different from the one-way transfer between teacher and student models, our framework facilitates the collaborative learning procedure of different student models with the guidance of teacher model and motivates different training networks for a competitive learning and ensembling procedure to achieve better performance. We evaluate our proposed method on the public Left Atrium (LA) dataset and it obtains impressive performance gains by exploiting the unlabeled data effectively and outperforms several existing semi-supervised methods.
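The mean-teacher baseline this framework extends maintains the teacher as an exponential moving average (EMA) of student weights. The EMA update below is the standard one; the commented "competitive" step is a schematic reading of the paper's ensembling strategy (which student feeds the teacher is decided by reliability), not its exact criterion.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # Standard mean-teacher update: teacher <- m * teacher + (1 - m) * student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

# Competitive ensembling (schematic): on each step, update the teacher from
# the student currently judged more reliable, e.g. by supervised loss:
# winner = student_a if loss_a < loss_b else student_b
# ema_update(teacher, winner)
```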

Fuzzy-NMS: Improving 3D Object Detection with Fuzzy Classification in NMS

  • paper_url: http://arxiv.org/abs/2310.13951
  • repo_url: None
  • paper_authors: Li Wang, Xinyu Zhang, Fachuan Zhao, Chuze Wu, Yichen Wang, Ziying Song, Lei Yang, Jun Li, Huaping Liu
  • for: Improving 3D object detection accuracy by reducing uncertainty in the non-maximum suppression (NMS) process
  • methods: Introducing fuzzy learning into NMS via a generalized Fuzzy-NMS module that combines the volume and clustering density of candidate bounding boxes and optimizes the suppression thresholds
  • results: Significant accuracy improvements for many recent NMS-based detectors, especially for small objects such as pedestrians and bicycles, with no retraining and no obvious increase in inference time
    Abstract Non-maximum suppression (NMS) is an essential post-processing module used in many 3D object detection frameworks to remove overlapping candidate bounding boxes. However, an overreliance on classification scores and difficulties in determining appropriate thresholds can affect the resulting accuracy directly. To address these issues, we introduce fuzzy learning into NMS and propose a novel generalized Fuzzy-NMS module to achieve finer candidate bounding box filtering. The proposed Fuzzy-NMS module combines the volume and clustering density of candidate bounding boxes, refining them with a fuzzy classification method and optimizing the appropriate suppression thresholds to reduce uncertainty in the NMS process. Adequate validation experiments are conducted using the mainstream KITTI and large-scale Waymo 3D object detection benchmarks. The results of these tests demonstrate the proposed Fuzzy-NMS module can improve the accuracy of numerous recently NMS-based detectors significantly, including PointPillars, PV-RCNN, and IA-SSD, etc. This effect is particularly evident for small objects such as pedestrians and bicycles. As a plug-and-play module, Fuzzy-NMS does not need to be retrained and produces no obvious increases in inference time.
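Structurally, Fuzzy-NMS keeps the greedy NMS loop but makes the suppression threshold box-dependent. The sketch below is a heavily simplified schematic: it uses axis-aligned 2D (BEV-style) IoU, and `fuzzy_adjust` is a placeholder for the paper's fuzzy classification over box volume and clustering density.

```python
import numpy as np

def iou(a, b):
    # Axis-aligned IoU for [x1, y1, x2, y2] boxes (a BEV simplification of 3D).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def adaptive_nms(boxes, scores, base_iou, fuzzy_adjust):
    """Greedy NMS with a per-box IoU threshold. `fuzzy_adjust(box)` stands in
    for the Fuzzy-NMS membership output (derived from box volume and local
    clustering density in the paper); here it simply offsets the base threshold."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        thr = np.clip(base_iou + fuzzy_adjust(boxes[i]), 0.1, 0.9)
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious <= thr]
    return keep
```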

Adversarial Image Generation by Spatial Transformation in Perceptual Colorspaces

  • paper_url: http://arxiv.org/abs/2310.13950
  • repo_url: https://github.com/ayberkydn/stadv-torch
  • paper_authors: Ayberk Aydin, Alptekin Temizel
  • for: A colorspace-based method for generating adversarial examples against deep neural networks in a targeted white-box setting
  • methods: Spatial transformations that shift pixel locations independently in the chrominance channels of perceptual colorspaces such as YCbCr and CIELAB, rather than adding perturbations or manipulating pixel values directly
  • results: Competitive fooling rates with very high confidence in the targeted white-box setting, with favorable approximate perceptual distance between benign and adversarially generated images
    Abstract Deep neural networks are known to be vulnerable to adversarial perturbations. The amount of these perturbations are generally quantified using $L_p$ metrics, such as $L_0$, $L_2$ and $L_\infty$. However, even when the measured perturbations are small, they tend to be noticeable by human observers since $L_p$ distance metrics are not representative of human perception. On the other hand, humans are less sensitive to changes in colorspace. In addition, pixel shifts in a constrained neighborhood are hard to notice. Motivated by these observations, we propose a method that creates adversarial examples by applying spatial transformations, which creates adversarial examples by changing the pixel locations independently to chrominance channels of perceptual colorspaces such as $YC_{b}C_{r}$ and $CIELAB$, instead of making an additive perturbation or manipulating pixel values directly. In a targeted white-box attack setting, the proposed method is able to obtain competitive fooling rates with very high confidence. The experimental evaluations show that the proposed method has favorable results in terms of approximate perceptual distance between benign and adversarially generated images. The source code is publicly available at https://github.com/ayberkydn/stadv-torch
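The mechanics: convert to a perceptual colorspace, warp only the chroma planes with a small flow field, and leave luminance untouched. A sketch of that shape (the linked repo holds the authors' implementation; this is my own minimal version, using the BT.601 RGB-to-YCbCr matrix, with the YCbCr-to-RGB conversion before feeding the classifier omitted for brevity):

```python
import torch
import torch.nn.functional as F

def rgb_to_ycbcr(x):            # x in [0, 1], shape (B, 3, H, W); BT.601 matrix
    r, g, b = x[:, 0], x[:, 1], x[:, 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return torch.stack([y, cb, cr], dim=1)

def warp_chroma(ycbcr, flow):
    """Shift only Cb/Cr pixel locations by a small flow field (B, H, W, 2);
    luminance stays fixed, the perceptually quiet manipulation the paper
    exploits. `flow` is the adversarially optimized variable."""
    B, _, H, W = ycbcr.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing='ij')
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
    chroma = F.grid_sample(ycbcr[:, 1:], base + flow, align_corners=True)
    return torch.cat([ycbcr[:, :1], chroma], dim=1)
```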

Learning Motion Refinement for Unsupervised Face Animation

  • paper_url: http://arxiv.org/abs/2310.13912
  • repo_url: https://github.com/jialetao/mrfa
  • paper_authors: Jiale Tao, Shuhang Gu, Wen Li, Lixin Duan
  • for: Generating a human face video from a source image that mimics the motion of a driving video
  • methods: A new unsupervised face animation approach that learns coarse and finer motions simultaneously: a local affine motion model captures the global coarse facial motion, while a novel motion refinement module, learned from the dense correlation between source and driving images, compensates for finer motions in local areas (e.g., lips and eyes)
  • results: State-of-the-art results in extensive experiments on widely used benchmarks
    Abstract Unsupervised face animation aims to generate a human face video based on the appearance of a source image, mimicking the motion from a driving video. Existing methods typically adopted a prior-based motion model (e.g., the local affine motion model or the local thin-plate-spline motion model). While it is able to capture the coarse facial motion, artifacts can often be observed around the tiny motion in local areas (e.g., lips and eyes), due to the limited ability of these methods to model the finer facial motions. In this work, we design a new unsupervised face animation approach to learn simultaneously the coarse and finer motions. In particular, while exploiting the local affine motion model to learn the global coarse facial motion, we design a novel motion refinement module to compensate for the local affine motion model for modeling finer face motions in local areas. The motion refinement is learned from the dense correlation between the source and driving images. Specifically, we first construct a structure correlation volume based on the keypoint features of the source and driving images. Then, we train a model to generate the tiny facial motions iteratively from low to high resolution. The learned motion refinements are combined with the coarse motion to generate the new image. Extensive experiments on widely used benchmarks demonstrate that our method achieves the best results among state-of-the-art baselines.
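The refinement module is driven by a dense correlation volume between source and driving features. Building an all-pairs correlation volume is a standard construction (familiar from optical-flow networks), sketched below; the paper builds its structure correlation volume specifically from keypoint features, so treat the inputs here as generic stand-ins.

```python
import torch

def correlation_volume(feat_src, feat_drv):
    """All-pairs correlation between source and driving feature maps,
    (B, C, H, W) each -> (B, H, W, H, W): every source location scored
    against every driving location, normalized by feature dimension."""
    B, C, H, W = feat_src.shape
    f1 = feat_src.flatten(2)                   # (B, C, H*W)
    f2 = feat_drv.flatten(2)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / C ** 0.5
    return corr.view(B, H, W, H, W)
```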

Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer

  • paper_url: http://arxiv.org/abs/2310.13906
  • repo_url: None
  • paper_authors: Junwei You, Ying Chen, Zhuoyu Jiang, Zhangchi Liu, Zilin Huang, Yifeng Ding, Bin Ran
  • for: An effective method for classifying autonomous vehicle (AV) driving behavior, to help diagnose AV operation faults, improve autonomous driving algorithms, and reduce accident rates
  • methods: A Gramian Angular Field Vision Transformer (GAF-ViT) model with three key components: a GAF Transformer Module, a Channel Attention Module, and a Multi-Channel ViT Module, which together convert multivariate behavior sequences into multi-channel images and classify them with image recognition techniques
  • results: State-of-the-art performance on trajectory data from the Waymo Open Dataset, with an ablation study substantiating the efficacy of the individual modules
    Abstract Effective classification of autonomous vehicle (AV) driving behavior emerges as a critical area for diagnosing AV operation faults, enhancing autonomous driving algorithms, and reducing accident rates. This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze AV driving behavior. The proposed GAF-ViT model consists of three key components: GAF Transformer Module, Channel Attention Module, and Multi-Channel ViT Module. These modules collectively convert representative sequences of multivariate behavior into multi-channel images and employ image recognition techniques for behavior classification. A channel attention mechanism is applied to multi-channel images to discern the impact of various driving behavior features. Experimental evaluation on the Waymo Open Dataset of trajectories demonstrates that the proposed model achieves state-of-the-art performance. Furthermore, an ablation study effectively substantiates the efficacy of individual modules within the model.
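The Gramian Angular Field transform itself is well defined: rescale each series to [-1, 1], take the angular encoding phi = arccos(x), and form the summation field G[i, j] = cos(phi_i + phi_j). One field per behavior variable stacks into the multi-channel image the ViT then classifies; a minimal implementation:

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field of a 1-D series: rescale to [-1, 1],
    phi = arccos(x), G[i, j] = cos(phi_i + phi_j)."""
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-9) - 1
    x = np.clip(x, -1, 1)                      # guard arccos domain
    phi = np.arccos(x)
    return np.cos(phi[:, None] + phi[None, :])

# Stack one field per behavior variable (speed, heading, ...) into the
# multi-channel image fed to the ViT:
# image = np.stack([gramian_angular_field(v) for v in multivariate_series])
```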

Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.13876
  • repo_url: None
  • paper_authors: Bissmella Bahaduri, Zuheng Ming, Fangchen Feng, Anissa Mokraou
  • for: Improving object detection in remote sensing images, addressing the field's specific challenges: scarce labeled data and small objects in high-resolution images with vast backgrounds
  • methods: A multimodal transformer that fuses multi-source remote sensing data through a cross-channel attention module, which learns the relationships between channels to align modalities at an early stage instead of simply concatenating them; plus a new Swin-transformer-based architecture with convolution layers in non-shifting blocks at fixed dimensions, yielding fine-to-coarse representations with a favorable accuracy-computation trade-off
  • results: Extensive experiments demonstrate the effectiveness of the proposed fusion module and architecture on multimodal aerial imagery
    Abstract Object detection in Remote Sensing Images (RSI) is a critical task for numerous applications in Earth Observation (EO). Unlike general object detection, object detection in RSI has specific challenges: 1) the scarcity of labeled data in RSI compared to general object detection datasets, and 2) the small objects presented in a high-resolution image with a vast background. To address these challenges, we propose a multimodal transformer exploring multi-source remote sensing data for object detection. Instead of directly combining the multimodal input through a channel-wise concatenation, which ignores the heterogeneity of different modalities, we propose a cross-channel attention module. This module learns the relationship between different channels, enabling the construction of a coherent multimodal input by aligning the different modalities at the early stage. We also introduce a new architecture based on the Swin transformer that incorporates convolution layers in non-shifting blocks while maintaining fixed dimensions, allowing for the generation of fine-to-coarse representations with a favorable accuracy-computation trade-off. The extensive experiments prove the effectiveness of the proposed multimodal fusion module and architecture, demonstrating their applicability to multimodal aerial imagery.
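The cross-channel idea can be sketched compactly: treat each channel's spatial map as a token and let self-attention over the channel axis learn how modalities should be weighed and mixed, rather than fixing that mix by concatenation. The module below is my schematic reading of that mechanism, not the paper's implementation; layer sizes are illustrative (`h * w` must be divisible by `heads`).

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Schematic cross-channel attention: one token per channel, so channels
    from different modalities attend to and re-weight each other before the
    fused input enters the detection backbone."""
    def __init__(self, h, w, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=h * w, num_heads=heads,
                                          batch_first=True)

    def forward(self, x):                 # x: (B, C, H, W), C = fused channels
        B, C, H, W = x.shape
        tokens = x.flatten(2)             # (B, C, H*W): one token per channel
        out, _ = self.attn(tokens, tokens, tokens)
        return out.view(B, C, H, W)
```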