cs.CV - 2023-07-19

Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples

  • paper_url: http://arxiv.org/abs/2307.10062
  • repo_url: None
  • paper_authors: JoonHo Lee, Jae Oh Woo, Hankyu Moon, Kwonho Lee
  • for: This work aims to support deployment of deep visual models without source data or labels, addressing the performance drop caused by the mismatch between target and source distributions.
  • methods: Pseudo-labels are used to estimate accuracy on the target domain, with recent source-free domain adaptation algorithms providing the adaptation; adaptive adversarial perturbation is applied to the target model's inputs to reduce the impact of erroneous pseudo-labels arising from a high ideal joint hypothesis risk.
  • results: The proposed source-free framework effectively addresses challenging distribution shift scenarios and outperforms existing methods that require source data and labels for training.
    Abstract Deploying deep visual models can lead to performance drops due to the discrepancies between source and target distributions. Several approaches leverage labeled source data to estimate target domain accuracy, but accessing labeled source data is often prohibitively difficult due to data confidentiality or resource limitations on serving devices. Our work proposes a new framework to estimate model accuracy on unlabeled target data without access to source data. We investigate the feasibility of using pseudo-labels for accuracy estimation and evolve this idea into adopting recent advances in source-free domain adaptation algorithms. Our approach measures the disagreement rate between the source hypothesis and the target pseudo-labeling function, adapted from the source hypothesis. We mitigate the impact of erroneous pseudo-labels that may arise due to a high ideal joint hypothesis risk by employing adaptive adversarial perturbation on the input of the target model. Our proposed source-free framework effectively addresses the challenging distribution shift scenarios and outperforms existing methods requiring source data and labels for training.
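The disagreement-rate idea above can be illustrated with a minimal sketch, assuming a classification head and an already-adapted pseudo-labeling model; the adaptive adversarial perturbation step is omitted, and none of the names below come from the paper.

```python
import torch

@torch.no_grad()
def estimate_accuracy(source_model, adapted_model, target_loader, device="cpu"):
    """Approximate target accuracy of `source_model` as 1 - disagreement rate
    with a pseudo-labeling model adapted from it (no source data or labels)."""
    agree, total = 0, 0
    source_model.eval(); adapted_model.eval()
    for x in target_loader:                        # unlabeled target batches
        x = x.to(device)
        src_pred = source_model(x).argmax(dim=1)   # source hypothesis
        pseudo = adapted_model(x).argmax(dim=1)    # adapted pseudo-labels
        agree += (src_pred == pseudo).sum().item()
        total += x.size(0)
    return agree / max(total, 1)                   # estimated target accuracy
```

In practice the adapted model would first be produced by running a source-free domain adaptation algorithm on the unlabeled target data, as the abstract describes.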

Divert More Attention to Vision-Language Object Tracking

  • paper_url: http://arxiv.org/abs/2307.10046
  • repo_url: https://github.com/JudasDie/SOTS
  • paper_authors: Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan
  • for: This paper proposes a vision-language-based tracking method to address the shortcomings of existing vision-language learning in the tracking domain.
  • methods: A new vision-language representation is proposed, built on a general attribute annotation strategy, an asymmetric architecture search, and a modality mixer (ModaMixer); a contrastive loss is also introduced to align the different modalities.
  • results: The method is evaluated on six tracking benchmarks and yields significant improvements across all baselines; a theoretical analysis is also provided to support its rationale.
    Abstract Multimodal vision-language (VL) learning has noticeably pushed the tendency toward generic intelligence owing to emerging large foundation models. However, tracking, as a fundamental vision problem, surprisingly enjoys less bonus from recent flourishing VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning of current works. These nuisances motivate us to design more effective vision-language representation for tracking, meanwhile constructing a large database with language annotation for model learning. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve VL representation, we introduce a contrastive loss to align different modalities. To thoroughly evidence the effectiveness of our method, we integrate the proposed framework on three tracking methods with different designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the hybrid structure TransT. The experiments demonstrate that our framework can significantly improve all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking with diversified multimodal messages.
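The paper only states that a contrastive loss aligns the modalities; a symmetric InfoNCE over paired visual/language embeddings is one common instantiation and is sketched below (the function name, temperature, and feature shapes are assumptions, not the authors' definition).

```python
import torch
import torch.nn.functional as F

def vl_contrastive_loss(vis_feat, lang_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired visual/language embeddings,
    one common way to 'align different modalities'; the paper's exact loss
    may differ. vis_feat, lang_feat: (B, D) tensors for matched pairs."""
    v = F.normalize(vis_feat, dim=-1)
    l = F.normalize(lang_feat, dim=-1)
    logits = v @ l.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2l = F.cross_entropy(logits, targets)    # vision -> language
    loss_l2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2l + loss_l2v)
```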

Class Attention to Regions of Lesion for Imbalanced Medical Image Recognition

  • paper_url: http://arxiv.org/abs/2307.10036
  • repo_url: None
  • paper_authors: Jia-Xin Zhuang, Jiabin Cai, Jianguo Zhang, Wei-shi Zheng, Ruixuan Wang
  • for: This paper aims to handle data imbalance in automated medical image classification with a simple yet effective framework, named CARE, which embeds attention into the training process of CNNs so the network attends to lesion regions of rare diseases.
  • methods: The CARE framework uses an attention module that helps CNNs attend to lesion regions of rare diseases during training. The module works only in the training phase and does not change the architecture of the original network. Variants of CARE are also developed that use traditional saliency methods or a pretrained segmentation model to generate bounding boxes automatically.
  • results: The CARE variants with automated bounding box generation are comparable to the original CARE framework with manual bounding box annotations. The method helps the network focus on lesion regions of rare diseases and remarkably improves rare-disease classification on two imbalanced datasets, one of skin images and one of pneumonia.
    Abstract Automated medical image classification is the key component in intelligent diagnosis systems. However, most medical image datasets contain plenty of samples of common diseases and just a handful of rare ones, leading to major class imbalances. Currently, it is an open problem in intelligent diagnosis to effectively learn from imbalanced training data. In this paper, we propose a simple yet effective framework, named \textbf{C}lass \textbf{A}ttention to \textbf{RE}gions of the lesion (CARE), to handle data imbalance issues by embedding attention into the training process of \textbf{C}onvolutional \textbf{N}eural \textbf{N}etworks (CNNs). The proposed attention module helps CNNs attend to lesion regions of rare diseases, therefore helping CNNs to learn their characteristics more effectively. In addition, this attention module works only during the training phase and does not change the architecture of the original network, so it can be directly combined with any existing CNN architecture. The CARE framework needs bounding boxes to represent the lesion regions of rare diseases. To alleviate the need for manual annotation, we further developed variants of CARE by leveraging the traditional saliency methods or a pretrained segmentation model for bounding box generation. Results show that the CARE variants with automated bounding box generation are comparable to the original CARE framework with \textit{manual} bounding box annotations. A series of experiments on an imbalanced skin image dataset and a pneumonia dataset indicates that our method can effectively help the network focus on the lesion regions of rare diseases and remarkably improves the classification performance of rare diseases.
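CARE's exact attention formulation is not given in the abstract; the sketch below illustrates one plausible training-only penalty in the same spirit, discouraging attention mass outside the lesion bounding box for rare-class samples (the function name and loss form are hypothetical, not the paper's definition).

```python
import torch

def lesion_attention_penalty(attn_map, bbox_mask, is_rare):
    """Illustrative training-time penalty in the spirit of CARE: for rare-class
    samples, discourage attention outside the lesion bounding box.
    attn_map:  (B, H, W) non-negative spatial attention (e.g. a CAM).
    bbox_mask: (B, H, W) binary mask, 1 inside the lesion box.
    is_rare:   (B,) binary indicator for rare-disease samples."""
    attn = attn_map / (attn_map.sum(dim=(1, 2), keepdim=True) + 1e-8)
    outside = (attn * (1.0 - bbox_mask)).sum(dim=(1, 2))  # attention mass outside box
    return (outside * is_rare.float()).mean()
```

Such a term would simply be added to the usual classification loss during training and dropped at inference, leaving the network architecture unchanged, which matches the plug-and-play property described above.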

Towards Fair Face Verification: An In-depth Analysis of Demographic Biases

  • paper_url: http://arxiv.org/abs/2307.10011
  • repo_url: None
  • paper_authors: Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
  • for: This paper examines gender, age, and racial biases in face recognition and verification systems, with emphasis on how these biases intersect across combinations of protected groups.
  • methods: Deep learning-based face recognition and verification systems, including widely used cloud-based solutions, are analyzed. Five additional evaluation metrics, including disparate impact and mistreatment metrics, are incorporated to measure fairness beyond accuracy.
  • results: Pervasive biases across race, age, and gender are found: Africans show an 11.25% lower True Positive Rate (TPR) than Caucasians while only a 3.51% accuracy drop is observed, and intersections of multiple protected groups exhibit even larger disparities. The findings are intended to motivate fairer, more equitable face recognition and verification systems.
    Abstract Deep learning-based person identification and verification systems have remarkably improved in terms of accuracy in recent years; however, such systems, including widely popular cloud-based solutions, have been found to exhibit significant biases related to race, age, and gender, a problem that requires in-depth exploration and solutions. This paper presents an in-depth analysis, with a particular emphasis on the intersectionality of these demographic factors. Intersectional bias refers to the performance discrepancies w.r.t. the different combinations of race, age, and gender groups, an area relatively unexplored in current literature. Furthermore, the reliance of most state-of-the-art approaches on accuracy as the principal evaluation metric often masks significant demographic disparities in performance. To counter this crucial limitation, we incorporate five additional metrics in our quantitative analysis, including disparate impact and mistreatment metrics, which are typically ignored by the relevant fairness-aware approaches. Results on the Racial Faces in-the-Wild (RFW) benchmark indicate pervasive biases in face recognition systems, extending beyond race, with different demographic factors yielding significantly disparate outcomes. In particular, Africans demonstrate an 11.25% lower True Positive Rate (TPR) compared to Caucasians, while only a 3.51% accuracy drop is observed. Even more concerning, the intersections of multiple protected groups, such as African females over 60 years old, demonstrate a +39.89% disparate mistreatment rate compared to the highest Caucasians rate. By shedding light on these biases and their implications, this paper aims to stimulate further research towards developing fairer, more equitable face recognition and verification systems.
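The per-group breakdown discussed above can be reproduced with a few lines; the sketch below computes per-group True Positive Rates and a simple disparate-impact ratio. The paper defines its five additional metrics more carefully; these helpers are illustrative only.

```python
import numpy as np

def group_tpr(y_true, y_pred, groups):
    """Per-group True Positive Rate for a binary verification task.
    y_true, y_pred: arrays of 0/1 labels and decisions; groups: array of group ids."""
    tpr = {}
    for g in np.unique(groups):
        m = (groups == g) & (y_true == 1)
        tpr[g] = y_pred[m].mean() if m.any() else float("nan")
    return tpr

def disparate_impact(rates):
    """Ratio of the lowest to the highest group rate (1.0 = parity)."""
    vals = [v for v in rates.values() if not np.isnan(v)]
    return min(vals) / max(vals)
```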

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

  • paper_url: http://arxiv.org/abs/2307.10008
  • repo_url: None
  • paper_authors: Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li
  • for: This paper addresses audio-driven portrait animation, aiming to generate high-fidelity, multimodal talking-portrait videos.
  • methods: The method has three stages: 1) a Mapping-Once network with Dual Attentions (MODA) converts audio into a talking representation; 2) a facial composer network generates dense and detailed facial landmarks; 3) a temporally-guided renderer synthesizes stable videos.
  • results: Compared with previous methods, the generated portrait videos are more natural and realistic.
    Abstract Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, lacking correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages, i.e., 1) Mapping-Once network with Dual Attentions (MODA) generates talking representation from given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities. 2) Facial composer network generates dense and detailed face landmarks, and 3) temporal-guided renderer syntheses stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.

As large as it gets: Learning infinitely large Filters via Neural Implicit Functions in the Fourier Domain

  • paper_url: http://arxiv.org/abs/2307.10001
  • repo_url: None
  • paper_authors: Julia Grabinski, Janis Keuper, Margret Keuper
  • for: investigate how large receptive fields are needed in vision applications, and develop a method to learn large filters without increasing memory consumption during training or inference.
  • methods: learn frequency representations of filter weights as neural implicit functions, allowing for efficient implementation in current frameworks and enabling the use of infinitely large filters with only a few learnable weights.
  • results: achieve results on par with the state-of-the-art on large image classification benchmarks while executing convolutions solely in the frequency domain, and provide an extensive analysis of the learned receptive fields, showing that the learned filters are well-localized and relatively small in the spatial domain.
    Abstract Motivated by the recent trend towards the usage of larger receptive fields for more context-aware neural networks in vision applications, we aim to investigate how large these receptive fields really need to be. To facilitate such study, several challenges need to be addressed, most importantly: (i) We need to provide an effective way for models to learn large filters (potentially as large as the input data) without increasing their memory consumption during training or inference, (ii) the study of filter sizes has to be decoupled from other effects such as the network width or number of learnable parameters, and (iii) the employed convolution operation should be a plug-and-play module that can replace any conventional convolution in a Convolutional Neural Network (CNN) and allow for an efficient implementation in current frameworks. To facilitate such models, we propose to learn not spatial but frequency representations of filter weights as neural implicit functions, such that even infinitely large filters can be parameterized by only a few learnable weights. The resulting neural implicit frequency CNNs are the first models to achieve results on par with the state-of-the-art on large image classification benchmarks while executing convolutions solely in the frequency domain and can be employed within any CNN architecture. They allow us to provide an extensive analysis of the learned receptive fields. Interestingly, our analysis shows that, although the proposed networks could learn very large convolution kernels, the learned filters practically translate into well-localized and relatively small convolution kernels in the spatial domain.
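A hedged sketch of the core mechanism described above: a tiny MLP maps 2D frequency coordinates to a per-channel filter response, and the convolution is executed as a pointwise product in the Fourier domain, so the effective spatial filter can be as large as the input while only the MLP weights are learned. A real-valued frequency response is used here for simplicity, and the module and parameter names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImplicitFrequencyFilter(nn.Module):
    """Filter weights as a neural implicit function over frequency coordinates;
    convolution is applied by multiplication in the Fourier domain."""
    def __init__(self, channels, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        fy = torch.fft.fftfreq(H, device=x.device)
        fx = torch.fft.rfftfreq(W, device=x.device)
        grid = torch.stack(torch.meshgrid(fy, fx, indexing="ij"), dim=-1)  # (H, W//2+1, 2)
        filt = self.mlp(grid).permute(2, 0, 1)   # (C, H, W//2+1) per-channel response
        X = torch.fft.rfft2(x)                   # frequency representation of the input
        return torch.fft.irfft2(X * filt, s=(H, W))
```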

Mitigating Viewer Impact from Disturbing Imagery using AI Filters: A User-Study

  • paper_url: http://arxiv.org/abs/2307.10334
  • repo_url: None
  • paper_authors: Ioannis Sarridis, Jochen Spangenberg, Olga Papadopoulou, Symeon Papadopoulos
  • for: This study investigates whether Artificial Intelligence (AI)-based image filters can mitigate the emotional impact of viewing disturbing imagery.
  • methods: A user study with 107 participants, mainly journalists and human rights investigators, tested five filter styles, both traditional (Blurring and Partial Blurring) and AI-based (Drawing, Colored Drawing, and Painting), measuring how well each conveys image information while reducing emotional distress.
  • results: The AI-based Drawing style filter performed best, reducing negative feelings by 30.38% while preserving image interpretability (97.19%). Participants also suggested strategies for integrating AI filters into their workflow, such as using them as an initial, preparatory step before inspecting the original image.
    Abstract Exposure to disturbing imagery can significantly impact individuals, especially professionals who encounter such content as part of their work. This paper presents a user study, involving 107 participants, predominantly journalists and human rights investigators, that explores the capability of Artificial Intelligence (AI)-based image filters to potentially mitigate the emotional impact of viewing such disturbing content. We tested five different filter styles, both traditional (Blurring and Partial Blurring) and AI-based (Drawing, Colored Drawing, and Painting), and measured their effectiveness in terms of conveying image information while reducing emotional distress. Our findings suggest that the AI-based Drawing style filter demonstrates the best performance, offering a promising solution for reducing negative feelings (-30.38%) while preserving the interpretability of the image (97.19%). Despite the requirement for many professionals to eventually inspect the original images, participants suggested potential strategies for integrating AI filters into their workflow, such as using AI filters as an initial, preparatory step before viewing the original image. Overall, this paper contributes to the development of a more ethically considerate and effective visual environment for professionals routinely engaging with potentially disturbing imagery.

TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical Phase Recognition

  • paper_url: http://arxiv.org/abs/2307.09997
  • repo_url: None
  • paper_authors: Isabel Funke, Dominik Rivoir, Stefanie Krell, Stefanie Speidel
  • for: The goal is context-aware computer assistance in the operating room of the future, which requires automatically recognizing which surgical phase the medical team is performing from video.
  • methods: The approach extracts features from the video stream and models temporal information with attention mechanisms that capture long-range dependencies. In particular, a new temporal model, TUNeS, is proposed that incorporates self-attention at the coarsest stage of a U-Net-like structure, without resorting to local attention or regularization of attention weights.
  • results: In experiments, all temporal models improved when built on feature extractors trained with longer temporal context (longer video segments), and TUNeS achieves state-of-the-art results on the Cholec80 dataset.
    Abstract To enable context-aware computer assistance in the operating room of the future, cognitive systems need to understand automatically which surgical phase is being performed by the medical team. The primary source of information for surgical phase recognition is typically video, which presents two challenges: extracting meaningful features from the video stream and effectively modeling temporal information in the sequence of visual features. For temporal modeling, attention mechanisms have gained popularity due to their ability to capture long-range dependencies. In this paper, we explore design choices for attention in existing temporal models for surgical phase recognition and propose a novel approach that does not resort to local attention or regularization of attention weights: TUNeS is an efficient and simple temporal model that incorporates self-attention at the coarsest stage of a U-Net-like structure. In addition, we propose to train the feature extractor, a standard CNN, together with an LSTM on preferably long video segments, i.e., with long temporal context. In our experiments, all temporal models performed better on top of feature extractors that were trained with longer temporal context. On top of these contextualized features, TUNeS achieves state-of-the-art results on Cholec80.
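A compact sketch of a TUNeS-style model: 1D convolutional down/upsampling over per-frame features with self-attention only at the coarsest stage. Layer sizes, skip connections, and the classification head are illustrative rather than the authors' architecture, and the sequence length is assumed divisible by 4.

```python
import torch
import torch.nn as nn

class TemporalUNetWithAttention(nn.Module):
    """Temporal U-Net over frame features; global self-attention at the bottleneck."""
    def __init__(self, feat_dim=2048, hidden=256, num_classes=7, heads=8):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, hidden, 1)
        self.down1 = nn.Conv1d(hidden, hidden, 3, stride=2, padding=1)
        self.down2 = nn.Conv1d(hidden, hidden, 3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.up1 = nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1)
        self.head = nn.Conv1d(hidden, num_classes, 1)

    def forward(self, feats):                 # feats: (B, T, feat_dim), T divisible by 4
        x = self.proj(feats.transpose(1, 2))  # (B, hidden, T)
        d1 = torch.relu(self.down1(x))        # (B, hidden, T/2)
        d2 = torch.relu(self.down2(d1))       # (B, hidden, T/4) coarsest stage
        a = d2.transpose(1, 2)                # (B, T/4, hidden)
        a, _ = self.attn(a, a, a)             # self-attention at the coarsest stage
        u1 = torch.relu(self.up1(a.transpose(1, 2))) + d1
        u2 = torch.relu(self.up2(u1)) + x
        return self.head(u2).transpose(1, 2)  # (B, T, num_classes) phase logits
```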

Impact of Disentanglement on Pruning Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09994
  • repo_url: None
  • paper_authors: Carl Shneider, Peyman Rostami, Anis Kacem, Nilotpal Sinha, Abd El Rahman Shabayek, Djamila Aouada
  • for: The goal is to enable deployment of deep neural networks on edge devices by reducing their memory footprint, power consumption, and latency through model compression.
  • methods: The Beta-VAE framework is combined with a standard pruning criterion to investigate how forcing the network to learn disentangled representations affects the pruning process for classification.
  • results: Experiments on the MNIST and CIFAR10 datasets examine the impact of enforcing disentangled representations on pruning, highlight the disentanglement challenges involved, and propose a path forward for future work.
    Abstract Deploying deep learning neural networks on edge devices, to accomplish task specific objectives in the real-world, requires a reduction in their memory footprint, power consumption, and latency. This can be realized via efficient model compression. Disentangled latent representations produced by variational autoencoder (VAE) networks are a promising approach for achieving model compression because they mainly retain task-specific information, discarding useless information for the task at hand. We make use of the Beta-VAE framework combined with a standard criterion for pruning to investigate the impact of forcing the network to learn disentangled representations on the pruning process for the task of classification. In particular, we perform experiments on MNIST and CIFAR10 datasets, examine disentanglement challenges, and propose a path forward for future works.
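For reference, a minimal sketch of the two ingredients named above: the Beta-VAE objective and a standard magnitude-based pruning criterion. The paper's exact pruning setup and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def beta_vae_loss(recon, x, mu, logvar, beta=4.0):
    """Standard Beta-VAE objective: reconstruction + beta-weighted KL term.
    A larger beta pressures the latent code toward disentangled factors."""
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.size(0)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon_loss + beta * kld

def magnitude_prune(model, amount=0.5):
    """Global L1-magnitude pruning, one example of a 'standard criterion for
    pruning'; the paper's exact criterion may differ."""
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
```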

TinyTrain: Deep Neural Network Training at the Extreme Edge

  • paper_url: http://arxiv.org/abs/2307.09988
  • repo_url: None
  • paper_authors: Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo
  • for: This work aims to make on-device training feasible under the memory and compute constraints of edge devices and microcontroller units, supporting user personalization and privacy despite scarce labeled user data.
  • methods: TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity; its task-adaptive sparse-update method dynamically selects layers/channels via a multi-objective criterion that jointly captures user data and the memory and compute capabilities of the target device.
  • results: TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy while reducing backward-pass memory and computation cost by up to 2,286x and 7.68x, respectively; it achieves 9.5x faster and 3.5x more energy-efficient training than status-quo approaches and a 2.8x smaller memory footprint than SOTA approaches, within the 1 MB memory envelope of MCU-grade platforms.
    Abstract On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCU), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss ($\geq$10\%). We propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0\% in accuracy, while reducing the backward-pass memory and computation cost by up to 2,286$\times$ and 7.68$\times$, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5$\times$ faster and 3.5$\times$ more energy-efficient training over status-quo approaches, and 2.8$\times$ smaller memory footprint than SOTA approaches, while remaining within the 1 MB memory envelope of MCU-grade platforms.
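The abstract does not spell out the multi-objective criterion, so the sketch below only illustrates the select-then-freeze pattern of task-adaptive sparse updating, scoring parameter tensors by gradient magnitude per unit of memory cost on a small proxy batch. All names and the scoring rule are assumptions, not TinyTrain's actual criterion.

```python
import torch

def select_layers_to_update(model, proxy_loss, budget=3):
    """Score each parameter tensor by gradient magnitude per parameter on a
    small proxy batch, then unfreeze only the top-`budget` tensors."""
    model.zero_grad()
    proxy_loss.backward()                              # gradients from a few labeled samples
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            scores[name] = p.grad.abs().sum().item() / p.numel()  # benefit per cost proxy
    keep = set(sorted(scores, key=scores.get, reverse=True)[:budget])
    for name, p in model.named_parameters():
        p.requires_grad_(name in keep)                 # freeze everything else
    return keep
```

After selection, an optimizer would be built over only the unfrozen parameters, which is what shrinks the backward-pass memory and computation.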

Lazy Visual Localization via Motion Averaging

  • paper_url: http://arxiv.org/abs/2307.09981
  • repo_url: None
  • paper_authors: Siyan Dong, Shaohui Liu, Hengkai Guo, Baoquan Chen, Marc Pollefeys
  • for: This work targets visual (re)localization: estimating accurate 6-DoF camera poses for query images from a set of posed database images.
  • methods: Instead of reconstructing 3D metric maps, the method achieves accurate localization through a tailored motion averaging over database-query image pairs.
  • results: Experiments show localization accuracy comparable to state-of-the-art structure-based methods, and the approach easily extends to complex configurations such as multi-query co-localization and camera rigs.
    Abstract Visual (re)localization is critical for various applications in computer vision and robotics. Its goal is to estimate the 6 degrees of freedom (DoF) camera pose for each query image, based on a set of posed database images. Currently, all leading solutions are structure-based that either explicitly construct 3D metric maps from the database with structure-from-motion, or implicitly encode the 3D information with scene coordinate regression models. On the contrary, visual localization without reconstructing the scene in 3D offers clear benefits. It makes deployment more convenient by reducing database pre-processing time, releasing storage requirements, and remaining unaffected by imperfect reconstruction, etc. In this technical report, we demonstrate that it is possible to achieve high localization accuracy without reconstructing the scene from the database. The key to achieving this owes to a tailored motion averaging over database-query pairs. Experiments show that our visual localization proposal, LazyLoc, achieves comparable performance against state-of-the-art structure-based methods. Furthermore, we showcase the versatility of LazyLoc, which can be easily extended to handle complex configurations such as multi-query co-localization and camera rigs.

ProtoCaps: A Fast and Non-Iterative Capsule Network Routing Method

  • paper_url: http://arxiv.org/abs/2307.09944
  • repo_url: None
  • paper_authors: Miles Everett, Mingjun Zhong, Georgios Leontidis
  • for: To enhance the operational efficiency and performance of Capsule Networks
  • methods: A novel, non-iterative routing mechanism inspired by trainable prototype clustering, combined with a shared Capsule subspace that removes the need to project each lower-level Capsule to each higher-level Capsule
  • results: Superior results compared to the current best non-iterative Capsule Network, including on the Imagewoof dataset, which is too computationally demanding for iterative routing approaches to handle efficiently
    Abstract Capsule Networks have emerged as a powerful class of deep learning architectures, known for robust performance with relatively few parameters compared to Convolutional Neural Networks (CNNs). However, their inherent efficiency is often overshadowed by their slow, iterative routing mechanisms which establish connections between Capsule layers, posing computational challenges resulting in an inability to scale. In this paper, we introduce a novel, non-iterative routing mechanism, inspired by trainable prototype clustering. This innovative approach aims to mitigate computational complexity, while retaining, if not enhancing, performance efficacy. Furthermore, we harness a shared Capsule subspace, negating the need to project each lower-level Capsule to each higher-level Capsule, thereby significantly reducing memory requisites during training. Our approach demonstrates superior results compared to the current best non-iterative Capsule Network and tests on the Imagewoof dataset, which is too computationally demanding to handle efficiently by iterative approaches. Our findings underscore the potential of our proposed methodology in enhancing the operational efficiency and performance of Capsule Networks, paving the way for their application in increasingly complex computational scenarios.
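A hedged sketch of what prototype-based, non-iterative routing with a shared capsule subspace could look like; the dimensions, cosine-similarity choice, and aggregation rule are assumptions, not the ProtoCaps definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeRouting(nn.Module):
    """Lower capsules are projected into one shared subspace and compared
    against trainable prototypes (one per higher capsule); softmax similarities
    act as routing coefficients in a single pass, with no iterative loop."""
    def __init__(self, in_dim, out_caps, shared_dim=16):
        super().__init__()
        self.shared = nn.Linear(in_dim, shared_dim)          # shared capsule subspace
        self.prototypes = nn.Parameter(torch.randn(out_caps, shared_dim))

    def forward(self, lower):                    # lower: (B, in_caps, in_dim)
        u = self.shared(lower)                   # (B, in_caps, shared_dim)
        sim = F.normalize(u, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
        route = sim.softmax(dim=-1)              # (B, in_caps, out_caps) routing weights
        higher = torch.einsum("bio,bid->bod", route, u)      # weighted aggregation
        return higher                            # (B, out_caps, shared_dim)
```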

AGAR: Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects

  • paper_url: http://arxiv.org/abs/2307.09936
  • repo_url: None
  • paper_authors: Pedro Gomes, Silvia Rossi, Laura Toni
  • for: This work targets motion prediction for 3D point cloud sequences of deformable objects (such as human bodies), investigating the challenges that deformable shapes and complex motions pose for state-of-the-art models.
  • methods: An improved architecture for point cloud prediction of deformable 3D objects is proposed: a graph-based approach learns and exploits the spatial structure of point clouds to extract more representative features, and an adaptive module combines local and global motions per point to better model complex motions.
  • results: Experiments on MNIST moving digits, Mixamo human body motions, and the JPEG and CWIPC-SXR real-world dynamic body datasets show that the method outperforms baselines at modeling complex movements while preserving point cloud shape; applying the framework to action recognition on MSRAction3D achieves results on par with state-of-the-art methods, demonstrating its generalizability for dynamic feature learning.
    Abstract This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then we propose a module able to combine the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the Mixamo human bodies motions, JPEG and CWIPC-SXR real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning, by testing the framework for action recognition on the MSRAction3D dataset and achieving results on-par with state-of-the-art methods

DISA: DIfferentiable Similarity Approximation for Universal Multimodal Registration

  • paper_url: http://arxiv.org/abs/2307.09931
  • repo_url: https://github.com/imfusiongmbh/disa-universal-multimodal-registration
  • paper_authors: Matteo Ronchetti, Wolfgang Wein, Nassir Navab, Oliver Zettinig, Raphael Prevost
  • for: This paper proposes a fast and flexible image registration method for multimodal imaging.
  • methods: Existing similarity metrics are approximated by a dot product in the feature space of a small convolutional neural network (CNN), which is inherently differentiable and can be trained without registered data.
  • results: Experiments on three different datasets show that the method generalizes well beyond the training data without specialized retraining, is several orders of magnitude faster than local patch-based metrics, and can be directly applied in clinical settings by replacing the similarity measure.
    Abstract Multimodal image registration is a challenging but essential step for numerous image-guided procedures. Most registration algorithms rely on the computation of complex, frequently non-differentiable similarity metrics to deal with the appearance discrepancy of anatomical structures between imaging modalities. Recent Machine Learning based approaches are limited to specific anatomy-modality combinations and do not generalize to new settings. We propose a generic framework for creating expressive cross-modal descriptors that enable fast deformable global registration. We achieve this by approximating existing metrics with a dot-product in the feature space of a small convolutional neural network (CNN) which is inherently differentiable can be trained without registered data. Our method is several orders of magnitude faster than local patch-based metrics and can be directly applied in clinical settings by replacing the similarity measure with the proposed one. Experiments on three different datasets demonstrate that our approach generalizes well beyond the training data, yielding a broad capture range even on unseen anatomies and modality pairs, without the need for specialized retraining. We make our training code and data publicly available.
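The central idea, replacing a hand-crafted similarity metric with a dot product of learned cross-modal features, can be sketched as follows; the 2D toy setting, network sizes, and normalization are assumptions, and DISA itself is trained to approximate existing metrics rather than from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductSimilarity(nn.Module):
    """Two small CNNs embed each modality into a shared feature space; image
    similarity is the mean pixel-wise dot product of the feature maps. Being a
    simple differentiable dot product, it can drive a registration optimizer."""
    def __init__(self, channels=16):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
        self.enc_a, self.enc_b = encoder(), encoder()   # one encoder per modality

    def forward(self, img_a, img_b):            # (B, 1, H, W) images of either modality
        fa = F.normalize(self.enc_a(img_a), dim=1)
        fb = F.normalize(self.enc_b(img_b), dim=1)
        return (fa * fb).sum(dim=1).mean()      # higher value = more similar
```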

Measuring and Modeling Uncertainty Degree for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.09929
  • repo_url: None
  • paper_authors: Mochu Xiang, Jing Zhang, Nick Barnes, Yuchao Dai
  • for: This paper aims to improve the reliability of monocular depth estimation (MDE) models by modeling uncertainty from the perspective of inherent probability distributions.
  • methods: The proposed method introduces additional training regularization terms to estimate uncertainty more fairly and with more comprehensive metrics, without requiring extra modules or multiple inferences.
  • results: The proposed method provides state-of-the-art reliability in uncertainty estimations, and can be further improved when combined with ensemble or sampling methods, as demonstrated in a series of experiments.
    Abstract Effectively measuring and modeling the reliability of a trained model is essential to the real-world deployment of monocular depth estimation (MDE) models. However, the intrinsic ill-posedness and ordinal-sensitive nature of MDE pose major challenges to the estimation of uncertainty degree of the trained models. On the one hand, utilizing current uncertainty modeling methods may increase memory consumption and are usually time-consuming. On the other hand, measuring the uncertainty based on model accuracy can also be problematic, where uncertainty reliability and prediction accuracy are not well decoupled. In this paper, we propose to model the uncertainty of MDE models from the perspective of the inherent probability distributions originating from the depth probability volume and its extensions, and to assess it more fairly with more comprehensive metrics. By simply introducing additional training regularization terms, our model, with surprisingly simple formations and without requiring extra modules or multiple inferences, can provide uncertainty estimations with state-of-the-art reliability, and can be further improved when combined with ensemble or sampling methods. A series of experiments demonstrate the effectiveness of our methods.
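The abstract ties uncertainty to the depth probability volume; the sketch below shows the simplest reading of that idea, taking the per-pixel mean of a discretized depth distribution as the estimate and its variance as an uncertainty degree. The paper's additional regularization terms and evaluation metrics are not reproduced here.

```python
import torch

def depth_and_uncertainty(depth_logits, depth_bins):
    """Softmax over discretized depth bins gives a per-pixel distribution whose
    mean is the depth estimate and whose variance serves as an uncertainty degree.
    depth_logits: (B, D, H, W) network output over D depth hypotheses.
    depth_bins:   (D,) depth value of each hypothesis."""
    prob = depth_logits.softmax(dim=1)                         # per-pixel distribution
    bins = depth_bins.view(1, -1, 1, 1)
    mean = (prob * bins).sum(dim=1)                            # expected depth (B, H, W)
    var = (prob * (bins - mean.unsqueeze(1)) ** 2).sum(dim=1)  # uncertainty (B, H, W)
    return mean, var
```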

Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

  • paper_url: http://arxiv.org/abs/2307.09915
  • repo_url: None
  • paper_authors: Zijie Song, Zhenzhen Hu, Richang Hong
  • for: This paper addresses the cross-lingual and cross-modal challenges of cross-lingual image captioning by modeling the associations between the image and different languages.
  • methods: A heterogeneous network establishes cross-domain relationships and local correspondences between the image and different languages. The proposed Embedded Heterogeneous Attention Transformer (EHAT) consists of Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA).
  • results: On the MSCOCO dataset, generating English and Chinese captions, the method achieves better results than advanced monolingual methods.
    Abstract Cross-lingual image captioning is confronted with both cross-lingual and cross-modal challenges for multimedia analysis. The crucial issue in this task is to model the global and local matching between the image and different languages. Existing cross-modal embedding methods based on Transformer architecture oversight the local matching between the image region and monolingual words, not to mention in the face of a variety of differentiated languages. Due to the heterogeneous property of the cross-modal and cross-lingual task, we utilize the heterogeneous network to establish cross-domain relationships and the local correspondences between the image and different languages. In this paper, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to build reasoning paths bridging cross-domain for cross-lingual image captioning and integrate into transformer. The proposed EHAT consists of a Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN) and Heterogeneous Co-attention (HCA). HARN as the core network, models and infers cross-domain relationship anchored by vision bounding box representation features to connect two languages word features and learn the heterogeneous maps. MHCA and HCA implement cross-domain integration in the encoder through the special heterogeneous attention and enable single model to generate two language captioning. We test on MSCOCO dataset to generate English and Chinese, which are most widely used and have obvious difference between their language families. Our experiments show that our method even achieve better than advanced monolingual methods.

Learning from Abstract Images: on the Importance of Occlusion in a Minimalist Encoding of Human Poses

  • paper_url: http://arxiv.org/abs/2307.09893
  • repo_url: None
  • paper_authors: Saad Manzur, Wayne Hayes
  • for: To improve the cross-dataset performance of 2D-to-3D pose lifting networks.
  • methods: Opaque 3D limbs are used to preserve occlusion information while implicitly encoding joint locations; without part-maps, this representation allows training on abstract synthetic images, with occlusion, rendered from as many viewpoints as desired.
  • results: Training on such abstract synthetic images yields improved same-dataset benchmark results and a "quantum leap" in cross-dataset benchmarks.
    Abstract Existing 2D-to-3D pose lifting networks suffer from poor performance in cross-dataset benchmarks. Although the use of 2D keypoints joined by "stick-figure" limbs has shown promise as an intermediate step, stick-figures do not account for occlusion information that is often inherent in an image. In this paper, we propose a novel representation using opaque 3D limbs that preserves occlusion information while implicitly encoding joint locations. Crucially, when training on data with accurate three-dimensional keypoints and without part-maps, this representation allows training on abstract synthetic images, with occlusion, from as many synthetic viewpoints as desired. The result is a pose defined by limb angles rather than joint positions $\unicode{x2013}$ because poses are, in the real world, independent of cameras $\unicode{x2013}$ allowing us to predict poses that are completely independent of camera viewpoint. The result provides not only an improvement in same-dataset benchmarks, but a "quantum leap" in cross-dataset benchmarks.

3Deformer: A Common Framework for Image-Guided Mesh Deformation

  • paper_url: http://arxiv.org/abs/2307.09892
  • repo_url: None
  • paper_authors: Hao Su, Xuefeng Liu, Jianwei Niu, Ji Wan, Xinghao Wu
  • for: Development of a common framework for image-guided 3D mesh deformation and editing.
  • methods: Based on the differentiable renderer technique, the source mesh is deformed according to correspondences between the semantic image and mesh materials; a hierarchical optimization architecture plus various strategies and losses guarantee deformation accuracy, surface smoothness, geometric rigidity, and global synchronization of the edited mesh.
  • results: In evaluation experiments, the proposed 3Deformer produces impressive results and reaches the state-of-the-art level.
    Abstract We propose 3Deformer, a general-purpose framework for interactive 3D shape editing. Given a source 3D mesh with semantic materials, and a user-specified semantic image, 3Deformer can accurately edit the source mesh following the shape guidance of the semantic image, while preserving the source topology as rigid as possible. Recent studies of 3D shape editing mostly focus on learning neural networks to predict 3D shapes, which requires high-cost 3D training datasets and is limited to handling objects involved in the datasets. Unlike these studies, our 3Deformer is a non-training and common framework, which only requires supervision of readily-available semantic images, and is compatible with editing various objects unlimited by datasets. In 3Deformer, the source mesh is deformed utilizing the differentiable renderer technique, according to the correspondences between semantic images and mesh materials. However, guiding complex 3D shapes with a simple 2D image incurs extra challenges, that is, the deform accuracy, surface smoothness, geometric rigidity, and global synchronization of the edited mesh should be guaranteed. To address these challenges, we propose a hierarchical optimization architecture to balance the global and local shape features, and propose further various strategies and losses to improve properties of accuracy, smoothness, rigidity, and so on. Extensive experiments show that our 3Deformer is able to produce impressive results and reaches the state-of-the-art level.

A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted Drones

  • paper_url: http://arxiv.org/abs/2307.09880
  • repo_url: None
  • paper_authors: Liekang Zeng, Haowei Chen, Daipeng Feng, Xiaoxi Zhang, Xu Chen
  • for: To improve the accuracy and safety of autonomous drone navigation with edge assistance.
  • methods: Deep neural networks are combined with dynamic adjustment of the task execution location, input resolution, and image compression ratio to lower inference latency and raise prediction accuracy; a deep reinforcement learning-based neural scheduler on the drone and a network-aware resource allocation algorithm on the edge server support this.
  • results: End-to-end latency is reduced by 28.06% and flight distance extended by up to 27.28% compared with non-adaptive solutions.
    Abstract Accurate navigation is of paramount importance to ensure flight safety and efficiency for autonomous drones. Recent research starts to use Deep Neural Networks to enhance drone navigation given their remarkable predictive capability for visual perception. However, existing solutions either run DNN inference tasks on drones in situ, impeded by the limited onboard resource, or offload the computation to external servers which may incur large network latency. Few works consider jointly optimizing the offloading decisions along with image transmission configurations and adapting them on the fly. In this paper, we propose A3D, an edge server assisted drone navigation framework that can dynamically adjust task execution location, input resolution, and image compression ratio in order to achieve low inference latency, high prediction accuracy, and long flight distances. Specifically, we first augment state-of-the-art convolutional neural networks for drone navigation and define a novel metric called Quality of Navigation as our optimization objective which can effectively capture the above goals. We then design a deep reinforcement learning based neural scheduler at the drone side for which an information encoder is devised to reshape the state features and thus improve its learning ability. To further support simultaneous multi-drone serving, we extend the edge server design by developing a network-aware resource allocation algorithm, which allows provisioning containerized resources aligned with drones' demand. We finally implement a proof-of-concept prototype with realistic devices and validate its performance in a real-world campus scene, as well as a simulation environment for thorough evaluation upon AirSim. Extensive experimental results show that A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.

BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.09861
  • repo_url: https://github.com/majitao-xd/bsdm-had
  • paper_authors: Jitao Ma, Weiying Xie, Yunsong Li, Leyuan Fang
  • for: This work aims to improve hyperspectral anomaly detection (HAD) by solving the problem of complex backgrounds in hyperspectral images (HSIs) without labeled samples.
  • methods: A background suppression diffusion model (BSDM) is proposed, which learns latent background distributions from designed pseudo background noise, generalizes to datasets from different domains via a statistical offset module, and feeds the original HSIs into the denoising network at inference so that the background is removed as noise.
  • results: The method suppresses background interference and improves HAD performance while adapting to different datasets, as demonstrated by assessments and generalization experiments on several real HSI datasets.
    Abstract Hyperspectral anomaly detection (HAD) is widely used in Earth observation and deep space exploration. A major challenge for HAD is the complex background of the input hyperspectral images (HSIs), resulting in anomalies confused in the background. On the other hand, the lack of labeled samples for HSIs leads to poor generalization of existing HAD methods. This paper starts the first attempt to study a new and generalizable background learning problem without labeled samples. We present a novel solution BSDM (background suppression diffusion model) for HAD, which can simultaneously learn latent background distributions and generalize to different datasets for suppressing complex background. It is featured in three aspects: (1) For the complex background of HSIs, we design pseudo background noise and learn the potential background distribution in it with a diffusion model (DM). (2) For the generalizability problem, we apply a statistical offset module so that the BSDM adapts to datasets of different domains without labeling samples. (3) For achieving background suppression, we innovatively improve the inference process of DM by feeding the original HSIs into the denoising network, which removes the background as noise. Our work paves a new background suppression way for HAD that can improve HAD performance without the prerequisite of manually labeled data. Assessments and generalization experiments of four HAD methods on several real HSI datasets demonstrate the above three unique properties of the proposed method. The code is available at https://github.com/majitao-xd/BSDM-HAD.

Blind Image Quality Assessment Using Multi-Stream Architecture with Spatial and Channel Attention

  • paper_url: http://arxiv.org/abs/2307.09857
  • repo_url: None
  • paper_authors: Hassan Khalid, Nisar Ahmed
  • for: This work proposes a multi-stream spatial and channel attention-based blind image quality assessment algorithm to improve the accuracy and reliability of quality prediction.
  • methods: Hybrid features are generated from two different backbones, and spatial and channel attention then assign high weights to the region of interest.
  • results: The algorithm predicts image quality more accurately, with a high correlation to human perceptual assessment, and shows excellent generalization on authentic and synthetic distortion databases, with a particular focus on perceptual foreground information.
    Abstract BIQA (Blind Image Quality Assessment) is an important field of study that evaluates images automatically. Although significant progress has been made, blind image quality assessment remains a difficult task since images vary in content and distortions. Most algorithms generate quality without emphasizing the important region of interest. In order to solve this, a multi-stream spatial and channel attention-based algorithm is being proposed. This algorithm generates more accurate predictions with a high correlation to human perceptual assessment by combining hybrid features from two different backbones, followed by spatial and channel attention to provide high weights to the region of interest. Four legacy image quality assessment datasets are used to validate the effectiveness of our proposed approach. Authentic and synthetic distortion image databases are used to demonstrate the effectiveness of the proposed method, and we show that it has excellent generalization properties with a particular focus on the perceptual foreground information.
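A hedged sketch of the spatial- and channel-attention re-weighting described above, applied to a fused feature map from two backbones; it follows the common squeeze-and-excitation plus spatial-gate pattern, and the paper's exact attention design and multi-stream fusion may differ.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Re-weight a fused feature map along channels, then along spatial positions,
    so regions of interest receive higher weights before quality regression."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_gate = nn.Sequential(            # channel attention (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(            # spatial attention over H x W
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):                             # x: (B, C, H, W) fused features
        x = x * self.channel_gate(x)                  # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_gate(pooled)          # emphasize regions of interest
```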

Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

  • paper_url: http://arxiv.org/abs/2307.09856
  • repo_url: None
  • paper_authors: Lei Wang, Bo Liu, Fangfang Liang, Bincheng Wang
  • for: This work proposes a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine.
  • methods: The framework combines hierarchical clustering analysis, adaptive region-based motion extractors (ARME) stacked top-down, adaptive spatio-temporal pooling (ASTP) for hierarchical feature mapping, and frame-level temporal aggregation (FTA) for multi-scale temporal downsampling.
  • results: On the CASIA-B, OUMVLP, GREW, and Gait3D datasets, the method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.
    Abstract Gait recognition is a biometric technique that identifies individuals by their unique walking styles, which is suitable for unconstrained environments and has a wide range of applications. While current methods focus on exploiting body part-based representations, they often neglect the hierarchical dependencies between local motion patterns. In this paper, we propose a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine. Our framework starts with a hierarchical clustering analysis to recover multi-level body structures from the whole body to local details. Next, an adaptive region-based motion extractor (ARME) is designed to learn region-independent motion features. The proposed HSTL then stacks multiple ARMEs in a top-down manner, with each ARME corresponding to a specific partition level of the hierarchy. An adaptive spatio-temporal pooling (ASTP) module is used to capture gait features at different levels of detail to perform hierarchical feature mapping. Finally, a frame-level temporal aggregation (FTA) module is employed to reduce redundant information in gait sequences through multi-scale temporal downsampling. Extensive experiments on CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate that our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.
    摘要 步态识别是一种通过个体独特的行走方式进行身份识别的生物特征技术,适用于无约束环境,具有广泛的应用前景。现有方法通常利用基于身体部位的表征,但经常忽略局部运动模式之间的层次依赖关系。本文提出了一种层次时空表征学习(HSTL)框架,用于由粗到细地提取步态特征。我们的框架首先进行层次聚类分析,以恢复从全身到局部细节的多级身体结构。接着,我们设计了一种自适应区域运动特征提取器(ARME),用于学习区域无关的运动特征。HSTL以自顶向下的方式堆叠多个ARME,每个ARME对应层次结构中特定的划分级别。自适应时空池化(ASTP)模块用于在不同细节级别上捕捉步态特征,以进行层次特征映射。最后,帧级时间聚合(FTA)模块通过多尺度时间下采样减少步态序列中的冗余信息。在CASIA-B、OUMVLP、GREW和Gait3D数据集上的大量实验表明,我们的方法优于现有最先进方法,同时在模型精度和复杂度之间保持了合理的平衡。

Cryo-forum: A framework for orientation recovery with uncertainty measure with the application in cryo-EM image analysis

  • paper_url: http://arxiv.org/abs/2307.09847
  • repo_url: https://github.com/phonchi/cryo-forum
  • paper_authors: Szu-Chi Chung
  • for: 这篇论文旨在解决单颗粒冷冻电子显微镜(cryo-EM)中二维投影图像取向参数的高效确定问题。该问题对三维结构重建至关重要,但由于数据噪声大且含有异常值,通常需要耗费大量时间进行二维数据清理。
  • methods: 论文提出了一种新方法,使用10维特征向量表示取向,并通过求解二次约束二次规划(Quadratically-Constrained Quadratic Program)将预测取向表示为单位四元数,同时给出不确定性度量。此外,论文还提出了一种考虑取向之间成对距离的新损失函数,以提高方法的准确性。
  • results: 数值分析表明,该方法能够以端到端方式从二维cryo-EM图像中有效恢复取向,并且借助不确定性量化可以直接在三维层面清理数据集。作者还将所提方法打包为名为cryo-forum的用户友好软件套件,便于开发者使用。
    Abstract In single-particle cryo-electron microscopy (cryo-EM), the efficient determination of orientation parameters for 2D projection images poses a significant challenge yet is crucial for reconstructing 3D structures. This task is complicated by the high noise levels present in the cryo-EM datasets, which often include outliers, necessitating several time-consuming 2D clean-up processes. Recently, solutions based on deep learning have emerged, offering a more streamlined approach to the traditionally laborious task of orientation estimation. These solutions often employ amortized inference, eliminating the need to estimate parameters individually for each image. However, these methods frequently overlook the presence of outliers and may not adequately concentrate on the components used within the network. This paper introduces a novel approach that uses a 10-dimensional feature vector to represent the orientation and applies a Quadratically-Constrained Quadratic Program to derive the predicted orientation as a unit quaternion, supplemented by an uncertainty metric. Furthermore, we propose a unique loss function that considers the pairwise distances between orientations, thereby enhancing the accuracy of our method. Finally, we also comprehensively evaluate the design choices involved in constructing the encoder network, a topic that has not received sufficient attention in the literature. Our numerical analysis demonstrates that our methodology effectively recovers orientations from 2D cryo-EM images in an end-to-end manner. Importantly, the inclusion of uncertainty quantification allows for direct clean-up of the dataset at the 3D level. Lastly, we package our proposed methods into a user-friendly software suite named cryo-forum, designed for easy accessibility by the developers.
    摘要 在单颗粒冷冻电子显微镜(cryo-EM)中,高效确定二维投影图像的取向参数是一个重要挑战,同时也是重建三维结构的关键。该任务受到cryo-EM数据集中高噪声水平(包括异常值)的影响,因此通常需要多次耗时的二维数据清理过程。最近,基于深度学习的解决方案为传统上劳动密集的取向估计任务提供了更加简洁的途径。这些方案通常采用摊销式推断,从而无需为每张图像单独估计参数。然而,这些方法往往忽视异常值的存在,也没有充分考察网络中所用组件的设计。本文提出了一种新方法,使用10维特征向量表示取向,并通过二次约束二次规划将预测取向表示为单位四元数,同时附加一个不确定性度量。此外,我们提出了一种考虑取向之间成对距离的新损失函数,从而提高了方法的准确性。最后,我们还对编码器网络的设计选择进行了全面评估,这一主题在文献中尚未得到足够关注。数值分析表明,我们的方法能够以端到端方式从二维cryo-EM图像中有效恢复取向。更重要的是,不确定性量化使得可以直接在三维层面清理数据集。最后,我们将所提方法打包为名为cryo-forum的用户友好软件套件,便于开发者使用。
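A minimal sketch of the orientation-distance idea mentioned in the cryo-forum entry above: unit quaternions are compared with an antipodally symmetric angular distance, and a loss matches predicted pairwise distances within a batch to the ground-truth ones. This is one plausible reading of the paper's "pairwise distances between orientations" loss, not the official formulation.

```python
import torch

def quat_distance(q1, q2):
    """Angular distance between unit quaternions (q and -q identified)."""
    dot = torch.sum(q1 * q2, dim=-1).abs().clamp(max=1.0 - 1e-7)
    return 2.0 * torch.acos(dot)                     # radians in [0, pi]

def pairwise_orientation_loss(pred_q, true_q):
    """Hypothetical loss: make predicted pairwise orientation distances
    within a batch match the ground-truth pairwise distances."""
    d_pred = quat_distance(pred_q[:, None, :], pred_q[None, :, :])   # (B, B)
    d_true = quat_distance(true_q[:, None, :], true_q[None, :, :])   # (B, B)
    return torch.mean((d_pred - d_true) ** 2)
```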

Compressive Image Scanning Microscope

  • paper_url: http://arxiv.org/abs/2307.09841
  • repo_url: None
  • paper_authors: Ajay Gunalan, Marco Castello, Simonluca Piazza, Shunlei Li, Alberto Diaspro, Leonardo S. Mattos, Paolo Bianchini
  • for: 提高图像扫描显微镜(ISM)的成像质量并减少数据采集时间
  • methods: 使用单光子雪崩二极管(SPAD)阵列探测器和固定采样策略(隔行隔列跳扫),并利用SPAD阵列产生的并行图像来提升压缩ISM图像的重建质量
  • results: 实现了更高质量的压缩ISM图像重建,同时减少数据采集时间,并有望降低光漂白的影响
    Abstract We present a novel approach to implement compressive sensing in laser scanning microscopes (LSM), specifically in image scanning microscopy (ISM), using a single-photon avalanche diode (SPAD) array detector. Our method addresses two significant limitations in applying compressive sensing to LSM: the time to compute the sampling matrix and the quality of reconstructed images. We employ a fixed sampling strategy, skipping alternate rows and columns during data acquisition, which reduces the number of points scanned by a factor of four and eliminates the need to compute different sampling matrices. By exploiting the parallel images generated by the SPAD array, we improve the quality of the reconstructed compressive-ISM images compared to standard compressive confocal LSM images. Our results demonstrate the effectiveness of our approach in producing higher-quality images with reduced data acquisition time and potential benefits in reducing photobleaching.
    摘要 我们提出了一种在激光扫描显微镜(LSM),特别是图像扫描显微镜(ISM)中,利用单光子雪崩二极管(SPAD)阵列探测器实现压缩感知的新方法。我们的方法解决了将压缩感知应用于LSM的两个主要限制:采样矩阵的计算时间和重建图像的质量。我们采用固定采样策略,在数据采集过程中跳过相间的行和列,从而将扫描点数减少到四分之一,并消除了计算不同采样矩阵的需求。通过利用SPAD阵列生成的并行图像,我们提高了压缩ISM图像的重建质量,优于标准的压缩共聚焦LSM图像。我们的结果表明,该方法可以在减少数据采集时间的同时生成更高质量的图像,并有望降低光漂白。
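The fixed sampling strategy described above is easy to make concrete: skip alternate rows and columns of the scan grid, which keeps one quarter of the points. A small NumPy sketch (the grid size here is only illustrative):

```python
import numpy as np

def fixed_sampling_mask(h, w):
    """Fixed compressive-ISM sampling: keep every other row and column,
    so only 1/4 of the scan positions are acquired (as in the abstract)."""
    mask = np.zeros((h, w), dtype=bool)
    mask[::2, ::2] = True
    return mask

mask = fixed_sampling_mask(256, 256)
print(mask.mean())  # ~0.25: a four-fold reduction in scanned points
```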

What do neural networks learn in image classification? A frequency shortcut perspective

  • paper_url: http://arxiv.org/abs/2307.09829
  • repo_url: https://github.com/nis-research/nn-frequency-shortcuts
  • paper_authors: Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
  • for: 这项研究旨在考察神经网络(NN)在分类任务中表征学习的机制,并扩展对频率捷径(frequency shortcuts)的理解。
  • methods: 研究在合成数据集和自然图像上进行实验,并提出了一种度量类别频率特性的指标和一种识别频率捷径的方法。
  • results: 研究发现,NN在分类任务上倾向于学习简单的解决方案,训练初期首先学习的内容取决于最具区分性的频带,这些频带可以是低频也可以是高频。此外,频率捷径可以基于纹理或形状,取决于哪种方式最能简化目标。最后,研究还验证了频率捷径在分布外数据上的可迁移性。
    Abstract Frequency analysis is useful for understanding the mechanisms of representation learning in neural networks (NNs). Most research in this area focuses on the learning dynamics of NNs for regression tasks, while little for classification. This study empirically investigates the latter and expands the understanding of frequency shortcuts. First, we perform experiments on synthetic datasets, designed to have a bias in different frequency bands. Our results demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequencies. Second, we confirm this phenomenon on natural images. We propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. The results show that frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective. Third, we validate the transferability of frequency shortcuts on out-of-distribution (OOD) test sets. Our results suggest that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation. We recommend that future research should focus on effective training schemes mitigating frequency shortcut learning.
    摘要 频率分析有助于理解神经网络(NN)的表征学习机制。该领域的大多数研究集中在NN的回归任务上,而对分类任务的研究较少。本研究对后者进行了实证研究,并扩展了对频率捷径的理解。我们首先在合成数据集上进行实验,这些数据集被设计为在不同频带上带有偏置。结果表明,NN在分类任务上倾向于寻找简单的解决方案,训练初期首先学习的内容取决于最具区分性的频率特性,这些特性可以是低频也可以是高频。随后,我们在自然图像上验证了这一现象,并提出了一种度量类别频率特性的指标和一种识别频率捷径的方法。结果显示,频率捷径可以基于纹理或形状,取决于哪种方式最能简化目标。最后,我们验证了频率捷径在分布外(OOD)测试集上的可迁移性。结果表明,频率捷径可以在数据集之间迁移,且无法通过更大的模型容量和数据增强完全避免。我们建议未来的研究应关注能够缓解频率捷径学习的有效训练方案。
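One simple way to compute a class-wise frequency characteristic of the kind discussed above is a radial average of the magnitude spectrum over a class's images. The paper's exact metric may differ; this NumPy sketch is only illustrative:

```python
import numpy as np

def radial_frequency_profile(images):
    """Average radial magnitude spectrum of grayscale images (N, H, W);
    a simple proxy for a class-wise frequency characteristic."""
    spectra = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1)))
    mean_spec = spectra.mean(axis=0)
    h, w = mean_spec.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)
    profile = np.bincount(r.ravel(), weights=mean_spec.ravel())
    counts = np.bincount(r.ravel())
    return profile / np.maximum(counts, 1)   # energy per radial frequency bin
```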

Multi-modal Learning based Prediction for Disease

  • paper_url: http://arxiv.org/abs/2307.09823
  • repo_url: https://github.com/batuhankmkaraman/mlbasedad
  • paper_authors: Yaran Chen, Xueyu Chen, Yu Han, Haoran Li, Dongbin Zhao, Jingzhong Li, Xu Wang
  • for: 这项研究旨在开发一种非侵入性的非酒精性脂肪肝(NAFLD)诊断系统,以便预测疾病的发生与进展,并避免侵入性的肝活检。
  • methods: 研究采用多模态学习方法,结合来自临床检查、实验室化验和影像学等多种来源的数据,以及部分参与者的面部图像。
  • results: 研究发现,多模态学习方法可以提高NAFLD诊断的精度;仅使用面部图像作为输入也能取得有竞争力的诊断结果,从而降低肝活检的风险和成本。
    Abstract Non alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease, which can be predicted accurately to prevent advanced fibrosis and cirrhosis. While, a liver biopsy, the gold standard for NAFLD diagnosis, is invasive, expensive, and prone to sampling errors. Therefore, non-invasive studies are extremely promising, yet they are still in their infancy due to the lack of comprehensive research data and intelligent methods for multi-modal data. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a comprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD prediction method (DeepFLD). The dataset includes over 6000 participants physical examinations, laboratory and imaging studies, extensive questionnaires, and facial images of partial participants, which is comprehensive and valuable for clinical studies. From the dataset, we quantitatively analyze and select clinical metadata that most contribute to NAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network model designed to predict NAFLD using multi-modal input, including metadata and facial images, outperforms the approach that only uses metadata. Satisfactory performance is also verified on other unseen datasets. Inspiringly, DeepFLD can achieve competitive results using only facial images as input rather than metadata, paving the way for a more robust and simpler non-invasive NAFLD diagnosis.
    摘要 非酒精性脂肪肝病(NAFLD)是慢性肝病最常见的原因,对其进行准确预测可以防止进展至高度纤维化和肝硬化。然而,作为NAFLD诊断金标准的肝活检具有侵入性、费用高且容易出现采样误差等问题。因此,非侵入性研究极具前景,但由于缺乏全面的研究数据和针对多模态数据的智能方法,目前仍处于起步阶段。本文提出了一种结合全面临床数据集(FLDData)与基于多模态学习的NAFLD预测方法(DeepFLD)的诊断系统(DeepFLDDiag)。该数据集包括6000多名参与者的体格检查、实验室化验和影像学检查、详尽的问卷调查以及部分参与者的面部图像,对临床研究而言内容丰富且极具价值。我们从数据集中定量分析并筛选出对NAFLD预测贡献最大的临床元数据。此外,所提出的DeepFLD是一种利用元数据和面部图像等多模态输入预测NAFLD的深度神经网络模型,其性能优于仅使用元数据的方法,并在其他未见过的数据集上也验证了令人满意的表现。令人鼓舞的是,DeepFLD仅使用面部图像作为输入即可取得有竞争力的结果,这为更加稳健、简便的非侵入性NAFLD诊断开辟了道路。

A Siamese-based Verification System for Open-set Architecture Attribution of Synthetic Images

  • paper_url: http://arxiv.org/abs/2307.09822
  • repo_url: None
  • paper_authors: Lydia Abady, Jun Wang, Benedetta Tondi, Mauro Barni
  • for: This paper aims to address the problem of open-set attribution of synthetic images to the architecture that generated them.
  • methods: The proposed verification framework relies on a Siamese Network to compare the query image with reference images generated by the same architecture.
  • results: The proposed method achieves excellent performance in both closed and open-set settings, with strong generalization capabilities.
    Abstract Despite the wide variety of methods developed for synthetic image attribution, most of them can only attribute images generated by models or architectures included in the training set and do not work with unknown architectures, hindering their applicability in real-world scenarios. In this paper, we propose a verification framework that relies on a Siamese Network to address the problem of open-set attribution of synthetic images to the architecture that generated them. We consider two different settings. In the first setting, the system determines whether two images have been produced by the same generative architecture or not. In the second setting, the system verifies a claim about the architecture used to generate a synthetic image, utilizing one or multiple reference images generated by the claimed architecture. The main strength of the proposed system is its ability to operate in both closed and open-set scenarios so that the input images, either the query and reference images, can belong to the architectures considered during training or not. Experimental evaluations encompassing various generative architectures such as GANs, diffusion models, and transformers, focusing on synthetic face image generation, confirm the excellent performance of our method in both closed and open-set settings, as well as its strong generalization capabilities.
    摘要 尽管目前已有多种用于合成图像归属的方法,但其中大多数只能将图像归属到训练集中包含的模型或架构,无法处理未知架构,这限制了它们在实际场景中的适用性。本文提出了一个基于孪生网络(Siamese Network)的验证框架,用于解决合成图像到生成架构的开放集归属问题。我们考虑了两种不同的设定:在第一种设定中,系统判断两幅图像是否由同一个生成架构生成;在第二种设定中,系统利用一幅或多幅由所声称架构生成的参考图像,来验证关于某幅合成图像生成架构的声明。所提系统的主要优势在于能够同时在闭集和开放集场景下工作,即输入图像(查询图像和参考图像)既可以属于训练时考虑过的架构,也可以不属于。实验评估涵盖了GAN、扩散模型和Transformer等多种生成架构,并以合成人脸图像生成为重点,结果证实了我们的方法在闭集和开放集设定下均表现出色,并具有很强的泛化能力。
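A minimal sketch of the verification idea described above: a shared encoder embeds the query and the reference images generated by the claimed architecture, and the mean cosine similarity is thresholded. The encoder, embedding size, and threshold are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseVerifier(nn.Module):
    """Sketch: embed query and reference images with a shared encoder and
    compare the embeddings to verify an architecture claim."""
    def __init__(self, encoder, dim=256):
        super().__init__()
        self.encoder = encoder            # any image backbone -> (B, dim)
        self.head = nn.Linear(dim, dim)

    def embed(self, x):
        return F.normalize(self.head(self.encoder(x)), dim=-1)

    def same_architecture(self, query, references, threshold=0.5):
        q = self.embed(query)                       # (1, dim)
        r = self.embed(references)                  # (K, dim)
        score = (q @ r.t()).mean()                  # mean cosine similarity
        return score.item() > threshold, score.item()
```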

Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline

  • paper_url: http://arxiv.org/abs/2307.09821
  • repo_url: None
  • paper_authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng
  • for: This paper is the technical report of the authors' solution for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
  • methods: The paper presents a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules.
  • results: The proposed solution achieves first place on the official leaderboard for the listening head generation track.
    Abstract In dyadic speaker-listener interactions, the listener's head reactions along with the speaker's head movements, constitute an important non-verbal semantic expression together. The listener Head generation task aims to synthesize responsive listener's head videos based on audios of the speaker and reference images of the listener. Compared to the Talking-head generation, it is more challenging to capture the correlation clues from the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution by enhancing the hierarchical semantic extraction capability of the audio encoder module and improving the decoder part, renderer and post-processing modules. Our solution gets the first place on the official leaderboard for the track of listening head generation. This paper is a technical report of ViCo@2023 Conversational Head Generation Challenge in ACM Multimedia 2023 conference.
    摘要 在双人说话者-倾听者交互中,倾听者的头部反应与说话者的头部运动共同构成了重要的非语言语义表达。倾听者头部生成任务旨在基于说话者的音频和倾听者的参考图像,合成具有响应性的倾听者头部视频。与说话头生成相比,从说话者的音频和视觉信息中捕捉相关线索更具挑战性。我们在ViCo基线方案的基础上,通过增强音频编码模块的层次语义提取能力,并改进解码器、渲染器和后处理模块,提出了一种高性能解决方案。我们的方案在倾听者头部生成赛道的官方排行榜上名列第一。本文是ACM Multimedia 2023会议ViCo@2023对话头生成挑战赛的技术报告。

Deep unrolling Shrinkage Network for Dynamic MR imaging

  • paper_url: http://arxiv.org/abs/2307.09818
  • repo_url: https://github.com/yhao-z/dus-net
  • paper_authors: Yinghao Zhang, Xiaodi Li, Weihang Li, Yue Hu
  • for: 这篇论文旨在改进动态磁共振(MR)成像的重建模型,特别是利用卷积神经网络(CNN)与软阈值(ST)算子来施加稀疏先验约束。
  • methods: 本文提出了一个新的算子,称为带通道注意力的软阈值算子(AST),可以为每个通道学习不同的阈值。同时,本文提出了一个新的深度展开收缩网络(DUS-Net),通过展开交替方向乘子法(ADMM)来优化变换域$l_1$范数的动态MR重建模型。
  • results: 在公开的动态电影MR数据集上的实验结果显示,所提出的DUS-Net优于现有最先进方法。
    Abstract Deep unrolling networks that utilize sparsity priors have achieved great success in dynamic magnetic resonance (MR) imaging. The convolutional neural network (CNN) is usually utilized to extract the transformed domain, and then the soft thresholding (ST) operator is applied to the CNN-transformed data to enforce the sparsity priors. However, the ST operator is usually constrained to be the same across all channels of the CNN-transformed data. In this paper, we propose a novel operator, called soft thresholding with channel attention (AST), that learns the threshold for each channel. In particular, we put forward a novel deep unrolling shrinkage network (DUS-Net) by unrolling the alternating direction method of multipliers (ADMM) for optimizing the transformed $l_1$ norm dynamic MR reconstruction model. Experimental results on an open-access dynamic cine MR dataset demonstrate that the proposed DUS-Net outperforms the state-of-the-art methods. The source code is available at \url{https://github.com/yhao-z/DUS-Net}.
    摘要 利用稀疏先验的深度展开网络在动态磁共振(MR)成像中取得了很大成功。通常的做法是用卷积神经网络(CNN)提取变换域特征,然后对CNN变换后的数据施加软阈值(ST)算子以实现稀疏约束。然而,ST算子通常被限制为对CNN变换数据的所有通道使用相同的阈值。在这篇论文中,我们提出了一种新的算子,即带通道注意力的软阈值算子(AST),它可以为每个通道学习阈值。具体来说,我们通过展开交替方向乘子法(ADMM)来优化变换域$l_1$范数的动态MR重建模型,提出了一种新的深度展开收缩网络(DUS-Net)。在公开的动态电影MR数据集上的实验结果表明,所提出的DUS-Net优于当前最先进方法。源代码可以在 https://github.com/yhao-z/DUS-Net 上获取。
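A hypothetical sketch of soft thresholding with channel attention (AST): a small squeeze-and-excitation-style branch predicts a per-channel scaling of the mean feature magnitude, which is then used as the shrinkage threshold. The gating branch design is an assumption; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class AdaptiveSoftThreshold(nn.Module):
    """Sketch of channel-wise learned soft thresholding (AST-like)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        # Threshold = per-channel mean magnitude scaled by a learned gate.
        tau = x.abs().mean(dim=(2, 3), keepdim=True) * self.gate(x.abs())
        # Soft thresholding (shrinkage) toward zero, per channel.
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```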

LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network

  • paper_url: http://arxiv.org/abs/2307.09815
  • repo_url: None
  • paper_authors: Hao Yang, Liyuan Pan, Yan Yang, Miaomiao Liu
  • for: 本研究旨在利用对比语言-图像预训练框架(CLIP)从双像素(DP)图像对中估计散焦模糊图。
  • methods: 我们首先精心设计文本提示,使CLIP能够从DP图像对中理解与模糊相关的几何先验知识;然后提出一种无需任何微调即可将立体DP图像对输入CLIP的格式。在得到估计的模糊图后,我们引入模糊先验注意力块、模糊加权损失和模糊感知损失来恢复全聚焦图像。
  • results: 我们在大量实验中取得了最先进的表现。
    Abstract Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent blur is a challenging task.~Existing blur map-based deblurring methods have demonstrated promising results. In this paper, we propose, to the best of our knowledge, the first framework to introduce the contrastive language-image pre-training framework (CLIP) to achieve accurate blur map estimation from DP pairs unsupervisedly. To this end, we first carefully design text prompts to enable CLIP to understand blur-related geometric prior knowledge from the DP pair. Then, we propose a format to input stereo DP pair to the CLIP without any fine-tuning, where the CLIP is pre-trained on monocular images. Given the estimated blur map, we introduce a blur-prior attention block, a blur-weighting loss and a blur-aware loss to recover the all-in-focus image. Our method achieves state-of-the-art performance in extensive experiments.
    摘要 从具有视差相关模糊的双像素(DP)图像对中恢复清晰图像是一项具有挑战性的任务。现有基于模糊图的去模糊方法已经展现出不错的效果。在本文中,据我们所知,我们首次引入对比语言-图像预训练框架(CLIP),以无监督方式从DP图像对中实现准确的模糊图估计。为此,我们首先精心设计文本提示,使CLIP能够从DP图像对中理解与模糊相关的几何先验知识;随后,我们提出一种无需任何微调即可将立体DP图像对输入CLIP的格式,而CLIP本身是在单目图像上预训练的。在得到估计的模糊图后,我们引入模糊先验注意力块、模糊加权损失和模糊感知损失来恢复全聚焦图像。我们的方法在大量实验中达到了最先进的性能。

GenKL: An Iterative Framework for Resolving Label Ambiguity and Label Non-conformity in Web Images Via a New Generalized KL Divergence

  • paper_url: http://arxiv.org/abs/2307.09810
  • repo_url: https://github.com/codetopaper/genkl
  • paper_authors: Xia Huang, Kai Fong Ernest Chong
  • for: 这篇论文旨在解决网络图像数据集中存在的非合规(NC)实例问题,包括歧义的分布内(ID)实例和分布外(OOD)实例。
  • methods: 论文提出了一种新的迭代训练框架GenKL,用于识别并重新标注NC实例。GenKL使用一种新的$(\alpha,\beta)$-广义KL散度来识别NC实例。
  • results: 在三个网络图像数据集(Clothing1M、Food101/Food101N和mini WebVision 1.0)上,这一新的迭代训练框架GenKL取得了新的最先进分类精度:分别为81.34%、85.73%和78.99%/92.54%(top-1/top-5)。
    Abstract Web image datasets curated online inherently contain ambiguous in-distribution (ID) instances and out-of-distribution (OOD) instances, which we collectively call non-conforming (NC) instances. In many recent approaches for mitigating the negative effects of NC instances, the core implicit assumption is that the NC instances can be found via entropy maximization. For "entropy" to be well-defined, we are interpreting the output prediction vector of an instance as the parameter vector of a multinomial random variable, with respect to some trained model with a softmax output layer. Hence, entropy maximization is based on the idealized assumption that NC instances have predictions that are "almost" uniformly distributed. However, in real-world web image datasets, there are numerous NC instances whose predictions are far from being uniformly distributed. To tackle the limitation of entropy maximization, we propose $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$, which can be used to identify significantly more NC instances. Theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ are proven, and we also show empirically that a simple use of $\mathcal{D}_{\text{KL}}^{\alpha, \beta}(p\|q)$ outperforms all baselines on the NC instance identification task. Building upon $(\alpha,\beta)$-generalized KL divergence, we also introduce a new iterative training framework, GenKL, that identifies and relabels NC instances. When evaluated on three web image datasets, Clothing1M, Food101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art classification accuracies: $81.34\%$, $85.73\%$ and $78.99\%$/$92.54\%$ (top-1/top-5), respectively.
    摘要 translate("Web 图像集合在线批处理中存在歧义的内部(ID)实例和外部(OOD)实例,我们一起称之为非合规(NC)实例。在许多最近的NC实例缓解方法中,核心隐含假设是通过熵 maximization来发现NC实例。")以下是文本的简化中文翻译:在线批处理的网络图像集合中存在ID实例和OOD实例,我们称之为非合规(NC)实例。许多最近的NC实例缓解方法中,核心隐含假设是通过熵 maximization来发现NC实例。 translate("在实际情况下,NC实例的预测结果并不是很接近均匀分布的,这限制了熵 maximization的应用。为了解决这一问题,我们提出了(α, β)总体KL异同(D),可以更好地发现NC实例。")以下是文本的简化中文翻译:在实际情况下,NC实例的预测结果并不是很接近均匀分布的,这限制了熵 maximization的应用。为了解决这一问题,我们提出了(α, β)总体KL异同(D),可以更好地发现NC实例。 translate("我们证明了(α, β)总体KL异同的性质,并且通过实验表明,使用(α, β)总体KL异同可以比基础方法更好地发现NC实例。")以下是文本的简化中文翻译:我们证明了(α, β)总体KL异同的性质,并且通过实验表明,使用(α, β)总体KL异同可以比基础方法更好地发现NC实例。 translate("基于(α, β)总体KL异同,我们还提出了一种新的迭代培训框架,GenKL,可以同时标注和重新标注NC实例。")以下是文本的简化中文翻译:基于(α, β)总体KL异同,我们还提出了一种新的迭代培训框架,GenKL,可以同时标注和重新标注NC实例。 translate("我们在Clothing1M、Food101/Food101N和mini WebVision 1.0三个网络图像集合上评估了GenKL框架,并取得了新的状态级别分类精度:81.34%、85.73%和78.99%/92.54% (top-1/top-5)。")以下是文本的简化中文翻译:我们在Clothing1M、Food101/Food101N和mini WebVision 1.0三个网络图像集合上评估了GenKL框架,并取得了新的状态级别分类精度:81.34%、85.73%和78.99%/92.54% (top-1/top-5)。

Fix your downsampling ASAP! Be natively more robust via Aliasing and Spectral Artifact free Pooling

  • paper_url: http://arxiv.org/abs/2307.09804
  • repo_url: None
  • paper_authors: Julia Grabinski, Janis Keuper, Margret Keuper
  • for: 提高卷积神经网络对常见损坏(common corruptions)和对抗攻击的鲁棒性
  • methods: 提出无混叠且无频谱伪影的池化方法(ASAP),在FLC池化的基础上加以修改,以减少频谱泄漏伪影
  • results: 采用ASAP作为下采样方法的网络对常见损坏和对抗攻击表现出更高的原生鲁棒性,同时保持相近的干净数据精度,甚至超越基线。
    Abstract Convolutional neural networks encode images through a sequence of convolutions, normalizations and non-linearities as well as downsampling operations into potentially strong semantic embeddings. Yet, previous work showed that even slight mistakes during sampling, leading to aliasing, can be directly attributed to the networks' lack in robustness. To address such issues and facilitate simpler and faster adversarial training, [12] recently proposed FLC pooling, a method for provably alias-free downsampling - in theory. In this work, we conduct a further analysis through the lens of signal processing and find that such current pooling methods, which address aliasing in the frequency domain, are still prone to spectral leakage artifacts. Hence, we propose aliasing and spectral artifact-free pooling, short ASAP. While only introducing a few modifications to FLC pooling, networks using ASAP as downsampling method exhibit higher native robustness against common corruptions, a property that FLC pooling was missing. ASAP also increases native robustness against adversarial attacks on high and low resolution data while maintaining similar clean accuracy or even outperforming the baseline.
    摘要 卷积神经网络通过一系列卷积、归一化、非线性运算以及下采样操作,将图像编码为潜在的强语义嵌入。然而,先前的工作表明,即使采样过程中的细微失误所导致的混叠(aliasing),也可以直接归因于网络鲁棒性的欠缺。为了解决这些问题并促进更简单、更快速的对抗训练,[12]最近提出了FLC池化,一种在理论上可证明无混叠的下采样方法。在这项工作中,我们从信号处理的视角进行了进一步分析,发现当前这些在频域中处理混叠的池化方法,仍然容易产生频谱泄漏伪影。因此,我们提出了无混叠且无频谱伪影的池化方法,简称ASAP。尽管ASAP只对FLC池化做了少量修改,但采用ASAP作为下采样方法的网络对常见损坏表现出更高的原生鲁棒性,而这正是FLC池化所欠缺的特性。ASAP还在高分辨率和低分辨率数据上提高了对对抗攻击的原生鲁棒性,同时保持相近的干净数据精度,甚至超越基线。
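The frequency-domain downsampling that FLC pooling builds on can be sketched as a center crop of the (shifted) spectrum followed by an inverse FFT; ASAP additionally targets spectral leakage (e.g. via windowing), whose exact formulation is in the paper. A NumPy illustration under these assumptions:

```python
import numpy as np

def freq_lowpass_downsample(x, factor=2):
    """Sketch of frequency-domain downsampling in the spirit of FLC pooling:
    keep only the central (low-frequency) block of the spectrum and invert.
    Not the official ASAP implementation."""
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))
    nh, nw = h // factor, w // factor
    top, left = (h - nh) // 2, (w - nw) // 2
    cropped = spec[top:top + nh, left:left + nw]
    # Divide by factor**2 to keep the mean intensity unchanged.
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped))) / (factor ** 2)
```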

From West to East: Who can understand the music of the others better?

  • paper_url: http://arxiv.org/abs/2307.09795
  • repo_url: https://github.com/pxaris/ccml
  • paper_authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
  • for: 本研究的目的是了解不同文化和风格的音乐表示,以及是否可以使用现有的深度学习模型来学习这些表示。
  • methods: 本研究使用了转移学习方法,将西方流行音乐和传统/民间音乐数据转移到不同文化和风格的音乐数据上,以 derive 音乐表示之间的相似性。
  • results: 实验结果显示,通过转移学习,可以在不同文化和风格的音乐数据上达到竞争性的表示性能,而最佳来源数据集的选择则因每个音乐文化而异。
    Abstract Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belongs to. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based and a Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.
    摘要 MIR领域的最新进展催生了多个基准深度学习模型,其嵌入可用于多种下游任务。然而,这些模型绝大多数是在西方流行/摇滚音乐及相关风格上训练的。这引出了两个研究问题:这些模型能否用于学习不同音乐文化和风格的表征,以及能否基于来自不同文化或风格的数据构建类似的音乐音频嵌入模型。为此,我们利用迁移学习方法,探究数据所属的不同音乐文化之间的相似性。我们使用了两个西方音乐数据集、两个来自东地中海文化的传统/民间音乐数据集,以及两个印度艺术音乐数据集。我们训练并跨领域迁移了三个深度音频嵌入模型(两个基于CNN、一个基于Transformer的架构),对每个目标领域数据集执行自动标注任务。实验结果表明,通过迁移学习可以在所有领域取得有竞争力的性能,而最佳来源数据集则因音乐文化而异。实现代码和训练好的模型均已在公共仓库中提供。

DiffDP: Radiotherapy Dose Prediction via a Diffusion Model

  • paper_url: http://arxiv.org/abs/2307.09794
  • repo_url: https://github.com/scufzh/DiffDP
  • paper_authors: Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang
  • for: 这项研究旨在提高放射治疗计划中剂量分布自动预测的精度和效率,并缓解现有方法存在的过度平滑问题。
  • methods: 本研究提出了一种基于扩散模型的剂量预测模型(DiffDP),包含前向过程和反向过程。在前向过程中,DiffDP通过逐步添加少量噪声,将剂量分布图逐渐转化为高斯噪声,并训练一个噪声预测器来预测每个时间步所添加的噪声;在反向过程中,DiffDP利用训练好的噪声预测器,从高斯噪声出发分多步去除噪声,最终输出预测的剂量分布图。此外,还设计了结构编码器,从患者解剖图像中提取解剖信息,使噪声预测器能够感知计划靶区和危及器官内的剂量约束。
  • results: 实验结果显示,DiffDP模型能够提升放射治疗计划中剂量分布预测的精度和效率。
    Abstract Currently, deep learning (DL) has achieved the automatic prediction of dose distribution in radiotherapy planning, enhancing its efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L_1 or L_2 loss with posterior average calculations. To alleviate this limitation, we innovatively introduce a diffusion-based dose prediction (DiffDP) model for predicting the radiotherapy dose distribution of cancer patients. Specifically, the DiffDP model contains a forward process and a reverse process. In the forward process, DiffDP gradually transforms dose distribution maps into Gaussian noise by adding small noise and trains a noise predictor to predict the noise added in each timestep. In the reverse process, it removes the noise from the original Gaussian noise in multiple steps with the well-trained noise predictor and finally outputs the predicted dose distribution map. To ensure the accuracy of the prediction, we further design a structure encoder to extract anatomical information from patient anatomy images and enable the noise predictor to be aware of the dose constraints within several essential organs, i.e., the planning target volume and organs at risk. Extensive experiments on an in-house dataset with 130 rectum cancer patients demonstrate the superiority of the DiffDP model in accuracy and efficiency compared to existing methods.
    摘要 The DiffDP model consists of a forward process and a reverse process. In the forward process, DiffDP transforms dose distribution maps into Gaussian noise by adding small noise and trains a noise predictor to predict the added noise in each timestep. In the reverse process, it removes noise from original Gaussian noise in multiple steps with the well-trained noise predictor and outputs the predicted dose distribution map.To ensure accuracy, we designed a structure encoder to extract anatomical information from patient anatomy images, enabling the noise predictor to be aware of dose constraints within essential organs, such as the planning target volume and organs at risk.Extensive experiments on an in-house dataset with 130 rectum cancer patients demonstrate the superiority of the DiffDP model in accuracy and efficiency compared to existing methods.
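A minimal sketch of the forward process and training objective described in the DiffDP entry above, written in the standard DDPM parameterization; the noise schedule and the conditioning interface of the noise predictor are assumptions, not the authors' settings.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(dose_map, t, noise):
    """Forward process: gradually turn a dose distribution map into
    Gaussian noise (standard DDPM formulation)."""
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * dose_map + s * noise

def training_step(noise_predictor, dose_map, anatomy_cond):
    """Train the noise predictor to recover the noise added at step t,
    conditioned on anatomy features (conditioning interface assumed)."""
    t = torch.randint(0, T, (dose_map.shape[0],))
    noise = torch.randn_like(dose_map)
    noisy = q_sample(dose_map, t, noise)
    pred = noise_predictor(noisy, t, anatomy_cond)
    return torch.mean((pred - noise) ** 2)
```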

Density-invariant Features for Distant Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.09788
  • repo_url: https://github.com/liuquan98/gcl
  • paper_authors: Quan Liu, Hongzi Zhu, Yunsong Zhou, Hongyang Li, Shan Chang, Minyi Guo
  • for: 远距离室外LiDAR点云配准是扩展协同自动驾驶车辆3D视野的重要环节。
  • methods: 提出了分组对比学习(Group-wise Contrastive Learning, GCL)方法,用于提取不受点云密度影响的几何特征,以配准远距离室外LiDAR点云。
  • results: 理论分析和实验验证表明,GCL方法能够提高远距离场景下的配准召回率,在KITTI和nuScenes基准上分别提升40.9%和26.9%。
    Abstract Registration of distant outdoor LiDAR point clouds is crucial to extending the 3D vision of collaborative autonomous vehicles, and yet is challenging due to small overlapping area and a huge disparity between observed point densities. In this paper, we propose Group-wise Contrastive Learning (GCL) scheme to extract density-invariant geometric features to register distant outdoor LiDAR point clouds. We mark through theoretical analysis and experiments that, contrastive positives should be independent and identically distributed (i.i.d.), in order to train densityinvariant feature extractors. We propose upon the conclusion a simple yet effective training scheme to force the feature of multiple point clouds in the same spatial location (referred to as positive groups) to be similar, which naturally avoids the sampling bias introduced by a pair of point clouds to conform with the i.i.d. principle. The resulting fully-convolutional feature extractor is more powerful and density-invariant than state-of-the-art methods, improving the registration recall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and 26.9%, respectively. Code is available at https://github.com/liuQuan98/GCL.
    摘要 远距离室外LiDAR点云的配准对扩展协同自动驾驶车辆的3D视野至关重要,但由于重叠区域小且观测点密度差异巨大,这一任务极具挑战性。在本文中,我们提出了分组对比学习(GCL)方案,用于提取密度不变的几何特征来配准远距离室外LiDAR点云。我们通过理论分析和实验指出,对比学习中的正样本应当独立同分布(i.i.d.),才能训练出密度不变的特征提取器。基于这一结论,我们提出了一种简单而有效的训练方案,使位于同一空间位置的多个点云的特征(称为正样本组)彼此相似,从而自然地避免了成对点云采样所引入的偏差,满足i.i.d.原则。由此得到的全卷积特征提取器比现有最先进方法更强大且密度不变,在KITTI和nuScenes基准上分别将远距离场景的配准召回率提高了40.9%和26.9%。代码可以在 https://github.com/liuQuan98/GCL 上获取。
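One plausible reading of the group-wise contrastive idea above, sketched in PyTorch: features of the same spatial location observed in several point clouds form a positive group, and each group member is classified against the centroids of all groups. This is illustrative only, not the paper's official loss.

```python
import torch
import torch.nn.functional as F

def group_wise_contrastive_loss(groups, temperature=0.1):
    """`groups` is (G, M, D): features of G spatial locations, each observed
    in M different point clouds (a positive group). Members are pulled toward
    their own group's centroid and pushed away from other groups'."""
    centroids = F.normalize(groups.mean(dim=1), dim=-1)            # (G, D)
    feats = F.normalize(groups, dim=-1)                            # (G, M, D)
    logits = torch.einsum('gmd,kd->gmk', feats, centroids) / temperature
    labels = torch.arange(groups.shape[0], device=groups.device)
    labels = labels[:, None].expand(-1, groups.shape[1])           # (G, M)
    return F.cross_entropy(logits.reshape(-1, groups.shape[0]),
                           labels.reshape(-1))
```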

DVPT: Dynamic Visual Prompt Tuning of Large Pre-trained Models for Medical Image Analysis

  • paper_url: http://arxiv.org/abs/2307.09787
  • repo_url: None
  • paper_authors: Along He, Kai Wang, Zhihong Wang, Tao Li, Huazhu Fu
  • for: 针对医疗领域标注数据有限的问题,提出了一种参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,以提升大型预训练模型的适应性和效率。
  • methods: 提出了一种名为动态视觉提示微调(Dynamic Visual Prompt Tuning, DVPT)的方法,将冻结特征经过轻量级瓶颈层变换,以学习医疗下游任务特定的分布,然后使用少量可学习的视觉提示与变换后的特征进行交叉注意力,以获取适合每个样本的知识。
  • results: 在医疗分类和分割任务上的大量实验表明,DVPT不仅能够高效地将预训练模型适配到医疗领域,还能在标注数据有限的情况下提高数据效率。例如,在仅增加0.5%可训练参数的情况下,DVPT在医疗分类任务上的Kappa分数比完全微调高出2.20%以上,并且最多可节省60%的标注数据和99%的ViT-B/16存储成本。
    Abstract Limited labeled data makes it hard to train models from scratch in medical domain, and an important paradigm is pre-training and then fine-tuning. Large pre-trained models contain rich representations, which can be adapted to downstream medical tasks. However, existing methods either tune all the parameters or the task-specific layers of the pre-trained models, ignoring the input variations of medical images, and thus they are not efficient or effective. In this work, we aim to study parameter-efficient fine-tuning (PEFT) for medical image analysis, and propose a dynamic visual prompt tuning method, named DVPT. It can extract knowledge beneficial to downstream tasks from large models with a few trainable parameters. Firstly, the frozen features are transformed by an lightweight bottleneck layer to learn the domain-specific distribution of downstream medical tasks, and then a few learnable visual prompts are used as dynamic queries and then conduct cross-attention with the transformed features, attempting to acquire sample-specific knowledge that are suitable for each sample. Finally, the features are projected to original feature dimension and aggregated with the frozen features. This DVPT module can be shared between different Transformer layers, further reducing the trainable parameters. To validate DVPT, we conduct extensive experiments with different pre-trained models on medical classification and segmentation tasks. We find such PEFT method can not only efficiently adapt the pre-trained models to the medical domain, but also brings data efficiency with partial labeled data. For example, with 0.5\% extra trainable parameters, our method not only outperforms state-of-the-art PEFT methods, even surpasses the full fine-tuning by more than 2.20\% Kappa score on medical classification task. It can saves up to 60\% labeled data and 99\% storage cost of ViT-B/16.
    摘要 由于医疗领域标注数据有限,从头训练模型十分困难,因此一种重要的范式是先预训练再微调。大型预训练模型含有丰富的表征,可以适配到下游医疗任务。然而,现有方法要么微调全部参数,要么只微调预训练模型中任务特定的层,忽略了医疗图像的输入差异,因而既不高效也不够有效。在这项工作中,我们研究面向医疗图像分析的参数高效微调(PEFT),并提出了动态视觉提示微调方法(DVPT)。它只需少量可训练参数,即可从大模型中提取对下游任务有益的知识。首先,冻结特征经过轻量级瓶颈层变换,以学习下游医疗任务的域特定分布;然后,少量可学习的视觉提示作为动态查询,与变换后的特征进行交叉注意力,以获取适合每个样本的知识;最后,特征被投影回原始特征维度并与冻结特征聚合。该DVPT模块可以在不同的Transformer层之间共享,进一步减少可训练参数。为验证DVPT,我们在医疗分类和分割任务上使用不同的预训练模型进行了大量实验。我们发现这种PEFT方法不仅能高效地将预训练模型适配到医疗领域,还能在部分标注数据下带来数据效率。例如,在仅增加0.5%可训练参数的情况下,我们的方法不仅优于最先进的PEFT方法,在医疗分类任务上的Kappa分数甚至比完全微调高出2.20%以上,并且最多可节省60%的标注数据和99%的ViT-B/16存储成本。
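A hypothetical sketch of the DVPT module described above: frozen tokens pass through a lightweight bottleneck, a few learnable prompts query them via cross-attention, and the result is projected back and aggregated with the frozen features. All dimensions and the aggregation rule are assumptions.

```python
import torch
import torch.nn as nn

class DynamicVisualPrompt(nn.Module):
    """Sketch of a DVPT-style adapter for a frozen Transformer backbone."""
    def __init__(self, dim=768, bottleneck=64, num_prompts=8, heads=4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.prompts = nn.Parameter(torch.randn(num_prompts, bottleneck) * 0.02)
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, frozen_tokens):                   # (B, N, dim)
        z = self.down(frozen_tokens)                    # domain-specific bottleneck
        q = self.prompts.unsqueeze(0).expand(z.shape[0], -1, -1)
        ctx, _ = self.attn(q, z, z)                     # sample-specific knowledge
        delta = self.up(ctx).mean(dim=1, keepdim=True)  # (B, 1, dim)
        return frozen_tokens + delta                    # aggregate with frozen feats
```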

Source-Free Domain Adaptation for Medical Image Segmentation via Prototype-Anchored Feature Alignment and Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.09769
  • repo_url: https://github.com/cscyqj/miccai23-protocontra-sfda
  • paper_authors: Qinji Yu, Nan Xi, Junsong Yuan, Ziyu Zhou, Kang Dang, Xiaowei Ding
  • for: 这篇论文旨在解决医疗领域中源域数据不可获取的问题,在无需同时访问源域和目标域数据的情况下,实现医疗图像分割的域适应。
  • methods: 论文提出了一个新颖的两阶段无源域适应(SFDA)框架,仅需一个训练好的源分割模型和无标注的目标数据即可完成域适应。具体而言,在原型锚定特征对齐阶段,我们首先将预训练的像素级分类器权重作为源原型,然后引入双向传输,通过最小化期望代价将目标特征与类别原型对齐。此外,还设计了对比学习阶段,利用预测不可靠的像素,使目标特征分布更加紧凑。
  • results: 实验结果显示,在大域差设定下,我们的方法优于现有的SFDA方法,甚至优于部分UDA方法。代码可以在 https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA 上获取。
    Abstract Unsupervised domain adaptation (UDA) has increasingly gained interests for its capacity to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. However, typical UDA methods require concurrent access to both the source and target domain data, which largely limits its application in medical scenarios where source data is often unavailable due to privacy concern. To tackle the source data-absent problem, we present a novel two-stage source-free domain adaptation (SFDA) framework for medical image segmentation, where only a well-trained source segmentation model and unlabeled target data are available during domain adaptation. Specifically, in the prototype-anchored feature alignment stage, we first utilize the weights of the pre-trained pixel-wise classifier as source prototypes, which preserve the information of source features. Then, we introduce the bi-directional transport to align the target features with class prototypes by minimizing its expected cost. On top of that, a contrastive learning stage is further devised to utilize those pixels with unreliable predictions for a more compact target feature distribution. Extensive experiments on a cross-modality medical segmentation task demonstrate the superiority of our method in large domain discrepancy settings compared with the state-of-the-art SFDA approaches and even some UDA methods. Code is available at https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA.
    摘要 无监督域适应(UDA)因其能够将从有标注源域学到的知识迁移到无标注目标域而日益受到关注。然而,典型的UDA方法需要同时访问源域和目标域数据,这极大地限制了其在医疗场景中的应用,因为源数据往往出于隐私考虑而不可获取。为解决源数据缺失的问题,我们提出了一个新颖的两阶段无源域适应(SFDA)框架用于医疗图像分割,在域适应过程中仅需一个训练好的源分割模型和无标注的目标数据。具体来说,在原型锚定特征对齐阶段,我们首先利用预训练的像素级分类器权重作为源原型,以保留源特征的信息;然后引入双向传输,通过最小化期望代价将目标特征与类别原型对齐。在此基础上,我们进一步设计了对比学习阶段,利用那些预测不可靠的像素,使目标特征分布更加紧凑。在跨模态医疗分割任务上的大量实验表明,在大域差设定下,我们的方法优于现有最先进的SFDA方法,甚至优于部分UDA方法。代码可以在 https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA 找到。
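A minimal sketch of the prototype-anchored alignment idea above: the frozen source classifier's weight rows serve as class prototypes, target features are softly assigned to them, and the expected assignment cost is minimized. The paper's bi-directional transport formulation is more involved; this is only illustrative.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(target_feats, classifier_weight, temperature=0.1):
    """target_feats: (N, D) pixel features from the target domain.
    classifier_weight: (C, D) frozen source classifier weights (prototypes)."""
    protos = F.normalize(classifier_weight, dim=-1)        # (C, D)
    feats = F.normalize(target_feats, dim=-1)              # (N, D)
    sim = feats @ protos.t()                               # cosine similarity
    assign = F.softmax(sim / temperature, dim=-1)          # soft assignment
    cost = 1.0 - sim                                       # cosine distance
    return (assign * cost).sum(dim=-1).mean()              # expected cost
```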

Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

  • paper_url: http://arxiv.org/abs/2307.09758
  • repo_url: https://github.com/aehrc/cxrmate
  • paper_authors: Aaron Nicolson, Jason Dowling, Bevan Koopman
  • for: 提高放射科医生的工作效率和报告准确性,并改善患者护理
  • methods: 利用患者既往CXR检查的纵向病史信息,并引入基于CXR-BERT的强化学习奖励,以提升CXR报告自动生成的质量
  • results: 所提方法优于先前的方法,与放射科医生的评估更为一致,并为临床转化提供了更好的途径
    Abstract The current burnout rate of radiologists is high due to the large and ever growing number of Chest X-Rays (CXRs) needing interpretation and reporting. Promisingly, automatic CXR report generation has the potential to aid radiologists with this laborious task and improve patient care. Previous CXR report generation methods are limited by their diagnostic inaccuracy and their lack of alignment with the workflow of radiologists. To address these issues, we present a new method that utilises the longitudinal history available from a patient's previous CXR study when generating a report, which imitates a radiologist's workflow. We also propose a new reward for reinforcement learning based on CXR-BERT -- which captures the clinical semantic similarity between reports -- to further improve CXR report generation. We conduct experiments on the publicly available MIMIC-CXR dataset with metrics more closely correlated with radiologists' assessment of reporting. The results indicate capturing a patient's longitudinal history improves CXR report generation and that CXR-BERT is a promising alternative to the current state-of-the-art reward. Our approach generates radiology reports that are quantitatively more aligned with those of radiologists than previous methods while simultaneously offering a better pathway to clinical translation. Our Hugging Face checkpoint (https://huggingface.co/aehrc/cxrmate) and code (https://github.com/aehrc/cxrmate) are publicly available.
    摘要 由于需要判读和报告的胸部X光片(CXR)数量庞大且持续增长,放射科医生目前的职业倦怠率居高不下。值得期待的是,CXR报告自动生成有潜力协助放射科医生完成这项繁重的任务并改善患者护理。以往的CXR报告生成方法受限于诊断准确性不足,以及与放射科医生工作流程不相符。为解决这些问题,我们提出了一种新方法,在生成报告时利用患者既往CXR检查的纵向病史信息,从而模仿放射科医生的工作流程。我们还提出了一种基于CXR-BERT的强化学习新奖励,该奖励能够捕捉报告之间的临床语义相似性,以进一步提升CXR报告生成质量。我们在公开的MIMIC-CXR数据集上,采用与放射科医生评估相关性更高的指标进行实验。结果表明,利用患者的纵向病史可以改进CXR报告生成,且CXR-BERT是当前最先进奖励的一种有前景的替代方案。与以往方法相比,我们的方法生成的放射学报告在定量上与放射科医生的报告更加一致,同时为临床转化提供了更好的途径。我们的Hugging Face检查点(https://huggingface.co/aehrc/cxrmate)和代码(https://github.com/aehrc/cxrmate)均已公开。

Generative Prompt Model for Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2307.09756
  • repo_url: https://github.com/callsys/genpromp
  • paper_authors: Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan
  • for: 本研究旨在解决弱监督目标定位(WSOL)问题,提出一种生成式提示模型(GenPromp),以便学习那些具有代表性但判别性较弱的物体部分。
  • methods: GenPromp将图像类别标签转换为可学习的提示嵌入,输入生成模型以条件地恢复带噪输入图像;在推理阶段,再将代表性嵌入与从现成视觉-语言模型查询得到的判别性嵌入相结合,生成多尺度高质量注意力图。
  • results: 在CUB-200-2011和ILSVRC上,GenPromp分别以5.2%和5.6%(Top-1 Loc)超越最佳判别式模型,为基于生成模型的WSOL建立了坚实的基线。代码可以从 https://github.com/callsys/GenPromp 获取。
    Abstract Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, enPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp.
    摘要 当仅从图像类别标签学习目标定位模型时,弱监督目标定位(WSOL)仍然是一个挑战。传统的判别式训练激活模型的方法会忽略那些具有代表性但判别性较弱的物体部分。在本研究中,我们提出了生成式提示模型(GenPromp),将WSOL表述为条件图像去噪过程,定义了首个用于定位判别性较弱物体部分的生成式流程。在训练阶段,GenPromp将图像类别标签转换为可学习的提示嵌入,输入生成模型以条件地恢复带噪输入图像,从而学习代表性嵌入。在推理阶段,GenPromp将代表性嵌入与从现成视觉-语言模型查询得到的判别性嵌入相结合,兼顾代表性与判别性。组合后的嵌入最终用于生成多尺度高质量注意力图,便于定位物体的完整范围。在CUB-200-2011和ILSVRC上的实验表明,GenPromp分别以5.2%和5.6%(Top-1 Loc)超越最佳判别式模型,为基于生成模型的WSOL建立了坚实的基线。代码可以在 https://github.com/callsys/GenPromp 找到。

Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

  • paper_url: http://arxiv.org/abs/2307.09749
  • repo_url: https://github.com/csguoh/lemma
  • paper_authors: Hang Guo, Tao Dai, Guanghao Meng, Shu-Tao Xia
  • for: 提升场景文本图像超分辨率(STISR)质量并提高下游场景文本识别精度,解决复杂背景干扰问题
  • methods: 提出一种新方法LEMMA,显式地建模字符区域,以生成面向超分辨率的高层文本特定引导
  • results: 在TextZoom和四个场景文本识别基准上优于其他最先进方法
    Abstract Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at https://github.com/csguoh/LEMMA.
    摘要 场景文本图像超分辨率(STISR)旨在提升图像质量的同时提高下游场景文本识别精度,近来取得了很大成功。然而,大多数现有方法在前向过程中对前景(字符区域)和背景(非字符区域)同等对待,忽视了复杂背景带来的干扰,从而限制了性能。为解决这些问题,本文提出了一种新方法LEMMA,显式建模字符区域,以生成面向超分辨率的高层文本特定引导。为了有效建模字符位置,我们提出了位置增强模块,基于注意力图序列提取字符区域特征。此外,我们提出了多模态对齐模块,执行双向的视觉-语义对齐以生成高质量的先验引导,再通过所提出的自适应融合模块,以自适应方式将其融入超分辨率分支。在TextZoom和四个场景文本识别基准上的实验证明了我们方法优于其他最先进方法。代码可以在 https://github.com/csguoh/LEMMA 下载。

Watch out Venomous Snake Species: A Solution to SnakeCLEF2023

  • paper_url: http://arxiv.org/abs/2307.09748
  • repo_url: https://github.com/xiaoxsparraw/clef2023
  • paper_authors: Feiran Hu, Peng Wang, Yangyang Li, Chenlong Duan, Zijian Zhu, Fei Wang, Faen Zhang, Yong Li, Xiu-Shen Wei
  • for: 这项研究旨在通过分析图像及其附带的元数据,开发先进的蛇类物种识别算法。
  • methods: 本研究使用现代CNN模型和强数据增强来学习更好的图像表征,并采用seesaw损失函数来缓解长尾分布带来的挑战。在后处理阶段,我们设计了一个轻量级模型,利用从CLIP提取的元数据特征计算先验概率,并将毒蛇类别标签赋予部分模型不确定的样本。
  • results: 本研究在私有排行榜上取得了91.31%的最终综合指标(F1与其他指标的组合),在所有参赛者中排名第一。代码可以在 https://github.com/xiaoxsparraw/CLEF2023 获取。
    Abstract The SnakeCLEF2023 competition aims to the development of advanced algorithms for snake species identification through the analysis of images and accompanying metadata. This paper presents a method leveraging utilization of both images and metadata. Modern CNN models and strong data augmentation are utilized to learn better representation of images. To relieve the challenge of long-tailed distribution, seesaw loss is utilized in our method. We also design a light model to calculate prior probabilities using metadata features extracted from CLIP in post processing stage. Besides, we attach more importance to venomous species by assigning venomous species labels to some examples that model is uncertain about. Our method achieves 91.31% score of the final metric combined of F1 and other metrics on private leaderboard, which is the 1st place among the participators. The code is available at https://github.com/xiaoxsparraw/CLEF2023.
    摘要 SnakeCLEF2023竞赛旨在通过分析图像及其附带的元数据,开发先进的蛇类物种识别算法。本文提出了一种同时利用图像和元数据的方法:使用现代卷积神经网络(CNN)模型和强数据增强来学习图像的更好表征;为缓解长尾分布的挑战,我们的方法采用了seesaw损失;此外,我们还设计了一个轻量级模型,在后处理阶段利用从CLIP提取的元数据特征计算先验概率;同时,我们更加重视毒蛇物种,对模型不确定的部分样本赋予毒蛇类别标签。我们的方法在私有排行榜上取得了91.31%的最终综合指标(F1与其他指标的组合),在所有参赛者中排名第一。代码可以在 GitHub 上找到:https://github.com/xiaoxsparraw/CLEF2023。

CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.10316
  • repo_url: https://github.com/lizhaoliu-Lec/CPCM
  • paper_authors: Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan
  • for: 这项研究旨在探讨弱监督点云语义分割问题,即在仅有极稀疏标注(如少于0.1%的点被标注)的情况下进行场景理解。
  • methods: 我们提出了一种简单而有效的上下文点云建模(CPCM)方法,包括两个部分:RegionMask掩码策略和上下文掩码训练(CMT)方法。
  • results: 在ScanNet V2和S3DIS基准上的实验结果显示,CPCM方法优于现有最先进方法,为弱监督点云语义分割带来了更高的性能。
    Abstract We study the task of weakly-supervised point cloud semantic segmentation with sparse annotations (e.g., less than 0.1% points are labeled), aiming to reduce the expensive cost of dense annotations. Unfortunately, with extremely sparse annotated points, it is very difficult to extract both contextual and object information for scene understanding such as semantic segmentation. Motivated by masked modeling (e.g., MAE) in image and video representation learning, we seek to endow the power of masked modeling to learn contextual information from sparsely-annotated points. However, directly applying MAE to 3D point clouds with sparse annotations may fail to work. First, it is nontrivial to effectively mask out the informative visual context from 3D point clouds. Second, how to fully exploit the sparse annotations for context modeling remains an open question. In this paper, we propose a simple yet effective Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method. Specifically, RegionMask masks the point cloud continuously in geometric space to construct a meaningful masked prediction task for subsequent context learning. CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively. Extensive experiments on the widely-tested ScanNet V2 and S3DIS benchmarks demonstrate the superiority of CPCM over the state-of-the-art.
    摘要 我们研究弱监督点云语义分割任务,即使用稀疏标注(例如,少于0.1%的点被标注),以降低密集标注的高昂成本。然而,在标注点极其稀疏的情况下,很难同时提取场景理解(如语义分割)所需的上下文信息和物体信息。受图像和视频表征学习中掩码建模(如MAE)的启发,我们希望借助掩码建模的能力,从稀疏标注的点中学习上下文信息。然而,直接将MAE应用于稀疏标注的3D点云可能行不通:其一,如何从3D点云中有效地掩蔽掉富含信息的视觉上下文并非易事;其二,如何充分利用稀疏标注进行上下文建模仍是一个开放问题。在本文中,我们提出了一种简单而有效的上下文点云建模(CPCM)方法,包含两个部分:区域级掩码(RegionMask)策略和上下文掩码训练(CMT)方法。具体来说,RegionMask在几何空间中对点云进行连续掩蔽,以构建有意义的掩码预测任务,用于后续的上下文学习;CMT将有监督分割的学习与无监督掩码上下文预测的学习解耦,分别有效地学习极其有限的标注点和大量的无标注点。在广泛使用的ScanNet V2和S3DIS基准上的大量实验表明,CPCM优于现有最先进方法。

Improved Distribution Matching for Dataset Condensation

  • paper_url: http://arxiv.org/abs/2307.09742
  • repo_url: https://github.com/uitrbn/idm
  • paper_authors: Ganlong Zhao, Guanbin Li, Yipeng Qin, Yizhou Yu
  • for: 这个论文的目的是缩小深度学习数据集,以减少储存成本和训练成本,但保持训练模型的表现良好。
  • methods: 这个论文提出了一种基于分布匹配的数据集缩小方法,该方法更加高效和有前途。具体来说,论文识别了两个重要缺陷(分布数据不均匀和无效的嵌入),并通过三种新技术(分区和扩展增强、有效和丰富的模型抽样、类别对分布调整)解决这些缺陷。
  • results: 这个方法在大量的数据集上实现了更高效的缩小,并且在训练深度学习模型时显示出了更好的表现,较之前的优化方法更加高效。实验结果显示了这个方法的有效性。
    Abstract Dataset Condensation aims to condense a large dataset into a smaller one while maintaining its ability to train a well-performing model, thus reducing the storage cost and training effort in deep learning applications. However, conventional dataset condensation methods are optimization-oriented and condense the dataset by performing gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models. In this paper, we propose a novel dataset condensation method based on distribution matching, which is more efficient and promising. Specifically, we identify two important shortcomings of naive distribution matching (i.e., imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques (i.e., partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization). Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources, thereby scaling data condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Codes are available at https://github.com/uitrbn/IDM
    摘要 数据集缩小(Dataset Condensation)旨在将大型数据集压缩为更小的数据集,同时保持其训练出高性能模型的能力,从而降低深度学习应用中的存储成本和训练开销。然而,传统的数据集缩小方法以优化为导向,通过在模型优化过程中进行梯度或参数匹配来压缩数据集,即使在小数据集和小模型上也计算量巨大。本文提出了一种基于分布匹配的新型数据集缩小方法,更加高效且更有前景。具体来说,我们指出了朴素分布匹配的两个重要缺陷(特征数量不均衡和未经验证的嵌入),并通过三种新技术(划分与扩展增强、高效且丰富的模型采样、类别感知分布正则化)加以解决。我们这一简单而有效的方法以远少于以往的计算资源超越了大多数基于优化的方法,从而将数据集缩小扩展到更大的数据集和模型。大量实验证明了该方法的有效性。代码可以在 https://github.com/uitrbn/IDM 获取。
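For context, the baseline distribution-matching objective that this line of work builds on can be sketched as matching mean embeddings of real and synthetic images of a class under randomly sampled feature extractors; IDM's three techniques are layered on top of such an objective. A hedged PyTorch sketch:

```python
import torch

def distribution_matching_loss(feat_fn, real_images, syn_images):
    """Baseline distribution matching for dataset condensation: make the
    mean embedding of the synthetic images match that of real images of
    the same class under a (randomly sampled) feature extractor `feat_fn`.
    IDM adds partitioning/expansion augmentation, enriched model sampling,
    and class-aware regularization on top (details in the paper)."""
    mu_real = feat_fn(real_images).mean(dim=0)
    mu_syn = feat_fn(syn_images).mean(dim=0)
    return torch.sum((mu_real - mu_syn) ** 2)
```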

ClickSeg: 3D Instance Segmentation with Click-Level Weak Annotations

  • paper_url: http://arxiv.org/abs/2307.09732
  • repo_url: None
  • paper_authors: Leyao Liu, Tao Kong, Minzhao Zhu, Jiashuo Fan, Lu Fang
  • for: 这个论文旨在提出一种ClickSeg方法,该方法可以在具有只一个点标注的情况下实现3D实例分割。
  • methods: ClickSeg方法使用了一种基准的弱iously supervised training方法,通过模型自己生成pseudo标签来补充无标签数据。此外,该方法还提出了一个新的训练框架,使用k-means算法并设置了固定的初始种子。
  • results: ClickSeg方法在ScanNetV2和S3DIS数据集上实现了较好的实验结果,比前一个最佳弱iously supervised实例分割结果高出9.4%的MAP值。使用0.02%的超vision信号,ClickSeg方法可以达到$\sim$90%的准确率,同时也实现了同样的annotation设置下的状态态的 semantic segmentation结果。
    Abstract 3D instance segmentation methods often require fully-annotated dense labels for training, which are costly to obtain. In this paper, we present ClickSeg, a novel click-level weakly supervised 3D instance segmentation method that requires one point per instance annotation merely. Such a problem is very challenging due to the extremely limited labels, which has rarely been solved before. We first develop a baseline weakly-supervised training method, which generates pseudo labels for unlabeled data by the model itself. To utilize the property of click-level annotation setting, we further propose a new training framework. Instead of directly using the model inference way, i.e., mean-shift clustering, to generate the pseudo labels, we propose to use k-means with fixed initial seeds: the annotated points. New similarity metrics are further designed for clustering. Experiments on ScanNetV2 and S3DIS datasets show that the proposed ClickSeg surpasses the previous best weakly supervised instance segmentation result by a large margin (e.g., +9.4% mAP on ScanNetV2). Using 0.02% supervision signals merely, ClickSeg achieves $\sim$90% of the accuracy of the fully-supervised counterpart. Meanwhile, it also achieves state-of-the-art semantic segmentation results among weakly supervised methods that use the same annotation settings.
    摘要 3D实例分割方法通常需要完全标注的密集标签进行训练,而这类标注的获取成本高昂。在本文中,我们提出了ClickSeg,一种新颖的点击级弱监督3D实例分割方法,每个实例仅需一个点的标注。由于标注极其有限,这一问题非常具有挑战性,此前鲜有解决。我们首先开发了一种基线弱监督训练方法,由模型自身为无标注数据生成伪标签。为利用点击级标注设定的特性,我们进一步提出了一个新的训练框架:不直接采用模型推理方式(即mean-shift聚类)生成伪标签,而是使用以标注点为固定初始种子的k-means,并为聚类设计了新的相似度度量。在ScanNetV2和S3DIS数据集上的实验表明,所提出的ClickSeg大幅超越了此前最佳的弱监督实例分割结果(例如在ScanNetV2上mAP提升9.4%)。仅使用0.02%的监督信号,ClickSeg即可达到全监督方法约90%的精度;同时,在相同标注设定下,它还取得了弱监督方法中最先进的语义分割结果。
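The pseudo-label generation step described above (k-means initialized at the click-annotated points) can be sketched directly; the feature dimensionality, similarity metric (plain Euclidean here), and number of iterations are illustrative assumptions.

```python
import numpy as np

def kmeans_fixed_seeds(feats, seed_feats, iters=10):
    """feats: (N, D) per-point features; seed_feats: (K, D) features of the
    K click-annotated points, used as the fixed initial cluster centers."""
    centers = seed_feats.astype(np.float64).copy()
    for _ in range(iters):
        # Assign every point to its nearest center (pseudo instance label).
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # Update centers from current assignments.
        for k in range(len(centers)):
            members = feats[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return assign
```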

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

  • paper_url: http://arxiv.org/abs/2307.09729
  • repo_url: None
  • paper_authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu, Yusheng Zhang, Rongyu Zhang, Hang Shi, Qihang Xu, Longan Xiao, Zhiliang Ma, Mirko Agarla, Luigi Celona, Claudio Rota, Raimondo Schettini, Zhiwei Huang, Yanan Li, Xiaotao Wang, Lei Lei, Hongye Liu, Wei Hong, Ironhead Chuang, Allen Lin, Drake Guan, Iris Chen, Kae Lou, Willy Huang, Yachun Tasi, Yvonne Kao, Haotian Fan, Fangyuan Kong, Shiqi Zhou, Hao Liu, Yu Lai, Shanshan Chen, Wenqi Wang, Haoning Wu, Chaofeng Chen, Chunzheng Zhu, Zekun Guo, Shiling Zhao, Haibing Yin, Hongkui Wang, Hanene Brachemi Meftah, Sid Ahmed Fezza, Wassim Hamidouche, Olivier Déforges, Tengfei Shi, Azadeh Mansouri, Hossein Motamednia, Amir Hossein Bakhtiari, Ahmad Mahmoudi Aznaveh
  • for: 本研究旨在提高视频处理领域中的视频质量评估(VQA)技术,特别是对于进行了增强处理的视频。
  • methods: 本研究使用了VQA Dataset for Perceptual Video Enhancement(VDPVE),包括600个经过色彩、亮度和对比度增强的视频、310个去模糊视频和301个去抖动视频。参赛者既可以使用基线方法,也可以提交自行研发的模型。
  • results: 本研究发现了一些方法可以超越基线方法的性能,并且赢得了比赛。赢得者的方法展示了出色的预测性能。
    Abstract This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.

Uncertainty-Driven Multi-Scale Feature Fusion Network for Real-time Image Deraining

  • paper_url: http://arxiv.org/abs/2307.09728
  • repo_url: None
  • paper_authors: Ming Tong, Xuefeng Yan, Yongzhen Wang
  • for: Improving the performance of visual-based measurement systems in rainy weather.
  • methods: Proposes an Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that incorporates uncertainty information into multi-scale collaborative feature fusion to estimate predictive uncertainty.
  • results: Experiments show that UMFFNet reduces prediction errors and improves visual measurement performance with few parameters, surpassing state-of-the-art image deraining methods.
    Abstract Visual-based measurement systems are frequently affected by rainy weather due to the degradation caused by rain streaks in captured images, and existing imaging devices struggle to address this issue in real-time. While most efforts leverage deep networks for image deraining and have made progress, their large parameter sizes hinder deployment on resource-constrained devices. Additionally, these data-driven models often produce deterministic results, without considering their inherent epistemic uncertainty, which can lead to undesired reconstruction errors. Well-calibrated uncertainty can help alleviate prediction errors and assist measurement devices in mitigating risks and improving usability. Therefore, we propose an Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that learns the probability mapping distribution between paired images to estimate uncertainty. Specifically, we introduce an uncertainty feature fusion block (UFFB) that utilizes uncertainty information to dynamically enhance acquired features and focus on blurry regions obscured by rain streaks, reducing prediction errors. In addition, to further boost the performance of UMFFNet, we fused feature information from multiple scales to guide the network for efficient collaborative rain removal. Extensive experiments demonstrate that UMFFNet achieves significant performance improvements with few parameters, surpassing state-of-the-art image deraining methods.
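As a rough illustration of how predicted uncertainty can steer feature fusion, the sketch below turns a per-pixel log-variance map into an attention map that re-weights features, loosely in the spirit of the UFFB. The layer sizes, the sigmoid gating, and the concatenation-plus-1x1-conv fusion are assumptions, not the authors' architecture.

```python
# Hedged sketch of an uncertainty-guided feature fusion block.
import torch
import torch.nn as nn

class UncertaintyFusionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_attn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),                      # attention in [0, 1]
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat, log_var):
        # feat: (B, C, H, W) features; log_var: (B, 1, H, W) predicted
        # log-variance acting as an uncertainty proxy
        attn = self.to_attn(log_var)
        enhanced = feat * (1.0 + attn)         # emphasize uncertain regions
        return self.fuse(torch.cat([feat, enhanced], dim=1))

feat = torch.randn(2, 64, 32, 32)
log_var = torch.randn(2, 1, 32, 32)
out = UncertaintyFusionBlock(64)(feat, log_var)   # (2, 64, 32, 32)
```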

SAMConvex: Fast Discrete Optimization for CT Registration using Self-supervised Anatomical Embedding and Correlation Pyramid

  • paper_url: http://arxiv.org/abs/2307.09727
  • repo_url: None
  • paper_authors: Zi Li, Lin Tian, Tony C. W. Mok, Xiaoyu Bai, Puyang Wang, Jia Ge, Jingren Zhou, Le Lu, Xianghua Ye, Ke Yan, Dakai Jin
  • for: Proposes a fast and accurate CT registration method for high-precision registration in medical image processing.
  • methods: Uses SAMConvex, a fast coarse-to-fine discrete optimization method that builds correlation volumes from self-supervised anatomical embedding (SAM) features, capturing both local and global information.
  • results: SAMConvex achieves higher accuracy and speed than competing methods on two inter-patient CT registration datasets (Abdomen CT and HeadNeck CT) and one intra-patient dataset (Lung CT), taking only about 2 s per image pair (about 5 s with instance optimization).
    Abstract Estimating displacement vector field via a cost volume computed in the feature space has shown great success in image registration, but it suffers excessive computation burdens. Moreover, existing feature descriptors only extract local features incapable of representing the global semantic information, which is especially important for solving large transformations. To address the discussed issues, we propose SAMConvex, a fast coarse-to-fine discrete optimization method for CT registration that includes a decoupled convex optimization procedure to obtain deformation fields based on a self-supervised anatomical embedding (SAM) feature extractor that captures both local and global information. To be specific, SAMConvex extracts per-voxel features and builds 6D correlation volumes based on SAM features, and iteratively updates a flow field by performing lookups on the correlation volumes with a coarse-to-fine scheme. SAMConvex outperforms the state-of-the-art learning-based methods and optimization-based methods over two inter-patient registration datasets (Abdomen CT and HeadNeck CT) and one intra-patient registration dataset (Lung CT). Moreover, as an optimization-based method, SAMConvex only takes $\sim2$s ($\sim5s$ with instance optimization) for one paired images.
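The correlation-volume lookup can be illustrated with a brute-force discrete displacement search: correlate the fixed features with shifted copies of the moving features and keep the best displacement per voxel. This toy sketch ignores SAM feature extraction, the convex optimization, and the coarse-to-fine pyramid; the search radius and tensor shapes are assumptions.

```python
# Simplified shift-and-correlate search over a small discrete displacement range.
import torch

def discrete_flow(feat_fix, feat_mov, radius=2):
    """feat_*: (C, D, H, W) L2-normalized features. Returns (3, D, H, W)
    integer displacements maximizing local feature correlation."""
    C, D, H, W = feat_fix.shape
    best_score = torch.full((D, H, W), -1e9)
    best_disp = torch.zeros(3, D, H, W)
    for dz in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                shifted = torch.roll(feat_mov, shifts=(dz, dy, dx), dims=(1, 2, 3))
                score = (feat_fix * shifted).sum(dim=0)     # per-voxel correlation
                mask = score > best_score
                best_score = torch.where(mask, score, best_score)
                for i, d in enumerate((dz, dy, dx)):
                    best_disp[i] = torch.where(
                        mask, torch.full_like(score, float(d)), best_disp[i])
    return best_disp

f = torch.nn.functional.normalize(torch.randn(8, 16, 16, 16), dim=0)
m = torch.nn.functional.normalize(torch.randn(8, 16, 16, 16), dim=0)
flow = discrete_flow(f, m)   # (3, 16, 16, 16)
```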

AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks

  • paper_url: http://arxiv.org/abs/2307.09724
  • repo_url: https://github.com/kibeom-hong/aespa-net
  • paper_authors: Kibeom Hong, Seogkyu Jeon, Junsoo Lee, Namhyuk Ahn, Kunhee Kim, Pilhyeon Lee, Daesik Kim, Youngjung Uh, Hyeran Byun
  • for: This paper addresses style transfer, i.e., rendering a content image in the style of an artistic reference image.
  • methods: Uses an attention mechanism to map local patches of the style image to corresponding patches of the content image; it also introduces a novel metric, pattern repeatability, to quantify the repetition of patterns in the style image, and a self-supervisory task that encourages the attention mechanism to learn precise and meaningful semantic correspondence.
  • results: Qualitative and quantitative evaluations verify the reliability of pattern repeatability and demonstrate the superiority of the proposed framework over existing methods.
    Abstract To deliver the artistic expression of the target style, recent studies exploit the attention mechanism owing to its ability to map the local patches of the style image to the corresponding patches of the content image. However, because of the low semantic correspondence between arbitrary content and artworks, the attention module repeatedly abuses specific local patches from the style image, resulting in disharmonious and evident repetitive artifacts. To overcome this limitation and accomplish impeccable artistic style transfer, we focus on enhancing the attention mechanism and capturing the rhythm of patterns that organize the style. In this paper, we introduce a novel metric, namely pattern repeatability, that quantifies the repetition of patterns in the style image. Based on the pattern repeatability, we propose Aesthetic Pattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot of local and global style expressions. In addition, we propose a novel self-supervisory task to encourage the attention mechanism to learn precise and meaningful semantic correspondence. Lastly, we introduce the patch-wise style loss to transfer the elaborate rhythm of local patterns. Through qualitative and quantitative evaluations, we verify the reliability of the proposed pattern repeatability that aligns with human perception, and demonstrate the superiority of the proposed framework.
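One simple way to picture a pattern-repetition score is average patch self-similarity: cut the style image into patches and measure how well each patch matches its nearest other patch. The sketch below is an assumed proxy, not the paper's definition of pattern repeatability.

```python
# Illustrative patch self-similarity as a proxy for pattern repetition.
import torch
import torch.nn.functional as F

def patch_repeatability(img, patch=16, stride=16):
    """img: (1, C, H, W) in [0, 1]. Returns a scalar in [-1, 1]: higher means
    patches resemble each other more (more repetitive patterns)."""
    patches = F.unfold(img, kernel_size=patch, stride=stride)   # (1, C*p*p, N)
    patches = patches.squeeze(0).t()                            # (N, C*p*p)
    patches = F.normalize(patches - patches.mean(dim=1, keepdim=True), dim=1)
    sim = patches @ patches.t()                                 # pairwise similarity
    sim.fill_diagonal_(float('-inf'))                           # ignore self-match
    return sim.max(dim=1).values.mean().item()                  # average best match

style = torch.rand(1, 3, 128, 128)
print(patch_repeatability(style))
```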

Semantic-Aware Dual Contrastive Learning for Multi-label Image Classification

  • paper_url: http://arxiv.org/abs/2307.09715
  • repo_url: https://github.com/yu-gi-oh-leilei/sadcl
  • paper_authors: Leilei Ma, Dengdi Sun, Lei Wang, Haifeng Zhao, Bin Luo
  • for: Improving the extraction of image semantics and the multi-label classification of multiple objects or attributes in natural images.
  • methods: Combines semantic-aware representation learning and class activation maps (CAM) with sample-to-sample contrastive learning (SSCL) and prototype-to-sample contrastive learning (PSCL) over category prototypes.
  • results: Experiments on five large-scale public datasets show that the proposed method outperforms existing approaches and accurately captures discriminative label-level features related to image content.
    Abstract Extracting image semantics effectively and assigning corresponding labels to multiple objects or attributes for natural images is challenging due to the complex scene contents and confusing label dependencies. Recent works have focused on modeling label relationships with graph and understanding object regions using class activation maps (CAM). However, these methods ignore the complex intra- and inter-category relationships among specific semantic features, and CAM is prone to generate noisy information. To this end, we propose a novel semantic-aware dual contrastive learning framework that incorporates sample-to-sample contrastive learning (SSCL) as well as prototype-to-sample contrastive learning (PSCL). Specifically, we leverage semantic-aware representation learning to extract category-related local discriminative features and construct category prototypes. Then based on SSCL, label-level visual representations of the same category are aggregated together, and features belonging to distinct categories are separated. Meanwhile, we construct a novel PSCL module to narrow the distance between positive samples and category prototypes and push negative samples away from the corresponding category prototypes. Finally, the discriminative label-level features related to the image content are accurately captured by the joint training of the above three parts. Experiments on five challenging large-scale public datasets demonstrate that our proposed method is effective and outperforms the state-of-the-art methods. Code and supplementary materials are released on https://github.com/yu-gi-oh-leilei/SADCL.
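A prototype-to-sample contrastive term of the kind described above can be sketched as an InfoNCE-style loss that pulls each label-level feature toward its own category prototype and away from the others. The temperature, feature shapes, and weighting are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a PSCL-style prototype-to-sample contrastive loss.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, labels, prototypes, tau=0.1):
    """feats: (B, C, D) label-level features per class; labels: (B, C) multi-hot
    ground truth; prototypes: (C, D) category prototypes."""
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = torch.einsum('bcd,kd->bck', feats, protos) / tau   # (B, C, C)
    log_prob = F.log_softmax(logits, dim=-1)
    pos = log_prob.diagonal(dim1=1, dim2=2)                     # own prototype
    return -(pos * labels).sum() / labels.sum().clamp(min=1)

B, C, D = 4, 10, 128
loss = prototype_contrastive_loss(torch.randn(B, C, D),
                                  torch.randint(0, 2, (B, C)).float(),
                                  torch.randn(C, D))
```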

Towards Saner Deep Image Registration

  • paper_url: http://arxiv.org/abs/2307.09696
  • repo_url: https://github.com/tuffr5/saner-deep-registration
  • paper_authors: Bin Duan, Ming Zhong, Yan Yan
  • for: Investigates the accuracy and desirable behaviors (sanity) of learning-based deep image registration methods for medical images.
  • methods: Proposes a regularization-based sanity-enforcer method that imposes two sanity checks on the deep model to reduce its inverse-consistency errors and increase its discriminative power simultaneously.
  • results: Experiments show that the method improves model sanity without sacrificing performance, and the authors provide theoretical guarantees for it.
    Abstract With recent advances in computing hardware and surges of deep-learning architectures, learning-based deep image registration methods have surpassed their traditional counterparts, in terms of metric performance and inference time. However, these methods focus on improving performance measurements such as Dice, resulting in less attention given to model behaviors that are equally desirable for registrations, especially for medical imaging. This paper investigates these behaviors for popular learning-based deep registrations under a sanity-checking microscope. We find that most existing registrations suffer from low inverse consistency and nondiscrimination of identical pairs due to overly optimized image similarities. To rectify these behaviors, we propose a novel regularization-based sanity-enforcer method that imposes two sanity checks on the deep model to reduce its inverse consistency errors and increase its discriminative power simultaneously. Moreover, we derive a set of theoretical guarantees for our sanity-checked image registration method, with experimental results supporting our theoretical findings and their effectiveness in increasing the sanity of models without sacrificing any performance. Our code and models are available at https://github.com/tuffr5/Saner-deep-registration.
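Inverse consistency, one of the properties discussed above, can be measured by composing the forward and backward displacement fields and checking how far the composition is from the identity. The sketch below assumes 2D fields in grid_sample's normalized (x, y) convention; it illustrates the check, not the paper's regularizer.

```python
# Inverse-consistency error: || u_ab(x) + u_ba(x + u_ab(x)) || averaged over x.
import torch
import torch.nn.functional as F

def inverse_consistency_error(u_ab, u_ba):
    """u_ab, u_ba: (B, 2, H, W) displacements in normalized [-1, 1] coordinates
    with channels in grid_sample's (x, y) order."""
    B, _, H, W = u_ab.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing='ij')
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
    warped = grid + u_ab.permute(0, 2, 3, 1)                  # x + u_ab(x)
    u_ba_at_warped = F.grid_sample(u_ba, warped, align_corners=True)
    return (u_ab + u_ba_at_warped).norm(dim=1).mean()

err = inverse_consistency_error(0.05 * torch.randn(2, 2, 64, 64),
                                0.05 * torch.randn(2, 2, 64, 64))
```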

GlobalMapper: Arbitrary-Shaped Urban Layout Generation

  • paper_url: http://arxiv.org/abs/2307.09693
  • repo_url: None
  • paper_authors: Liu He, Daniel Aliaga
  • for: Layout generation for urban buildings, using graph attention networks for fully automatic generation of realistic city blocks with arbitrary road networks and building shapes.
  • methods: Using graph attention networks to skeletonize building layouts and generate realistic urban layouts, with conditional generation based on learned priors.
  • results: Superior performance compared to prior layout generation networks, demonstrated through generating layouts for 28 large cities with arbitrary city block and building shapes.
    Abstract Modeling and designing urban building layouts is of significant interest in computer vision, computer graphics, and urban applications. A building layout consists of a set of buildings in city blocks defined by a network of roads. We observe that building layouts are discrete structures, consisting of multiple rows of buildings of various shapes, and are amenable to skeletonization for mapping arbitrary city block shapes to a canonical form. Hence, we propose a fully automatic approach to building layout generation using graph attention networks. Our method generates realistic urban layouts given arbitrary road networks, and enables conditional generation based on learned priors. Our results, including user study, demonstrate superior performance as compared to prior layout generation networks, support arbitrary city block and varying building shapes as demonstrated by generating layouts for 28 large cities.

Domain Adaptation based Enhanced Detection for Autonomous Driving in Foggy and Rainy Weather

  • paper_url: http://arxiv.org/abs/2307.09676
  • repo_url: None
  • paper_authors: Jinlong Li, Runsheng Xu, Jin Ma, Qin Zou, Jiaqi Ma, Hongkai Yu
  • for: Addresses a detection bottleneck for autonomous driving cameras: detection models trained on clear weather may not reach the expected performance in foggy and rainy conditions.
  • methods: Proposes a domain-adaptive object detection framework that adapts at both the image level and the object level to reduce domain discrepancies; it further introduces a novel adversarial gradient reversal layer that performs adversarial mining on difficult instances, and generates an auxiliary domain via data augmentation to enforce a new domain-level metric regularization.
  • results: Experiments on a public V2V benchmark show a substantial improvement in object detection for foggy and rainy driving scenarios.
    Abstract Typically, object detection methods for autonomous driving that rely on supervised learning make the assumption of a consistent feature distribution between the training and testing data, however such assumption may fail in different weather conditions. Due to the domain gap, a detection model trained under clear weather may not perform well in foggy and rainy conditions. Overcoming detection bottlenecks in foggy and rainy weather is a real challenge for autonomous vehicles deployed in the wild. To bridge the domain gap and improve the performance of object detectionin foggy and rainy weather, this paper presents a novel framework for domain-adaptive object detection. The adaptations at both the image-level and object-level are intended to minimize the differences in image style and object appearance between domains. Furthermore, in order to improve the model's performance on challenging examples, we introduce a novel adversarial gradient reversal layer that conducts adversarial mining on difficult instances in addition to domain adaptation. Additionally, we suggest generating an auxiliary domain through data augmentation to enforce a new domain-level metric regularization. Experimental findings on public V2V benchmark exhibit a substantial enhancement in object detection specifically for foggy and rainy driving scenarios.
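The framework builds on a gradient reversal layer; the sketch below shows the standard GRL as a PyTorch autograd Function, the usual building block for domain-adversarial training. The paper's adversarial mining of difficult instances and its detection heads are not reproduced here.

```python
# Standard gradient reversal layer: identity in the forward pass, negated
# (scaled) gradient in the backward pass, so the feature extractor learns to
# fool the domain classifier.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feats = torch.randn(8, 256, requires_grad=True)
domain_logits = torch.nn.Linear(256, 2)(grad_reverse(feats, lam=0.5))
```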

Object-aware Gaze Target Detection

  • paper_url: http://arxiv.org/abs/2307.09662
  • repo_url: https://github.com/francescotonini/object-aware-gaze-target-detection
  • paper_authors: Francesco Tonini, Nicola Dall’Asen, Cigdem Beyan, Elisa Ricci
  • for: Proposes a Transformer-based gaze target detection method that automatically detects the people and gazed objects in a scene and builds associations between each head and the gazed object.
  • methods: Uses a Transformer architecture to automatically detect objects (including heads) in the scene and to build associations between every head and the gazed head/object, yielding a comprehensive, explainable gaze analysis.
  • results: On in-the-wild benchmarks, the method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection, plus an 11-13% improvement in average precision for the classification and localization of the gazed objects.
    Abstract Gaze target detection aims to predict the image location where the person is looking and the probability that a gaze is out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location, however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed-head/object, resulting in a comprehensive, explainable gaze analysis composed of: gaze target area, gaze pixel point, the class and the image location of the gazed-object. Upon evaluation of the in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects. The code of the proposed method is available https://github.com/francescotonini/object-aware-gaze-target-detection

Skin Lesion Correspondence Localization in Total Body Photography

  • paper_url: http://arxiv.org/abs/2307.09642
  • repo_url: https://github.com/weilunhuang-jhu/lesioncorrespondencetbp3d
  • paper_authors: Wei-Lun Huang, Davood Tashayyod, Jun Kang, Amir Gandjbakhche, Michael Kazhdan, Mehran Armand
  • for: Longitudinal tracking of skin lesions - finding correspondence, changes in morphology, and texture - aids the early detection of melanoma, but has not been well studied for full-body imaging.
  • methods: Proposes a novel framework combining geometric and texture information to localize skin lesion correspondence in total body photography. Body landmarks or sparse correspondences are first created on the source and target 3D textured meshes; every vertex on each mesh is then mapped to a feature vector of geodesic distances to the landmarks. For each lesion of interest on the source, its location on the target is coarsely estimated from the geometric information encoded in the feature vectors and then refined using the texture information.
  • results: Quantitative evaluation on a public and a private dataset yields success rates comparable to the only reported longitudinal study; as full-body 3D capture becomes more prevalent and of higher quality, the proposed method is expected to be a valuable step in the longitudinal tracking of skin lesions.
    Abstract Longitudinal tracking of skin lesions - finding correspondence, changes in morphology, and texture - is beneficial to the early detection of melanoma. However, it has not been well investigated in the context of full-body imaging. We propose a novel framework combining geometric and texture information to localize skin lesion correspondence from a source scan to a target scan in total body photography (TBP). Body landmarks or sparse correspondence are first created on the source and target 3D textured meshes. Every vertex on each of the meshes is then mapped to a feature vector characterizing the geodesic distances to the landmarks on that mesh. Then, for each lesion of interest (LOI) on the source, its corresponding location on the target is first coarsely estimated using the geometric information encoded in the feature vectors and then refined using the texture information. We evaluated the framework quantitatively on both a public and a private dataset, for which our success rates (at 10 mm criterion) are comparable to the only reported longitudinal study. As full-body 3D capture becomes more prevalent and has higher quality, we expect the proposed method to constitute a valuable step in the longitudinal tracking of skin lesions.
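The geometric half of this pipeline can be sketched as follows: describe every mesh vertex by its geodesic distances to the body landmarks (here approximated by shortest paths on the edge graph), then coarsely match a lesion vertex to the target vertex with the most similar descriptor. All names and the graph-based geodesic approximation are assumptions, and the texture refinement step is omitted.

```python
# Landmark-geodesic descriptors and coarse nearest-neighbor lesion matching.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def landmark_descriptors(verts, edges, landmark_idx):
    """verts: (V, 3); edges: (E, 2) vertex index pairs; landmark_idx: (L,).
    Returns (V, L) geodesic distances from every vertex to every landmark."""
    w = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    V = len(verts)
    g = csr_matrix((np.r_[w, w], (np.r_[edges[:, 0], edges[:, 1]],
                                  np.r_[edges[:, 1], edges[:, 0]])), shape=(V, V))
    return dijkstra(g, indices=landmark_idx).T            # (V, L)

def coarse_match(src_desc, tgt_desc, lesion_vertex):
    d = np.linalg.norm(tgt_desc - src_desc[lesion_vertex], axis=1)
    return int(d.argmin())                                # best target vertex

# toy usage on a path-graph "mesh"
verts = np.random.rand(50, 3)
edges = np.array([[i, i + 1] for i in range(49)])
desc = landmark_descriptors(verts, edges, np.array([0, 25, 49]))
```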

Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration

  • paper_url: http://arxiv.org/abs/2307.09621
  • repo_url: https://github.com/kcshum/neural_360_decoration
  • paper_authors: Ka Chun Shum, Hong-Wing Pang, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
  • for: This paper targets conditional scene decoration for 360-degree images, aiming to generate decorated panorama views of the same indoor scene.
  • methods: First develops a 360-aware object layout generator that learns latent object arrangements in the 360-degree view, then uses this layout to condition a generative adversarial network (GAN) that synthesizes images of the input scene. A simple yet effective scene emptier removes the generated furniture and produces an emptied scene so the model can learn a cyclic constraint.
  • results: Trained on the Structure3D dataset, the model generates diverse decorations with controllable object layouts and generalizes well to the Zillow indoor scene dataset; a user study confirms that the realistic image quality and furniture layouts provide an immersive experience.
    Abstract In this paper, we address the problem of conditional scene decoration for 360-degree images. Our method takes a 360-degree background photograph of an indoor scene and generates decorated images of the same scene in the panorama view. To do this, we develop a 360-aware object layout generator that learns latent object vectors in the 360-degree view to enable a variety of furniture arrangements for an input 360-degree background image. We use this object layout to condition a generative adversarial network to synthesize images of an input scene. To further reinforce the generation capability of our model, we develop a simple yet effective scene emptier that removes the generated furniture and produces an emptied scene for our model to learn a cyclic constraint. We train the model on the Structure3D dataset and show that our model can generate diverse decorations with controllable object layout. Our method achieves state-of-the-art performance on the Structure3D dataset and generalizes well to the Zillow indoor scene dataset. Our user study confirms the immersive experiences provided by the realistic image quality and furniture layout in our generation results. Our implementation will be made available.

Surgical Action Triplet Detection by Mixed Supervised Learning of Instrument-Tissue Interactions

  • paper_url: http://arxiv.org/abs/2307.09548
  • repo_url: None
  • paper_authors: Saurav Sharma, Chinedu Innocent Nwoye, Didier Mutter, Nicolas Padoy
  • for: Proposes an interaction-graph-based, multi-class instrument-aware Transformer model for detecting surgical action triplets.
  • methods: The model consists of an MCIT stage and an IG stage: MCIT models per-class embeddings of the targets to reduce the risk of misassociating triplets, while IG constructs a bipartite dynamic graph to model the interactions (verbs) between instruments and targets.
  • results: On the CholecT50 dataset, the model improves both instrument localization and triplet detection, topping the leaderboard of the CholecTriplet challenge at MICCAI 2022 with a new state-of-the-art result.
    Abstract Surgical action triplets describe instrument-tissue interactions as (instrument, verb, target) combinations, thereby supporting a detailed analysis of surgical scene activities and workflow. This work focuses on surgical action triplet detection, which is challenging but more precise than the traditional triplet recognition task as it consists of joint (1) localization of surgical instruments and (2) recognition of the surgical action triplet associated with every localized instrument. Triplet detection is highly complex due to the lack of spatial triplet annotation. We analyze how the amount of instrument spatial annotations affects triplet detection and observe that accurate instrument localization does not guarantee better triplet detection due to the risk of erroneous associations with the verbs and targets. To solve the two tasks, we propose MCIT-IG, a two-stage network, that stands for Multi-Class Instrument-aware Transformer-Interaction Graph. The MCIT stage of our network models per class embedding of the targets as additional features to reduce the risk of misassociating triplets. Furthermore, the IG stage constructs a bipartite dynamic graph to model the interaction between the instruments and targets, cast as the verbs. We utilize a mixed-supervised learning strategy that combines weak target presence labels for MCIT and pseudo triplet labels for IG to train our network. We observed that complementing minimal instrument spatial annotations with target embeddings results in better triplet detection. We evaluate our model on the CholecT50 dataset and show improved performance on both instrument localization and triplet detection, topping the leaderboard of the CholecTriplet challenge in MICCAI 2022.

Can Neural Network Memorization Be Localized?

  • paper_url: http://arxiv.org/abs/2307.09542
  • repo_url: https://github.com/pratyushmaini/localizing-memorization
  • paper_authors: Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang
  • for: This paper aims to explain the interplay between memorization and generalization in deep overparametrized networks.
  • methods: Uses three experiments (gradient accounting, layer rewinding, and retraining) to show that memorization is not confined to the final layers but is instead concentrated in a small set of neurons spread across various layers.
  • results: Memorization is often confined to a small number of neurons or channels (around 5); the proposed example-tied dropout can direct memorization to a predetermined set of neurons, reducing accuracy on memorized examples from 100% to 3% while also reducing the generalization gap.
    Abstract Recent efforts at explaining the interplay of memorization and generalization in deep overparametrized networks have posited that neural networks $\textit{memorize}$ "hard" examples in the final few layers of the model. Memorization refers to the ability to correctly predict on $\textit{atypical}$ examples of the training set. In this work, we show that rather than being confined to individual layers, memorization is a phenomenon confined to a small set of neurons in various layers of the model. First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are $\textit{gradient accounting}$ (measuring the contribution to the gradient norms from memorized and clean examples), $\textit{layer rewinding}$ (replacing specific model weights of a converged model with previous training checkpoints), and $\textit{retraining}$ (training rewound layers only on clean examples). Second, we ask a more generic question: can memorization be localized $\textit{anywhere}$ in a model? We discover that memorization is often confined to a small number of neurons or channels (around 5) of the model. Based on these insights we propose a new form of dropout -- $\textit{example-tied dropout}$ that enables us to direct the memorization of examples to an apriori determined set of neurons. By dropping out these neurons, we are able to reduce the accuracy on memorized examples from $100\%\to3\%$, while also reducing the generalization gap.
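A possible reading of example-tied dropout is sketched below: a fixed block of "memorization" channels is gated per training example and zeroed at evaluation, so memorization is steered into, and can later be removed with, those channels. The split sizes, the random tying rule, and the gating details are assumptions rather than the authors' implementation.

```python
# Hedged sketch of an example-tied dropout layer.
import torch
import torch.nn as nn

class ExampleTiedDropout(nn.Module):
    def __init__(self, channels, mem_channels, num_examples, p_mem=0.1):
        super().__init__()
        self.shared = channels - mem_channels
        # fixed random tie: which memorization channels each example may use
        mask = (torch.rand(num_examples, mem_channels) < p_mem).float()
        self.register_buffer('tie', mask)

    def forward(self, x, example_idx=None):
        # x: (B, C, H, W); example_idx: (B,) dataset indices (training only)
        shared, mem = x[:, :self.shared], x[:, self.shared:]
        if self.training and example_idx is not None:
            gate = self.tie[example_idx].unsqueeze(-1).unsqueeze(-1)
            mem = mem * gate                  # only tied channels stay active
        else:
            mem = torch.zeros_like(mem)       # drop memorization channels at eval
        return torch.cat([shared, mem], dim=1)

layer = ExampleTiedDropout(channels=64, mem_channels=16, num_examples=1000)
out = layer(torch.randn(8, 64, 32, 32), example_idx=torch.randint(0, 1000, (8,)))
```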

Adversarial Bayesian Augmentation for Single-Source Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.09520
  • repo_url: None
  • paper_authors: Sheng Cheng, Tejas Gokhale, Yezhou Yang
  • for: Addresses generalization to unseen image domains, which is difficult mainly because of the lack of diverse training data, inaccessible target data, and the large domain shifts found in many real-world settings.
  • methods: Proposes Adversarial Bayesian Augmentation (ABA), a novel algorithm that learns to generate image augmentations in the single-source domain generalization setting; ABA draws on adversarial learning and Bayesian neural networks to guide the generation of diverse data augmentations, and the synthesized image domains help the classifier generalize to unseen domains.
  • results: Across several types of domain shift (style shift, subpopulation shift, and shift in medical imaging), ABA outperforms all previous state-of-the-art methods, including pre-specified, pixel-based, and convolutional-based augmentations.
    Abstract Generalizing to unseen image domains is a challenging problem primarily due to the lack of diverse training data, inaccessible target data, and the large domain shift that may exist in many real-world settings. As such data augmentation is a critical component of domain generalization methods that seek to address this problem. We present Adversarial Bayesian Augmentation (ABA), a novel algorithm that learns to generate image augmentations in the challenging single-source domain generalization setting. ABA draws on the strengths of adversarial learning and Bayesian neural networks to guide the generation of diverse data augmentations -- these synthesized image domains aid the classifier in generalizing to unseen domains. We demonstrate the strength of ABA on several types of domain shift including style shift, subpopulation shift, and shift in the medical imaging setting. ABA outperforms all previous state-of-the-art methods, including pre-specified augmentations, pixel-based and convolutional-based augmentations.

AnyDoor: Zero-shot Object-level Image Customization

  • paper_url: http://arxiv.org/abs/2307.09481
  • repo_url: None
  • paper_authors: Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao
  • for: This paper presents a novel image generator called AnyDoor, which can teleport objects to new scenes in a harmonious way.
  • methods: The model uses a diffusion-based approach and is trained only once to generalize to diverse object-scene combinations at inference. The model also incorporates detail features to maintain texture details and support object blending with different surroundings.
  • results: The approach demonstrates superiority over existing alternatives and has great potential in real-world applications such as virtual try-on and object moving, as shown through extensive experiments.
    Abstract This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.

FACTS: Facial Animation Creation using the Transfer of Styles

  • paper_url: http://arxiv.org/abs/2307.09480
  • repo_url: None
  • paper_authors: Jack Saunders, Steven Caulkin, Vinay Namboodiri
  • for: Creating more expressive and believable characters for video games and other forms of entertainment.
  • methods: Proposes a new approach to facial animation that modifies the style characteristics of existing animations; specifically, a StarGAN is used to convert 3D facial animations into different emotions and person-specific styles.
  • results: The method preserves the lip-sync of the animations, thanks to a viseme-preserving loss, while producing animations in different emotions and person-specific styles.
    Abstract The ability to accurately capture and express emotions is a critical aspect of creating believable characters in video games and other forms of entertainment. Traditionally, this animation has been achieved with artistic effort or performance capture, both requiring costs in time and labor. More recently, audio-driven models have seen success, however, these often lack expressiveness in areas not correlated to the audio signal. In this paper, we present a novel approach to facial animation by taking existing animations and allowing for the modification of style characteristics. Specifically, we explore the use of a StarGAN to enable the conversion of 3D facial animations into different emotions and person-specific styles. We are able to maintain the lip-sync of the animations with this method thanks to the use of a novel viseme-preserving loss.

GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

  • paper_url: http://arxiv.org/abs/2307.09472
  • repo_url: None
  • paper_authors: Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
  • for: 提高3D幕면检测效率,适应实际应用需求
  • methods: 提出一种简单、快速、端到端检测器,同时保持高检测精度
  • results: 在3个实际3D幕面测试 benchmark 上,GroupLane 比 published state-of-the-art PersFormer 提高13.6% F1 分数,并且具有 faster inference speed 和 lower FLOPs than PersFormer
    Abstract Efficiency is quite important for 3D lane detection due to practical deployment demand. In this work, we propose a simple, fast, and end-to-end detector that still maintains high detection precision. Specifically, we devise a set of fully convolutional heads based on row-wise classification. In contrast to previous counterparts, ours supports recognizing both vertical and horizontal lanes. Besides, our method is the first one to perform row-wise classification in bird-eye-view. In the heads, we split feature into multiple groups and every group of feature corresponds to a lane instance. During training, the predictions are associated with lane labels using the proposed single-win one-to-one matching to compute loss, and no post-processing operation is demanded for inference. In this way, our proposed fully convolutional detector, GroupLane, realizes end-to-end detection like DETR. Evaluated on 3 real world 3D lane benchmarks, OpenLane, Once-3DLanes, and OpenLane-Huawei, GroupLane adopting ConvNext-Base as the backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 score in the OpenLane validation set. Besides, GroupLane with ResNet18 still surpasses PersFormer by 4.9% F1 score, while the inference speed is nearly 7x faster and the FLOPs is only 13.3% of it.
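End-to-end training without post-processing relies on a one-to-one assignment between predicted lane groups and ground-truth lanes. The sketch below uses a generic DETR-style Hungarian assignment over a classification-plus-regression cost; the paper's single-win one-to-one matching is a different rule, so this is only an illustration of the idea.

```python
# Generic one-to-one matching between predictions and ground-truth lanes.
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_match(pred_scores, pred_lanes, gt_lanes, w_cls=1.0, w_reg=1.0):
    """pred_scores: (P,) lane-existence probabilities; pred_lanes: (P, K)
    row-wise offsets; gt_lanes: (G, K). Returns matched (pred, gt) index pairs."""
    cls_cost = -pred_scores[:, None]                                    # (P, 1)
    reg_cost = np.abs(pred_lanes[:, None, :] - gt_lanes[None]).mean(-1) # (P, G)
    cost = w_cls * cls_cost + w_reg * reg_cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

matches = one_to_one_match(np.random.rand(10), np.random.rand(10, 72),
                           np.random.rand(3, 72))
```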

Occlusion Aware Student Emotion Recognition based on Facial Action Unit Detection

  • paper_url: http://arxiv.org/abs/2307.09465
  • repo_url: None
  • paper_authors: Shrouk Wally, Ahmed Elsayed, Islam Alkabbany, Asem Ali, Aly Farag
  • for: Improving the quality of classroom environments and the first-year retention of STEM undergraduates.
  • methods: Introduces an artificially occluded dataset and proposes an occlusion-aware facial action unit extraction method, built on an attention mechanism and adaptive feature learning, for recognizing expressions in classroom settings.
  • results: The study shows that occlusion substantially affects the performance of facial analysis models, and the proposed occlusion-aware action unit extraction improves the reliability of expression recognition.
    Abstract Given that approximately half of science, technology, engineering, and mathematics (STEM) undergraduate students in U.S. colleges and universities leave by the end of the first year [15], it is crucial to improve the quality of classroom environments. This study focuses on monitoring students' emotions in the classroom as an indicator of their engagement and proposes an approach to address this issue. The impact of different facial parts on the performance of an emotional recognition model is evaluated through experimentation. To test the proposed model under partial occlusion, an artificially occluded dataset is introduced. The novelty of this work lies in the proposal of an occlusion-aware architecture for facial action units (AUs) extraction, which employs attention mechanism and adaptive feature learning. The AUs can be used later to classify facial expressions in classroom settings. This research paper's findings provide valuable insights into handling occlusion in analyzing facial images for emotional engagement analysis. The proposed experiments demonstrate the significance of considering occlusion and enhancing the reliability of facial analysis models in classroom environments. These findings can also be extended to other settings where occlusions are prevalent.

Measuring Student Behavioral Engagement using Histogram of Actions

  • paper_url: http://arxiv.org/abs/2307.09420
  • repo_url: None
  • paper_authors: Ahmed Abdelkawy, Islam Alkabbany, Asem Ali, Aly Farag
  • for: Proposes a novel technique for measuring student behavioral engagement by recognizing students' actions and predicting their engagement level.
  • methods: Uses human skeletons to model student postures and upper-body movements and a 3D-CNN to learn upper-body dynamics; the trained 3D-CNN recognizes the actions within every 2-minute video segment, the actions are aggregated into a histogram of actions, and this histogram is fed to an SVM classifier to decide whether the student is engaged or disengaged.
  • results: Experiments show that student actions are recognized with 83.63% top-1 accuracy and that the framework captures the average engagement of the class.
    Abstract In this paper, we propose a novel technique for measuring behavioral engagement through students' actions recognition. The proposed approach recognizes student actions then predicts the student behavioral engagement level. For student action recognition, we use human skeletons to model student postures and upper body movements. To learn the dynamics of student upper body, a 3D-CNN model is used. The trained 3D-CNN model is used to recognize actions within every 2minute video segment then these actions are used to build a histogram of actions which encodes the student actions and their frequencies. This histogram is utilized as an input to SVM classifier to classify whether the student is engaged or disengaged. To evaluate the proposed framework, we build a dataset consisting of 1414 2-minute video segments annotated with 13 actions and 112 video segments annotated with two engagement levels. Experimental results indicate that student actions can be recognized with top 1 accuracy 83.63% and the proposed framework can capture the average engagement of the class.
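The histogram-of-actions step is simple enough to sketch directly: per-segment action labels (random here, standing in for the 3D-CNN's predictions) are binned into a frequency-normalized 13-bin histogram, which an SVM classifies as engaged or disengaged. All data below are toy values; only the number of action classes follows the paper.

```python
# Histogram of actions followed by an SVM engagement classifier.
import numpy as np
from sklearn.svm import SVC

NUM_ACTIONS = 13

def histogram_of_actions(action_ids):
    hist = np.bincount(action_ids, minlength=NUM_ACTIONS).astype(float)
    return hist / max(hist.sum(), 1.0)            # frequency-normalized

rng = np.random.default_rng(0)
X = np.stack([histogram_of_actions(rng.integers(0, NUM_ACTIONS, size=30))
              for _ in range(112)])               # one histogram per 2-min segment
y = rng.integers(0, 2, size=112)                  # engaged / disengaged labels
clf = SVC(kernel='rbf').fit(X, y)
print(clf.predict(X[:5]))
```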