eess.AS - 2023-10-21

SwG-former: Sliding-window Graph Convolutional Network Integrated with Conformer for Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2310.14016
  • repo_url: None
  • paper_authors: Weiming Huang, Qinghua Huang, Liyan Ma, Zhengyu Chen, Chuan Wang
  • for: This work aims to improve the performance of sound event localization and detection (SELD) systems, particularly in natural spatial acoustic environments.
  • methods: A novel graph representation leveraging a graph convolutional network (GCN) is proposed, capable of extracting spatial and temporal features simultaneously. In addition, a robust Conv2dAgg function is proposed to aggregate the features of neighbor vertices.
  • results: Compared with recent advanced SELD models, the proposed SwG-former achieves superior performance under the same acoustic environment. Furthermore, integrating the SwG module into the EINV2 network yields SwG-EINV2, which surpasses the state-of-the-art (SOTA) methods.
    Abstract Sound event localization and detection (SELD) is a joint task of sound event detection (SED) and direction of arrival (DoA) estimation. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. To jointly optimize two subtasks, the SELD system should extract spatial correlations and model temporal dependencies simultaneously. However, numerous models mainly extract spatial correlations and model temporal dependencies separately. In this paper, the interdependence of spatial-temporal information in audio signals is exploited for simultaneous extraction to enhance the model performance. In response, a novel graph representation leveraging graph convolutional network (GCN) in non-Euclidean space is developed to extract spatial-temporal information concurrently. A sliding-window graph (SwG) module is designed based on the graph representation. It exploits sliding-windows with different sizes to learn temporal context information and dynamically constructs graph vertices in the frequency-channel (F-C) domain to capture spatial correlations. Furthermore, as the cornerstone of message passing, a robust Conv2dAgg function is proposed and embedded into the SwG module to aggregate the features of neighbor vertices. To improve the performance of SELD in a natural spatial acoustic environment, a general and efficient SwG-former model is proposed by integrating the SwG module with the Conformer. It exhibits superior performance in comparison to recent advanced SELD models. To further validate the generality and efficiency of the SwG-former, it is seamlessly integrated into the event-independent network version 2 (EINV2) called SwG-EINV2. The SwG-EINV2 surpasses the state-of-the-art (SOTA) methods under the same acoustic environment.
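    The core idea of the SwG module — slicing the time axis into sliding windows, treating frequency-channel positions inside each window as graph vertices, dynamically building edges, and aggregating neighbor features — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the k-NN edge construction, the max-style aggregation standing in for Conv2dAgg, and all function names (`knn_graph`, `aggregate_max`, `swg_layer`) are simplifying assumptions.

    ```python
    import numpy as np

    def knn_graph(vertices, k):
        """Build k-NN neighbor indices from pairwise Euclidean distances.
        vertices: (N, D) array; returns (N, k) indices (assumed edge rule)."""
        d = np.linalg.norm(vertices[:, None, :] - vertices[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude self-loops
        return np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per vertex

    def aggregate_max(vertices, neighbors):
        """Max-aggregate each vertex with its neighbors' features
        (a simplified stand-in for the paper's Conv2dAgg function)."""
        return np.maximum(vertices, vertices[neighbors].max(axis=1))

    def swg_layer(x, win, k):
        """x: (T, F, C) time-frequency-channel features.
        For each sliding window of `win` frames, the F-C positions inside
        the window become graph vertices; a graph is built dynamically and
        neighbor features are aggregated. Overlapping windows are averaged."""
        T, F, C = x.shape
        out = np.zeros_like(x)
        count = np.zeros((T, 1, 1))
        for t in range(T - win + 1):
            v = x[t:t + win].reshape(win * F, C)   # vertices in the F-C domain
            agg = aggregate_max(v, knn_graph(v, k))
            out[t:t + win] += agg.reshape(win, F, C)
            count[t:t + win] += 1
        return out / count

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.normal(size=(8, 4, 3))             # toy (T, F, C) feature map
        y = swg_layer(x, win=3, k=2)
        print(y.shape)                             # same shape as the input
    ```

    Using several window sizes (`win`) in parallel, as the paper describes, would let the module capture temporal context at multiple scales while the per-window graph captures spatial correlations.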