Results: The experimental results show that the proposed self-supervised learning method can effectively estimate multiple spatial acoustic parameters, including time difference of arrival, direct-to-reverberant ratio, and reverberation time. The paper also demonstrates the feasibility and effectiveness of the method in real-world applications.

Abstract
Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
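To make the cross-channel signal reconstruction (CCSR) pretext task more concrete, below is a minimal sketch of the idea for a two-channel recording. Everything here is an assumption for illustration, not the authors' implementation: it uses STFT magnitudes, random time-frame masking, and plain Transformer encoders standing in for the MC-Conformer; module names, shapes, and hyperparameters are hypothetical.

```python
# Minimal sketch of the CCSR pretext task (illustrative, not the paper's code).
# Assumptions: STFT-magnitude inputs, random time-frame masking of channel 0,
# generic Transformer encoders in place of the MC-Conformer.
import torch
import torch.nn as nn

class CCSRPretextModel(nn.Module):
    def __init__(self, freq_bins=256, embed_dim=256):
        super().__init__()
        # Spatial encoder: sees the (partially masked) two-channel input and is
        # meant to capture inter-channel / spatial acoustic cues.
        self.spatial_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        # Source encoder: extracts source content from the unmasked reference channel.
        self.source_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.in_proj = nn.Linear(2 * freq_bins, embed_dim)   # both channels, after masking
        self.src_proj = nn.Linear(freq_bins, embed_dim)      # unmasked reference channel
        self.decoder = nn.Linear(2 * embed_dim, freq_bins)   # reconstruct the masked channel

    def forward(self, stft_mag, mask):
        # stft_mag: (batch, channels=2, time, freq); mask: (batch, time) boolean.
        x = stft_mag.clone()
        x[:, 0][mask] = 0.0                                  # mask frames of channel 0 only
        both = torch.cat([x[:, 0], x[:, 1]], dim=-1)         # (batch, time, 2*freq)
        spatial = self.spatial_encoder(self.in_proj(both))   # spatial acoustic representation
        source = self.source_encoder(self.src_proj(stft_mag[:, 1]))  # source info from channel 1
        recon = self.decoder(torch.cat([spatial, source], dim=-1))
        return recon, spatial

def ccsr_loss(recon, target_ch0, mask):
    # Reconstruction loss computed only on the masked frames of channel 0.
    return ((recon - target_ch0) ** 2)[mask].mean()
```

In the fine-tuning stage described above, only the pre-trained spatial encoder would be kept and a small regression head trained on the annotated data for a given downstream parameter (e.g., reverberation time).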