cs.SD - 2023-08-19

Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

  • paper_url: http://arxiv.org/abs/2308.09944
  • repo_url: None
  • paper_authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv
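    for: This study aims to improve performance on the fake speech detection (FSD) task, in particular by exploiting the fact that the rhythm of synthetic speech is too smooth.
    methods: The paper proposes a novel F0 subband and a Res2Net with spatial reconstructed local attention (SR-LA Res2Net).
    results: On the ASVspoof 2019 LA dataset, the proposed method achieves an EER of 0.47% and a min t-DCF of 0.0159, on par with the best performance among all single systems.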
    Abstract The rhythm of synthetic speech is usually too smooth, which causes the fundamental frequency (F0) of synthetic speech to differ significantly from that of real speech. The F0 feature is therefore expected to contain discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to model the F0 subband effectively and thereby improve FSD performance, we propose the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net is used as the backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel groups are repeatedly superimposed. Local attention is further designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that the proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among single systems.
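    The multiscale behaviour attributed to the Res2Net backbone comes from splitting the channels into groups and feeding each group's output into the next. A minimal pure-Python sketch of that hierarchical split follows. It is not the authors' SR-LA Res2Net: the `smooth` function is a hypothetical stand-in for a learned convolution, the names `res2net_split` and `scale` are illustrative, and the surrounding 1x1 convolutions, the residual shortcut, the spatial reconstruction mechanism, and local attention are all omitted.

```python
def smooth(values):
    """Hypothetical stand-in for a learned conv: 3-tap moving average, edge-padded."""
    padded = [values[0]] + list(values) + [values[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3
            for i in range(len(values))]


def res2net_split(channels, scale=4, transform=smooth):
    """Hierarchical channel-group processing in the style of Res2Net.

    channels: list of per-channel feature vectors (lists of floats).
    scale:    number of channel groups; must divide len(channels).
    Rule:     y_1 = x_1, y_2 = K(x_2), y_i = K(x_i + y_{i-1}) for i > 2.
    """
    if scale < 1 or len(channels) % scale != 0:
        raise ValueError("number of channels must be divisible by scale")
    width = len(channels) // scale
    groups = [channels[i * width:(i + 1) * width] for i in range(scale)]

    outputs = [groups[0]]  # first group passes through unchanged
    previous = None
    for group in groups[1:]:
        if previous is None:
            mixed = group  # second group has no earlier output to add
        else:
            # add the previous group's output before transforming
            mixed = [[a + b for a, b in zip(ch, prev_ch)]
                     for ch, prev_ch in zip(group, previous)]
        previous = [transform(ch) for ch in mixed]
        outputs.append(previous)

    # concatenate the groups back along the channel axis
    return [ch for group in outputs for ch in group]
```

    Because each later group sees the accumulated output of the earlier ones, later groups have a progressively larger effective receptive field, which is where the multiscale information comes from.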
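    Abstract (translated from the Chinese) Synthetic speech usually has an overly smooth rhythm, so its fundamental frequency (F0) differs from that of real speech, and F0 features carry important information for identifying fake speech. This paper proposes a new F0 subband for the fake speech detection (FSD) task. To model the F0 subband effectively, it also proposes the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net serves as the backbone network to obtain multiscale information, with a spatial reconstruction mechanism added to avoid losing important information. Local attention is designed so that the model focuses on the local information of the F0 subband. Experiments show that the proposed method achieves state-of-the-art performance among single systems on the ASVspoof 2019 LA dataset, with an EER of 0.47% and a min t-DCF of 0.0159.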
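    For readers unfamiliar with the headline metric, the equal error rate (EER) is the operating point at which the false acceptance rate (spoofed speech accepted as real) equals the false rejection rate (real speech rejected). The sketch below estimates it by scanning score thresholds. It is an illustration only, not the official ASVspoof scoring code: the function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER and its threshold from two lists of detection scores.

    Assumes a higher score means the utterance is more likely bona fide.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # false rejection: bona fide utterance scored below the threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # false acceptance: spoofed utterance scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration, not from the paper)
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

    With a finite set of scores the two error rates rarely cross exactly, so the sketch reports their average at the closest threshold. The min t-DCF additionally weights these errors by the costs and priors of a downstream speaker verification system and is not covered here.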