cs.SD - 2023-11-29

FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition

  • paper_url: http://arxiv.org/abs/2311.17790
  • repo_url: None
  • paper_authors: Dongning Yang, Wei Wang, Yanmin Qian
  • for: Improving the performance of speech recognition systems that use monaural speech enhancement front-ends.
  • methods: Proposes a novel approach called FAT-HuBERT, which uses distortion-invariant self-supervised learning (SSL) to improve the robustness of ASR.
  • results: Evaluated on both simulated noisy speech and real-world noisy speech; experiments show that FAT-HuBERT achieves significant reductions in word error rate (WER).
    Abstract Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4 1-channel dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).
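The core mechanism the abstract describes, i.e. layer-wise fusion of features from the noisy and enhanced signals, with the SE frontend drawn at random from a pool during training, can be illustrated with a minimal sketch. The paper does not specify the fusion operation, so the convex combination below (with a per-layer weight `alpha`) and the function names are purely illustrative assumptions, not the authors' implementation.

```python
import random
import numpy as np

def fuse_layer(noisy_feats, enhanced_feats, alpha):
    """Hypothetical layer-wise fusion module: a convex combination of the
    noisy and enhanced feature streams (the actual fusion in FAT-HuBERT
    may differ)."""
    return alpha * enhanced_feats + (1.0 - alpha) * noisy_feats

def forward_with_random_frontend(noisy, frontend_pool, fusion_alphas):
    """One training-style forward pass: pick an SE frontend at random
    from the pool, enhance the noisy input, then fuse the two streams
    once per layer weight."""
    se_frontend = random.choice(frontend_pool)   # random SE frontend per utterance
    enhanced = se_frontend(noisy)
    # One fused representation per "layer" of the backbone.
    return [fuse_layer(noisy, enhanced, a) for a in fusion_alphas]

if __name__ == "__main__":
    noisy = np.ones(4)                           # stand-in for noisy features
    pool = [lambda x: 2.0 * x]                   # toy "enhancement" frontend
    fused = forward_with_random_frontend(noisy, pool, fusion_alphas=[0.5, 1.0])
    print(fused[0])  # 0.5 * enhanced + 0.5 * noisy
```

Randomly sampling the frontend at training time exposes the backbone to the distortion characteristics of many enhancers, which is what makes the learned representation distortion-invariant rather than tied to a single SE model.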