eess.AS - 2023-09-19

Exploring Speech Enhancement for Low-resource Speech Synthesis

  • paper_url: http://arxiv.org/abs/2309.10795
  • repo_url: None
  • paper_authors: Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang
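  • for: Improving text-to-speech (TTS) for low-resource languages, where obtaining high-quality speech data for model training is challenging and expensive.
  • methods: Apply a speech enhancement model to automatic speech recognition (ASR) corpora to augment the training data, then train a TTS system for the low-resource language on the enhanced speech.
  • results: Using Arabic datasets as an example, the proposed pipeline significantly improves on baseline methods in terms of the ASR WER metric. An empirical analysis also examines the correlation between speech enhancement and TTS performance.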
    Abstract: High-quality and intelligible speech is essential to text-to-speech (TTS) model training; however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement to Automatic Speech Recognition (ASR) corpora mitigates the issue by augmenting the training data, but how the nonlinear speech distortion introduced by speech enhancement models affects TTS training still needs to be investigated. In this paper, we train a TF-GridNet speech enhancement model and apply it to low-resource datasets that were collected for the ASR task, then train a discrete-unit-based TTS model on the enhanced speech. We use Arabic datasets as an example and show that the proposed pipeline significantly improves the low-resource TTS system compared with other baseline methods in terms of the ASR WER metric. We also run an empirical analysis of the correlation between speech enhancement and TTS performance.
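The abstract reports results in terms of ASR WER (word error rate). As a rough illustration of how such a metric is typically computed, the sketch below measures the word-level edit distance between a reference text and an ASR transcript, divided by the number of reference words. It is a minimal sketch, not the authors' evaluation code: the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the paper's actual ASR model and text normalization are not specified in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and no text
    normalization (casing, punctuation, diacritics) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In a TTS evaluation of this kind, the reference would usually be the input text and the hypothesis the ASR transcript of the synthesized audio, so a lower WER suggests more intelligible speech.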