eess.AS - 2023-10-05

Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis

  • paper_url: http://arxiv.org/abs/2310.03538
  • repo_url: None
  • paper_authors: Jae-Sung Bae, Joun Yeop Lee, Ji-Hyun Lee, Seongkyu Mun, Taehwa Kang, Hoon-Young Cho, Chanwoo Kim
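  • for: Improving the performance of zero-shot text-to-speech (ZS-TTS) systems
  • methods: Latent Filling (LF), a simple but effective latent space data augmentation method applied in the speaker embedding space of the ZS-TTS system
  • results: LF improves speaker similarity while preserving speech quality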
    Abstract Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
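
    The abstract does not say which augmentation operations LF applies or how its consistency loss is defined. The sketch below is therefore only an illustration of the general idea of augmenting in the speaker embedding space, not the authors' implementation. It assumes interpolation between speaker embeddings plus small Gaussian noise as the augmentation, and a cosine-distance consistency loss. The names `latent_fill`, `consistency_loss`, and `noise_std` are hypothetical.

```python
import numpy as np

def latent_fill(embeddings, rng, noise_std=0.01):
    """Hypothetical augmentation: mix each speaker embedding with a
    randomly chosen partner from the batch, then add Gaussian noise.
    (Assumed operations; the abstract does not specify them.)"""
    batch = embeddings.shape[0]
    partner = rng.permutation(batch)                  # random partner per row
    weight = rng.uniform(0.0, 1.0, size=(batch, 1))   # per-sample mixing weight
    mixed = weight * embeddings + (1.0 - weight) * embeddings[partner]
    return mixed + rng.normal(0.0, noise_std, size=mixed.shape)

def consistency_loss(target, re_encoded):
    """Assumed form of a consistency loss: mean cosine distance between
    the augmented embedding fed to the synthesizer and the embedding
    re-extracted from the generated speech."""
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    r = re_encoded / np.linalg.norm(re_encoded, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * r, axis=1)))

rng = np.random.default_rng(0)
speaker_embeddings = rng.normal(size=(4, 8))   # toy batch: 4 speakers, 8 dims
filled = latent_fill(speaker_embeddings, rng)
```

    An augmented embedding has no matching ground-truth recording, so in a setup like this a consistency term computed on the embeddings is one way to supervise the model within a single training stage. This is consistent with the abstract's claim that no additional training stages are needed.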