eess.AS - 2023-09-20

A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers

  • paper_url: http://arxiv.org/abs/2309.11487
  • repo_url: None
  • paper_authors: Slava Shechtman, Raul Fernandez
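  • for: This study aims to develop a neural TTS system that can extract fine-grained prosody from a parallel-text recording by an unseen speaker and apply it to different TTS voices, yielding more natural and expressive synthesized speech.
  • methods: The study uses a neural TTS system equipped with prosody-control functionality, which allows more direct shaping of the speech output at inference time.
  • results: According to a set of subjective listening experiments, the system precisely transfers prosody from parallel-text recordings by novel speakers to various trained TTS voices with no quality degradation, while preserving the target TTS voice's identity.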
    Abstract Modern neural TTS systems are capable of generating natural and expressive speech when provided with sufficient amounts of training data. Such systems can be equipped with prosody-control functionality, allowing for more direct shaping of the speech output at inference time. In some TTS applications, it may be desirable to have an option that guides the TTS system with an ad-hoc speech recording exemplar to impose an implicit fine-grained, user-preferred prosodic realization for certain input prompts. In this work we present a first-of-its-kind neural TTS system equipped with such functionality to transfer the prosody from a parallel text recording from an unseen speaker. We demonstrate that the proposed system can precisely transfer the speech prosody from novel speakers to various trained TTS voices with no quality degradation, while preserving the target TTS speakers' identity, as evaluated by a set of subjective listening experiments.