results: Experimental results show that the METTS model effectively addresses the cross-speaker and cross-lingual transfer problem and produces high-quality synthetic speech.

Abstract
Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of synthesized speech because of the challenges of cross-speaker, cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes the Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling that disentangles speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal to better disentangle speaker timbre from the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of METTS's design.
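The abstract does not specify how the formant-shift perturbation is implemented. As a rough illustration of the general idea, the sketch below warps each STFT frame's magnitude spectrum along the frequency axis so that spectral envelope peaks (formants) move while pitch-bearing phase is kept; this is a minimal assumed approach, not METTS's actual pre-processing code.

```python
import numpy as np

def formant_shift(wav, ratio=1.15, n_fft=1024, hop=256):
    """Crudely shift formants upward by `ratio` via frequency-axis
    warping of each frame's magnitude spectrum (illustrative only;
    the paper does not give this exact implementation)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    out = np.zeros(len(wav))
    norm = np.zeros(len(wav))
    bins = np.arange(n_fft // 2 + 1, dtype=float)
    for i in range(n_frames):
        s = i * hop
        frame = wav[s:s + n_fft] * win
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # sample the magnitude envelope at f / ratio, so energy that
        # was at frequency f0 now appears near f0 * ratio
        shifted = np.interp(bins / ratio, bins, mag)
        rec = np.fft.irfft(shifted * np.exp(1j * phase), n_fft)
        # windowed overlap-add reconstruction
        out[s:s + n_fft] += rec * win
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# usage: perturb a reference waveform before feeding it to the model
wav = np.sin(2 * np.pi * 220 * np.arange(8000) / 16000)
perturbed = formant_shift(wav, ratio=1.15)
```

The intent of such a perturbation is to alter timbre cues (formant positions) while leaving prosodic content largely intact, which helps the model disentangle speaker identity from emotion.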
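The abstract also does not detail the vector-quantization-based emotion matcher. A minimal illustrative sketch of the underlying idea, assuming emotion embeddings and a toy k-means codebook (the names and design here are hypothetical, not METTS's actual matcher), is:

```python
import numpy as np

def build_codebook(embeddings, k=4, iters=20, seed=0):
    """Toy k-means codebook over emotion embeddings (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = embeddings[assign == j]
            if len(pts):  # skip empty clusters
                centers[j] = pts.mean(axis=0)
    return centers

def match_reference(query, refs, centers):
    """Quantize the query to its nearest code, then return the index
    of the nearest reference within that code's cluster."""
    code = np.linalg.norm(centers - query, axis=-1).argmin()
    ref_codes = np.linalg.norm(
        refs[:, None] - centers[None], axis=-1).argmin(axis=1)
    pool = np.where(ref_codes == code)[0]
    if len(pool) == 0:  # fall back to all references
        pool = np.arange(len(refs))
    return pool[np.linalg.norm(refs[pool] - query, axis=-1).argmin()]
```

Selecting references through quantized codes rather than raw nearest-neighbor search groups references by coarse emotion category first, which is one plausible way to trade off naturalness against emotion diversity as the abstract describes.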