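results: The study finds that using a joint speaker encoder and a phonetic posteriorgram enables high-quality cross-lingual voice conversion while preserving the naturalness and characteristics of the speech.
Abstract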
Voice conversion systems have made significant advancements in terms of naturalness and similarity in common voice conversion tasks. However, their performance in more complex tasks such as cross-lingual voice conversion and expressive voice conversion remains imperfect. In this study, we propose a novel approach that combines a jointly trained speaker encoder and content features extracted from the cross-lingual speech recognition model Whisper to achieve high-quality cross-lingual voice conversion. Additionally, we introduce a speaker consistency loss to the joint encoder, which improves the similarity between the converted speech and the reference speech. To further explore the capabilities of the joint speaker encoder, we use the phonetic posteriorgram as the content feature, which enables the model to effectively reproduce both the speaker characteristics and the emotional aspects of the reference speech.
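The abstract does not state the exact form of the speaker consistency loss. As a minimal sketch, assuming the loss is a cosine distance between speaker embeddings of the reference speech and of the converted speech (a common choice, not confirmed by the source), it could look like the following. The function names `cosine_similarity` and `speaker_consistency_loss` are hypothetical, and the embeddings are plain Python lists standing in for speaker encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(
    ref_embs: Sequence[Sequence[float]],
    conv_embs: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over a batch of embedding pairs.

    ASSUMPTION: cosine distance is used here for illustration only;
    the source does not specify the distance measure.
    ref_embs:  speaker embeddings of the reference utterances.
    conv_embs: speaker embeddings of the corresponding converted utterances.
    """
    if len(ref_embs) != len(conv_embs) or len(ref_embs) == 0:
        raise ValueError("batches must be non-empty and of equal size")
    distances = [
        1.0 - cosine_similarity(r, c) for r, c in zip(ref_embs, conv_embs)
    ]
    return sum(distances) / len(distances)
```

Under this assumption the loss is 0 when the converted speech yields the same embedding direction as the reference and grows toward 2 as the embeddings point in opposite directions. In an actual training setup the same computation would be written in a differentiable framework so that gradients can reach the jointly trained speaker encoder.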