results: Official challenge results show that our system achieves superior naturalness, ranking 1st in Task 1 and 2nd in Task 2. Further ablation studies justify the effectiveness of our system design.
Abstract
This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing voice conversion model is based on VITS and incorporates four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. In particular, we leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input to the prior encoder. Before BNF extraction, we apply pitch perturbation to the source signal to remove speaker timbre, which effectively prevents leakage of the source speaker's timbre into the target. Moreover, the PBTC module extracts multi-scale F0 as an auxiliary input to the prior encoder, thereby better capturing the pitch variations of singing. We design a three-stage training strategy to adapt the base model to the target speaker with limited target-speaker data. Official challenge results show that our system achieves superior naturalness, ranking 1st in Task 1 and 2nd in Task 2. Further ablation studies justify the effectiveness of our system design.
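The pitch-perturbation step can be illustrated with a minimal sketch. The abstract does not specify the exact perturbation method, so the random-resampling approach below, along with the function name `random_pitch_perturb` and its parameters, is an assumption for illustration only: resampling a waveform by factor `r` scales all frequencies (and thus pitch) by `r`, which disturbs speaker-dependent spectral cues before the signal is fed to Whisper for BNF extraction.

```python
import numpy as np

def random_pitch_perturb(wav, low=0.9, high=1.1, rng=None):
    """Randomly scale the pitch of `wav` by resampling (a simple stand-in
    for the paper's pitch perturbation; the exact method and the scale
    range [low, high] are assumptions, not taken from the paper).

    Resampling by factor r shifts all frequencies by r and changes the
    duration by 1/r.
    """
    if rng is None:
        rng = np.random.default_rng()
    r = rng.uniform(low, high)
    n_out = int(round(len(wav) / r))
    # positions in the source signal corresponding to each output sample
    pos = np.arange(n_out) * r
    return np.interp(pos, np.arange(len(wav)), wav)
```

With `low == high == 1.0` the signal is returned unchanged; with a factor of 2.0 the output is half as long and an octave higher. A production system would more likely use a formant-preserving pitch shifter, since plain resampling also changes tempo.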
The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
results: The two sub-tracks of French text-to-speech synthesis showed large differences in their predictability, while singing voice-converted samples were not as difficult to predict as expected.
Abstract
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches.
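As a small illustration of the quantities involved, the sketch below computes a sample's MOS from listener ratings and a Spearman rank correlation between predicted and true scores. Spearman correlation is a metric commonly used to evaluate MOS predictors, but the challenge's exact evaluation protocol is not given in this abstract, so treat the metric choice and the helper names `mos` and `spearman` as assumptions:

```python
import numpy as np

def mos(ratings):
    """Mean opinion score: the average of individual listener ratings
    (typically on a 1-5 scale) for one speech sample."""
    return float(np.mean(ratings))

def spearman(x, y):
    """Spearman rank correlation between predicted and true scores.
    Minimal version: ranks via double argsort, then Pearson correlation
    of the ranks. Note this does not average ranks over ties, unlike a
    full implementation."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

For example, `mos([4, 5, 3, 4])` is 4.0, and a predictor whose scores rank systems in the same order as the true MOS gets a Spearman correlation of 1.0 regardless of any monotonic miscalibration, which is why rank correlations are popular for this task.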