results: Experimental results show that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
Abstract
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
results: The dataset contains 38.5 hours of data in total, recorded by 80 volunteers.
Abstract
We introduce the ÌròyìnSpeech corpus -- a new dataset influenced by a desire to increase the amount of high quality, freely available, contemporary Yorùbá speech. We release a multi-purpose dataset that can be used for both TTS and ASR tasks. We curated text sentences from the news and creative writing domains under an open license (CC-BY-4.0) and had multiple speakers record each sentence. We provide 5000 of our utterances to the Common Voice platform to crowdsource transcriptions online. The dataset has 38.5 hours of data in total, recorded by 80 volunteers.