results: Experiments show that Vec-Tok Speech, built on 50k hours of speech data, outperforms other SOTA models.

Abstract
Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that unifies multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok.
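To make the BPE step in the abstract more concrete, below is a minimal, hypothetical sketch of greedy pair-merging over a discrete semantic-token sequence. The token IDs, merge budget, and helper names (`most_frequent_pair`, `merge_pair`, `bpe_compress`) are illustrative assumptions, not the paper's actual tokenizer, vocabulary, or implementation.

```python
# Hypothetical sketch: greedy BPE-style merging of a semantic-token sequence
# to shorten it before language modeling. Purely illustrative; not the
# Vec-Tok Speech codebase.
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent token pair, or None if too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_token):
    """Replace every occurrence of `pair` with the single token `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Greedily merge frequent pairs, returning the shortened sequence and merges."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


if __name__ == "__main__":
    # Toy semantic-token stream (e.g., discrete indices from a speech encoder).
    tokens = [3, 7, 3, 7, 3, 7, 12, 3, 7, 12, 5]
    compressed, merges = bpe_compress(tokens, num_merges=2, first_new_id=1000)
    print(len(tokens), "->", len(compressed))  # 11 -> 6: shorter LM input
    print(compressed, merges)
```

In this sketch the toy sequence shrinks from 11 to 6 tokens, which is the effect the abstract attributes to BPE: shorter token sequences (lower bit rate) mean less exposure bias and longer effective context for the LM.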