cs.SD - 2023-07-29

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

  • paper_url: http://arxiv.org/abs/2307.16012
  • repo_url: None
  • paper_authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng
  • for: This paper proposes a multi-scale style modeling method for expressive speech synthesis, aiming to make speech in human-computer interaction scenarios more natural and expressive.
  • methods: Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained jointly with a FastSpeech 2 based acoustic model. The predictor explores hierarchical context information by considering structural relationships in the context and predicts the style embeddings; the extractor extracts multi-scale style embeddings from the ground-truth speech (a hedged sketch of this multi-scale conditioning follows the abstract below).
  • results: Experiments comparing the method against three baselines show a clear advantage on both in-domain and out-of-domain audiobook datasets. The paper also analyzes the context information and multi-scale style representations, which had not been discussed before.
    Abstract Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the style embeddings at one single scale from the information within the current sentence, while context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore hierarchical context information by considering structural relationships in the context and to predict style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations that have not been discussed before.
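The abstract describes predicting style embeddings at global, sentence, and subword level and injecting them into a FastSpeech 2 based acoustic model. The sketch below illustrates one way such multi-scale conditioning could be wired up; the module names, dimensions, additive injection, and subword-to-phoneme alignment map are assumptions for illustration only, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): conditioning phoneme encoder outputs
# on global-, sentence-, and subword-level style embeddings before a
# FastSpeech-2-style variance adaptor. Names and shapes are assumptions.
import torch
import torch.nn as nn


class MultiScaleStyleConditioner(nn.Module):
    def __init__(self, hidden_dim: int = 256, style_dim: int = 128):
        super().__init__()
        # One projection per style scale into the acoustic model's hidden space.
        self.global_proj = nn.Linear(style_dim, hidden_dim)
        self.sentence_proj = nn.Linear(style_dim, hidden_dim)
        self.subword_proj = nn.Linear(style_dim, hidden_dim)

    def forward(self, encoder_out, global_style, sentence_style, subword_style,
                subword_to_phone):
        # encoder_out:      (B, T_phone, hidden_dim) phoneme encoder outputs
        # global_style:     (B, style_dim)           one vector per wider context
        # sentence_style:   (B, style_dim)           one vector per sentence
        # subword_style:    (B, T_sub, style_dim)    one vector per subword
        # subword_to_phone: (B, T_phone) long tensor mapping phonemes to subwords
        g = self.global_proj(global_style).unsqueeze(1)      # (B, 1, H), broadcast over time
        s = self.sentence_proj(sentence_style).unsqueeze(1)  # (B, 1, H)
        w = self.subword_proj(subword_style)                 # (B, T_sub, H)
        # Upsample subword-level styles to phoneme resolution via the alignment map.
        w = torch.gather(w, 1, subword_to_phone.unsqueeze(-1).expand(-1, -1, w.size(-1)))
        return encoder_out + g + s + w                       # styles injected additively


if __name__ == "__main__":
    B, T_phone, T_sub, H, S = 2, 20, 6, 256, 128
    cond = MultiScaleStyleConditioner(H, S)
    out = cond(torch.randn(B, T_phone, H), torch.randn(B, S), torch.randn(B, S),
               torch.randn(B, T_sub, S), torch.randint(0, T_sub, (B, T_phone)))
    print(out.shape)  # torch.Size([2, 20, 256])
```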

Moisesdb: A dataset for source separation beyond 4-stems

  • paper_url: http://arxiv.org/abs/2307.15913
  • repo_url: https://github.com/moises-ai/moises-db
  • paper_authors: Igor Pereira, Felipe Araújo, Filip Korzeniowski, Richard Vogl
  • for: This paper introduces the MoisesDB dataset for musical source separation.
  • methods: The audio sources are organized in a two-level hierarchical taxonomy of stems, and an easy-to-use Python library is provided to download, process, and use the MoisesDB dataset (a hedged sketch of mapping such a taxonomy down to four stems follows the abstract below).
  • results: The paper provides baseline results for open-source separation models at different separation granularities and analyzes the dataset contents.
    Abstract In this paper, we introduce the MoisesDB dataset for musical source separation. It consists of 240 tracks from 45 artists, covering twelve musical genres. For each song, we provide its individual audio sources, organized in a two-level hierarchical taxonomy of stems. This facilitates building and evaluating fine-grained source separation systems that go beyond the limitation of using four stems (drums, bass, other, and vocals) imposed by the lack of data. To ease the adoption of this dataset, we publish an easy-to-use Python library to download, process, and use MoisesDB. Alongside thorough documentation and analysis of the dataset contents, this work provides baseline results for open-source separation models at varying separation granularities (four, five, and six stems) and discusses their results.
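The dataset organizes each song's sources in a two-level hierarchical taxonomy of stems, which enables training targets beyond the usual four stems. Below is a minimal sketch, under stated assumptions, of collapsing such a taxonomy back into 4-stem targets by summing fine-grained sources; the taxonomy entries and the load_audio helper are hypothetical placeholders, and this is not the published moises-db library API.

```python
# Hedged sketch (not the moises-db library API): collapsing a two-level stem
# taxonomy into the classic 4-stem targets (drums, bass, vocals, other) by
# summing fine-grained source waveforms. Keys and the loader are hypothetical.
import numpy as np

# Hypothetical two-level taxonomy: coarse stem -> list of fine-grained sources.
TAXONOMY = {
    "drums": ["kick", "snare", "cymbals"],
    "bass": ["bass_guitar", "bass_synth"],
    "vocals": ["lead_vocals", "background_vocals"],
    "other": ["guitar", "piano", "strings"],
}


def load_audio(track_id: str, source_name: str, num_samples: int = 44100) -> np.ndarray:
    """Hypothetical loader; returns a mono waveform for one fine-grained source."""
    rng = np.random.default_rng(hash((track_id, source_name)) % (2**32))
    return rng.standard_normal(num_samples).astype(np.float32)


def render_four_stems(track_id: str) -> dict[str, np.ndarray]:
    """Sum the fine-grained sources of each coarse category into one stem."""
    stems = {}
    for coarse, fine_sources in TAXONOMY.items():
        stems[coarse] = sum(load_audio(track_id, s) for s in fine_sources)
    return stems


if __name__ == "__main__":
    four_stems = render_four_stems("example_track")
    print({name: wav.shape for name, wav in four_stems.items()})
```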

UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.15898
  • repo_url: None
  • paper_authors: Sen Fang, Bowen Gao, Yangjian Wu, Jingwen Cai, Teik Toe Teoh
  • for: This paper proposes a universal language representation learning method based on Bridging-Vision-and-Language (BriVL) to enable the development of multimodal applications.
  • methods: The method embeds audio, images, and text into a shared space, addressing the main challenges in robust multimodal language representation learning while effectively capturing the correlation between audio and images (a hedged sketch of such audio-image contrastive alignment follows the abstract below).
  • results: Experimental results show that UniBriVL is effective on downstream tasks and can generate corresponding images from audio; a qualitative evaluation further finds that the generated images are of high quality.
    Abstract Multimodal large models have been recognized for their advantages in various performance and downstream tasks. The development of these models is crucial towards achieving general artificial intelligence in the future. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal BriVL embeds audio, image, and text into a shared space, enabling the realization of various multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. Additionally, we present a qualitative evaluation of the images generated by UniBriVL, which highlights the potential of our approach in creating images from audio. Overall, our experimental results demonstrate the efficacy of UniBriVL in downstream tasks and its ability to choose appropriate images from audio. The proposed approach has potential for various applications such as speech recognition, music signal processing, and captioning systems.
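The abstract describes embedding audio, image, and text into a shared space and capturing the correlation between audio and image. A common way to learn such a shared space is a symmetric contrastive (InfoNCE-style) objective; the sketch below is a hedged illustration of that idea, not the UniBriVL implementation, and the dimensions and temperature are assumptions.

```python
# Hedged sketch (not the UniBriVL implementation): a symmetric InfoNCE-style
# contrastive loss that pulls paired audio and image embeddings together in a
# shared space. Encoder choices, dimensions, and temperature are assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # audio_emb, image_emb: (B, D) embeddings from modality-specific encoders.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = audio_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric loss: audio->image and image->audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, D = 8, 512
    loss = contrastive_alignment_loss(torch.randn(B, D), torch.randn(B, D))
    print(float(loss))
```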