results: Experimental results show that the proposed method significantly outperforms three baseline methods on both in-domain and out-of-domain audiobook datasets. In addition, the paper analyzes the context information and multi-scale style representations, which have never been discussed before.
Abstract
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from the information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore hierarchical context information by considering structural relationships in the context and to predict style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations, which have not been discussed before.
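To make the multi-scale idea concrete, below is a minimal, illustrative sketch (not the authors' released code) of a predictor that maps context sentence encodings to global-, sentence-, and subword-level style embeddings; the module choices, dimensions, and names are assumptions for illustration only.

```python
# Hypothetical sketch of multi-scale style prediction, not MSStyleTTS itself.
import torch
import torch.nn as nn


class MultiScaleStylePredictor(nn.Module):
    def __init__(self, d_text=256, d_style=128):
        super().__init__()
        # Aggregates the surrounding sentences into one global style vector.
        self.global_head = nn.Sequential(nn.Linear(d_text, d_style), nn.Tanh())
        # Summarizes the sentence sequence into a current-sentence style vector.
        self.sentence_head = nn.GRU(d_text, d_style, batch_first=True)
        # Refines the sentence style into per-subword styles.
        self.subword_head = nn.Linear(d_text + d_style, d_style)

    def forward(self, context_sents, current_subwords):
        # context_sents:    (B, n_sentences, d_text) encodings of neighboring sentences
        # current_subwords: (B, n_subwords, d_text) encodings of the current sentence
        global_style = self.global_head(context_sents.mean(dim=1))   # (B, d_style)
        _, h = self.sentence_head(context_sents)                     # (1, B, d_style)
        sentence_style = h.squeeze(0)                                # (B, d_style)
        expanded = sentence_style.unsqueeze(1).expand(-1, current_subwords.size(1), -1)
        subword_style = self.subword_head(
            torch.cat([current_subwords, expanded], dim=-1))         # (B, n_subwords, d_style)
        return global_style, sentence_style, subword_style


# In training, embeddings from a ground-truth style extractor would supervise
# these predictions before conditioning a FastSpeech 2 based acoustic model.
outs = MultiScaleStylePredictor()(torch.randn(2, 5, 256), torch.randn(2, 20, 256))
print([o.shape for o in outs])
```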
MoisesDB: A dataset for source separation beyond 4-stems
results: This paper provides baseline results for open-source separation models at varying granularities and analyzes the contents of the dataset.
Abstract
In this paper, we introduce the MoisesDB dataset for musical source separation. It consists of 240 tracks from 45 artists, covering twelve musical genres. For each song, we provide its individual audio sources, organized in a two-level hierarchical taxonomy of stems. This will facilitate building and evaluating fine-grained source separation systems that go beyond the limitation of using four stems (drums, bass, other, and vocals) due to a lack of data. To facilitate the adoption of this dataset, we publish an easy-to-use Python library to download, process, and use MoisesDB. Alongside thorough documentation and analysis of the dataset contents, this work provides baseline results for open-source separation models at varying separation granularities (four, five, and six stems) and discusses their results.
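As a rough illustration of the two-level taxonomy (and not the MoisesDB Python library's actual API), the sketch below collapses fine-grained sources into the classic four top-level stems; the sub-stem names are hypothetical placeholders.

```python
# Hypothetical two-level stem taxonomy; only the four top-level stem names
# (vocals, drums, bass, other) come from the paper, the children are assumed.
import numpy as np

TAXONOMY = {
    "vocals": ["lead_vocals", "background_vocals"],
    "drums": ["kick", "snare", "cymbals"],
    "bass": ["bass_guitar", "bass_synth"],
    "other": ["guitar", "piano", "strings"],
}


def mixdown(sources, taxonomy):
    """Sum fine-grained source waveforms into their top-level stems."""
    stems = {}
    for stem, children in taxonomy.items():
        present = [sources[c] for c in children if c in sources]
        if present:
            stems[stem] = np.sum(present, axis=0)
    return stems


# Toy usage: one second of silence per fine-grained source at 44.1 kHz.
sources = {name: np.zeros(44_100) for kids in TAXONOMY.values() for name in kids}
four_stems = mixdown(sources, TAXONOMY)
print(sorted(four_stems))  # ['bass', 'drums', 'other', 'vocals']
```

A finer granularity (five or six stems) would simply use a taxonomy whose top level splits out, for example, guitars or keys from "other".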
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
results: Our experimental results show that UniBriVL is effective on downstream tasks and can generate corresponding images from audio. In addition, we conduct a qualitative evaluation and find that UniBriVL can generate high-quality images.
Abstract
Multimodal large models have been recognized for their advantages across a wide range of performance benchmarks and downstream tasks. The development of these models is crucial to achieving general artificial intelligence in the future. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal BriVL embeds audio, image, and text into a shared space, enabling the realization of various multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and images. Additionally, we present a qualitative evaluation of the images generated by UniBriVL, which highlights the potential of our approach for creating images from audio. Overall, our experimental results demonstrate the efficacy of UniBriVL in downstream tasks and its ability to choose appropriate images from audio. The proposed approach has potential for various applications such as speech recognition, music signal processing, and captioning systems.
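To illustrate what a shared audio-image-text embedding space makes possible, here is a small sketch (not the UniBriVL implementation) of choosing an image for an audio clip by cosine similarity; the embeddings are stand-ins for the outputs of jointly trained encoders.

```python
# Hypothetical audio-to-image retrieval in a shared embedding space.
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Stand-in embeddings; a BriVL-style model would produce these with its
# audio and image encoders (dimension 512 is an assumption).
audio_emb = l2_normalize(rng.normal(size=(1, 512)))     # one audio query
image_embs = l2_normalize(rng.normal(size=(100, 512)))  # gallery of 100 images

# After L2 normalization, cosine similarity is just a dot product.
scores = image_embs @ audio_emb.T                       # (100, 1)
best = int(np.argmax(scores))
print(f"best-matching image index: {best}, score: {scores[best, 0]:.3f}")
```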