Results: This study shows that the Mustango system achieves high-quality music generation, can extract musical attributes from text descriptions, and can control the diffusion process to realize desired musical characteristics such as instruments, harmony, tempo, and key.

Abstract
With recent advancements in text-to-audio and text-to-music based on latent diffusion models, the quality of generated content has been reaching new heights. The controllability of musical aspects, however, has not yet been explicitly explored in text-to-music systems. In this paper, we present Mustango, a music-domain-knowledge-inspired, diffusion-based text-to-music system that expands the Tango text-to-audio model. Mustango aims to control the generated music not only with general text captions, but also with richer captions that can include specific instructions related to chords, beats, tempo, and key. As part of Mustango, we propose MuNet, a Music-Domain-Knowledge-Informed UNet sub-module that integrates these music-specific features, which we predict from the text prompt, along with the general text embedding, into the diffusion denoising process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that alters the harmonic, rhythmic, and dynamic aspects of music audio and uses state-of-the-art Music Information Retrieval methods to extract the music features, which are then appended in text form to the existing descriptions. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and that its controllability through music-specific text prompts greatly outperforms other models in terms of desired chords, beats, key, and tempo on multiple datasets.