cs.SD - 2023-11-19

Encoding Performance Data in MEI with the Automatic Music Performance Analysis and Comparison Toolkit (AMPACT)

  • paper_url: http://arxiv.org/abs/2311.11363
  • repo_url: None
  • paper_authors: Johanna Devaney, Cecilia Beauchamp
  • for: This paper introduces a new method of encoding performance data in MEI, using the recently added \texttt{} element to store performance data.
  • methods: Performance data is extracted with the Automatic Music Performance Analysis and Comparison Toolkit (AMPACT) and encoded as a JSON object within an \texttt{} element linked to a specific note in the score (a minimal sketch of this encoding follows this summary).
  • results: A set of pop music vocals is encoded to demonstrate both the range of descriptors the \texttt{} element can carry and how AMPACT can extract performance data in the absence of a fully specified musical score.
    Abstract This paper presents a new method of encoding performance data in MEI using the recently added \texttt{} element. Performance data was extracted using the Automatic Music Performance Analysis and Comparison Toolkit (AMPACT) and encoded as a JSON object within an \texttt{} element linked to a specific musical note. A set of pop music vocals was encoded to demonstrate both the range of descriptors that can be encoded in the \texttt{} element and how AMPACT can be used for extracting performance data in the absence of a fully specified musical score.
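    A minimal sketch of the encoding idea summarized above: attaching a JSON object of performance descriptors to a specific note in an MEI document. The element name ("extData"), the descriptor fields, and their values are illustrative assumptions rather than details taken from the paper (the summary leaves the MEI element unnamed); consult the paper and the MEI Guidelines for the actual element and attribute names.

        # Sketch: embed AMPACT-style performance descriptors as JSON inside an
        # MEI note. Element and field names here are placeholders, not the paper's.
        import json
        import xml.etree.ElementTree as ET

        MEI_NS = "http://www.music-encoding.org/ns/mei"
        ET.register_namespace("", MEI_NS)

        # A note from the score, carrying an xml:id so performance data can refer to it.
        note = ET.Element(f"{{{MEI_NS}}}note", {
            "{http://www.w3.org/XML/1998/namespace}id": "note_0001",
            "pname": "c", "oct": "4", "dur": "4",
        })

        # Hypothetical per-note descriptors of the kind AMPACT estimates (illustrative values).
        descriptors = {
            "onset_sec": 12.43,
            "offset_sec": 12.91,
            "mean_f0_hz": 262.1,
            "vibrato_rate_hz": 5.6,
        }

        # Store the descriptors as a JSON payload in a child element of the note.
        ext = ET.SubElement(note, f"{{{MEI_NS}}}extData")
        ext.text = json.dumps(descriptors)

        print(ET.tostring(note, encoding="unicode"))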

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.11255
  • repo_url: None
  • paper_authors: Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying Shan
  • for: This work aims to use large language models (LLMs) to understand and generate music across different modalities.
  • methods: Pretrained MERT, ViT, and ViViT models are used to understand music, images, and video, respectively; AudioLDM 2 and MusicGen are explored for music generation, with LLaMA 2 bridging multi-modal understanding and music generation (a schematic sketch of this pipeline follows this summary).
  • results: Combining multi-modal understanding with music generation unlocks creative potential from diverse sources of inspiration; experiments show the model matches or surpasses current state-of-the-art models.
    Abstract The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates LLM's abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video through the use of pretrained MERT, ViT, and ViViT models, respectively. To enable music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging multi-modal understanding and music generation is accomplished through the integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA model to generate extensive datasets that support text/image/video-to-music generation, facilitating the training of our M$^{2}$UGen framework. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models.
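    A schematic sketch (not the authors' code) of the pipeline described in the abstract above: pretrained modality encoders (MERT for music, ViT for images, ViViT for video) feed an LLM bridge (LLaMA 2), whose output conditions a music decoder (AudioLDM 2 or MusicGen). Every class and callable name below is an illustrative placeholder standing in for the real models, not an actual API.

        # Structural sketch of a multi-modal understanding-to-generation pipeline.
        # The callables stand in for pretrained models; none of this is M^2UGen's code.
        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class M2UGenSketch:
            music_encoder: Callable   # stand-in for a pretrained MERT encoder (audio -> features)
            image_encoder: Callable   # stand-in for a pretrained ViT encoder (image -> features)
            video_encoder: Callable   # stand-in for a pretrained ViViT encoder (video -> features)
            llm_bridge: Callable      # stand-in for LLaMA 2 with adapters (prompt + features -> conditioning)
            music_decoder: Callable   # stand-in for AudioLDM 2 or MusicGen (conditioning -> audio)

            def generate(self, prompt, music=None, image=None, video=None):
                """Encode whichever modalities are provided, fuse them with the text
                prompt through the LLM bridge, and decode the result into music."""
                features = []
                if music is not None:
                    features.append(self.music_encoder(music))
                if image is not None:
                    features.append(self.image_encoder(image))
                if video is not None:
                    features.append(self.video_encoder(video))
                conditioning = self.llm_bridge(prompt, features)
                return self.music_decoder(conditioning)

        # Toy wiring with stand-in callables, just to show the data flow:
        toy = M2UGenSketch(
            music_encoder=lambda m: ("music-features", m),
            image_encoder=lambda i: ("image-features", i),
            video_encoder=lambda v: ("video-features", v),
            llm_bridge=lambda prompt, feats: {"prompt": prompt, "features": feats},
            music_decoder=lambda cond: f"audio conditioned on {cond}",
        )
        print(toy.generate("an upbeat guitar track", image="sunset.jpg"))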