Results: The paper achieves state-of-the-art or at least competitive results on 11 tasks, and finds that the UniAudio model shows strong capability across all of its training tasks.

Abstract
Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) from given input conditions. UniAudio 1) first tokenizes all types of target audio along with the other condition modalities, 2) concatenates each source-target pair into a single sequence, and 3) performs next-token prediction with an LLM. In addition, a multi-scale Transformer model is proposed to handle the overly long sequences produced by the residual-vector-quantization-based neural codec used in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters across all generative tasks, aiming to acquire sufficient prior knowledge not only of the intrinsic properties of audio but also of the inter-relationships between audio and other modalities. The trained UniAudio model therefore has the potential to become a foundation model for universal audio generation: it shows strong capability on all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio
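The recipe above is straightforward to sketch in code. Below is a minimal illustration (not the authors' implementation) of concatenating condition tokens and target audio tokens into one sequence and training with next-token prediction under a causal mask; the vocabulary size, special token IDs, and the tiny Transformer configuration are hypothetical placeholders, and a plain causal Transformer encoder stands in for UniAudio's multi-scale architecture.

```python
# Minimal sketch of the UniAudio-style formulation: one flat sequence,
# one next-token-prediction loss. All sizes/IDs below are illustrative.
import torch
import torch.nn as nn

VOCAB = 1024          # hypothetical joint vocabulary (condition + codec tokens)
BOS, TASK_TTS = 0, 1  # hypothetical special tokens: sequence start, task tag

def build_sequence(cond_tokens, audio_tokens):
    """Concatenate [BOS, task id, condition, target audio] into one sequence."""
    return torch.tensor([BOS, TASK_TTS] + cond_tokens + audio_tokens)

embed = nn.Embedding(VOCAB, 256)
encoder = nn.TransformerEncoder(  # stand-in for the multi-scale Transformer
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(256, VOCAB)

seq = build_sequence(cond_tokens=[5, 6, 7], audio_tokens=[100, 101, 102])
x = embed(seq[:-1]).unsqueeze(0)                        # inputs, shifted right
mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
logits = head(encoder(x, mask=mask))                    # causal self-attention
loss = nn.functional.cross_entropy(logits.squeeze(0), seq[1:])  # next token
print(loss.item())
```

In the actual system, the per-frame residual codec tokens make these sequences very long, which is what motivates the multi-scale Transformer mentioned in the abstract.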
Pianist Identification Using Convolutional Neural Networks
For: This study aims to automatically identify virtuoso pianists in expressive performances using deep learning, addressing a challenge in building intelligent musical instruments and smart music systems. * Methods: We use convolutional neural networks with expressive features, applying and refining deep learning techniques on large-scale expressive piano performance datasets. * Results: Our model reaches 85.3% accuracy on a 6-way identification task, 20.8% higher than the baseline. Our refined dataset also provides better training data, making a substantial contribution to automatic performer identification.

Abstract
This paper presents a comprehensive study of automatic performer identification in expressive piano performances using convolutional neural networks (CNNs) and expressive features. Our work addresses the challenging multi-class classification task of identifying virtuoso pianists, which has substantial implications for building intelligent, dynamic musical instruments and smart musical systems. Incorporating recent advancements, we leveraged large-scale expressive piano performance datasets and deep learning techniques. We refined the scores by expanding repetitions and ornaments for more accurate feature extraction. We demonstrated the capability of one-dimensional CNNs for identifying pianists based on expressive features and analyzed the impact of input sequence lengths and different features. The proposed model outperforms the baseline, achieving 85.3% accuracy in a 6-way identification task. Our refined dataset proved more apt for training a robust pianist identifier, making a substantial contribution to the field of automatic performer identification. Our code has been released at https://github.com/BetsyTang/PID-CNN.
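As a concrete illustration of a one-dimensional CNN classifier of the kind described above, here is a minimal PyTorch sketch; the feature dimensionality, sequence length, and layer sizes are assumed placeholders rather than the paper's configuration (the released code at the GitHub link above has the actual model).

```python
# Minimal sketch: a 1D CNN over per-frame expressive features (e.g. tempo,
# dynamics deviations) that classifies a performance into one of 6 pianists.
import torch
import torch.nn as nn

class PianistCNN(nn.Module):
    def __init__(self, n_features=8, n_classes=6):  # both values assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> length-invariant
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (batch, n_features, time)
        h = self.net(x).squeeze(-1)    # (batch, 64)
        return self.classifier(h)      # logits over the 6 pianists

# Example: a batch of 4 performances, 8 expressive features, 500 time steps.
logits = PianistCNN()(torch.randn(4, 8, 500))
print(logits.shape)  # torch.Size([4, 6])
```

The global pooling over the time axis is one simple way to let the same network handle the varying input sequence lengths whose impact the paper analyzes.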