cs.SD - 2023-11-21

Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates

  • paper_url: http://arxiv.org/abs/2311.13058
  • repo_url: None
  • paper_authors: Marco Pasini, Stefan Lattner, George Fazekas
  • for: This paper is focused on developing a self-supervised music source separation system that does not require audio queries during inference time, making it more suitable for genres with varied timbres and effects.
  • methods: The proposed method uses a query-based approach during training, but substitutes the continuous embedding of the query audio with Vector Quantized (VQ) representations. The model is trained end-to-end with up to N classes, as determined by the VQ's codebook size, and seeks to effectively categorize instrument classes (a minimal VQ sketch follows this entry).
  • results: The proposed method is demonstrated to be effective in separating music sources, even for genres with diverse instrumentation and effects. The authors provide examples and additional results online.
    Abstract Music source separation is focused on extracting distinct sonic elements from composite tracks. Historically, many methods have been grounded in supervised learning, necessitating labeled data, which is occasionally constrained in its diversity. More recent methods have delved into N-shot techniques that utilize one or more audio samples to aid in the separation. However, a challenge with some of these methods is the necessity for an audio query during inference, making them less suited for genres with varied timbres and effects. This paper offers a proof-of-concept for a self-supervised music source separation system that eliminates the need for audio queries at inference time. In the training phase, while it adopts a query-based approach, we introduce a modification by substituting the continuous embedding of query audios with Vector Quantized (VQ) representations. Trained end-to-end with up to N classes as determined by the VQ's codebook size, the model seeks to effectively categorise instrument classes. During inference, the input is partitioned into N sources, with some potentially left unutilized based on the mix's instrument makeup. This methodology suggests an alternative avenue for considering source separation across diverse music genres. We provide examples and additional results online.
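To make the training-time substitution above concrete, the following is a minimal sketch that quantizes a continuous query embedding against a learnable codebook of N entries (one per candidate source category) using a straight-through estimator. The class name, codebook size, and embedding dimension are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: snaps a continuous query embedding to the nearest of
    N codebook entries, so each entry can come to represent a source category."""

    def __init__(self, num_codes: int = 8, dim: int = 128):  # sizes are assumptions
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous embedding of the query audio
        distances = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        idx = distances.argmin(dim=-1)                      # nearest code = category id
        z_q = self.codebook(idx)                            # quantized embedding
        # Straight-through estimator: gradients bypass the non-differentiable argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx

# Example: quantize a batch of 4 query embeddings.
vq = VectorQuantizer()
z_q, categories = vq(torch.randn(4, 128))
print(z_q.shape, categories)
```

At inference, one could enumerate the N codes instead of encoding a query, which mirrors how the paper's system partitions a mix into up to N sources without audio queries.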

Adapting pretrained speech model for Mandarin lyrics transcription and alignment

  • paper_url: http://arxiv.org/abs/2311.12488
  • repo_url: https://github.com/navi0105/lyricalignment
  • paper_authors: Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang
  • for: This work targets automatic lyrics transcription and alignment, tasks that have seen significant performance gains in recent years but have mostly been studied for English, where large-scale datasets are available.
  • methods: The authors adapt a pretrained Whisper model and fine-tune it on a monophonic Mandarin singing dataset, using data augmentation and a source separation model to address data scarcity (a fine-tuning sketch follows this entry).
  • results: On a polyphonic Mandarin dataset, the method achieves a character error rate below 18% for lyrics transcription and a mean absolute error of 0.071 seconds for lyrics alignment, demonstrating that adapting a pretrained speech model works well in low-resource scenarios.
    Abstract The tasks of automatic lyrics transcription and lyrics alignment have witnessed significant performance improvements in the past few years. However, most of the previous works only focus on English, for which large-scale datasets are available. In this paper, we address lyrics transcription and alignment of polyphonic Mandarin pop music in a low-resource setting. To deal with the data scarcity issue, we adapt a pretrained Whisper model and fine-tune it on a monophonic Mandarin singing dataset. With the use of data augmentation and a source separation model, results show that the proposed method achieves a character error rate of less than 18% on a Mandarin polyphonic dataset for lyrics transcription, and a mean absolute error of 0.071 seconds for lyrics alignment. Our results demonstrate the potential of adapting a pretrained speech model for lyrics transcription and alignment in low-resource scenarios.
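The sketch below shows one way to fine-tune a pretrained Whisper checkpoint on (vocals, lyrics) pairs via the Hugging Face transformers API; the checkpoint name, 16 kHz sampling rate, and placeholder data are assumptions and need not match the authors' setup (see the linked repository for their actual code).

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Checkpoint size is an assumption; the paper fine-tunes a pretrained Whisper model.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One (vocals, lyrics) pair; real data would come from a monophonic Mandarin
# singing dataset, with source separation applied to polyphonic mixtures first.
vocals = torch.randn(16000 * 5).numpy()                       # 5 s at 16 kHz (placeholder)
inputs = processor(vocals, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("placeholder lyrics", return_tensors="pt").input_ids

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()                                       # one fine-tuning step (optimizer omitted)
```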

HPCNeuroNet: Advancing Neuromorphic Audio Signal Processing with Transformer-Enhanced Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2311.12449
  • repo_url: None
  • paper_authors: Murat Isik, Hiruna Vishwamith, Kayode Inadagbo, I. Can Dikmen
  • for: This work develops HPCNeuroNet, a neuromorphic audio processing architecture that integrates the strengths of Spiking Neural Networks (SNNs), Transformers, and high-performance computing (HPC) to process diverse human vocal recordings across multiple languages and noise backgrounds.
  • methods: Using the Intel N-DNS dataset, the architecture applies the Short-Time Fourier Transform (STFT) for time-frequency representation, Transformer embeddings for dense vector generation, and SNN encoding/decoding mechanisms for spike train conversion (a front-end sketch follows this entry).
  • results: Implemented on the Xilinx VU37P HBM FPGA platform, the proposed accelerator achieves a throughput of 71.11 Giga-Operations Per Second (GOP/s) at 100 MHz with 3.55 W on-chip power consumption, showing clear advantages in energy efficiency over off-the-shelf devices and recent state-of-the-art implementations. A design-space exploration further provides guidance on optimizing core capacities for different audio processing tasks.
    Abstract This paper presents a novel approach to neuromorphic audio processing by integrating the strengths of Spiking Neural Networks (SNNs), Transformers, and high-performance computing (HPC) into the HPCNeuroNet architecture. Utilizing the Intel N-DNS dataset, we demonstrate the system's capability to process diverse human vocal recordings across multiple languages and noise backgrounds. The core of our approach lies in the fusion of the temporal dynamics of SNNs with the attention mechanisms of Transformers, enabling the model to capture intricate audio patterns and relationships. Our architecture, HPCNeuroNet, employs the Short-Time Fourier Transform (STFT) for time-frequency representation, Transformer embeddings for dense vector generation, and SNN encoding/decoding mechanisms for spike train conversions. The system's performance is further enhanced by leveraging the computational capabilities of NVIDIA's GeForce RTX 3060 GPU and Intel's Core i9 12900H CPU. Additionally, we introduce a hardware implementation on the Xilinx VU37P HBM FPGA platform, optimizing for energy efficiency and real-time processing. The proposed accelerator achieves a throughput of 71.11 Giga-Operations Per Second (GOP/s) with a 3.55 W on-chip power consumption at 100 MHz. The comparison results with off-the-shelf devices and recent state-of-the-art implementations illustrate that the proposed accelerator has obvious advantages in terms of energy efficiency and design flexibility. Through design-space exploration, we provide insights into optimizing core capacities for audio tasks. Our findings underscore the transformative potential of integrating SNNs, Transformers, and HPC for neuromorphic audio processing, setting a new benchmark for future research and applications.
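As a toy illustration of the front end described in the abstract, the sketch below computes a magnitude STFT and rate-encodes the frames into Bernoulli spike trains for an SNN to consume; the FFT size, hop length, number of time steps, and the rate-coding scheme itself are illustrative assumptions rather than the HPCNeuroNet implementation.

```python
import torch

def stft_frontend(wave: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Magnitude STFT frames as a time-frequency front end (parameters are illustrative)."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs().transpose(-1, -2)          # (frames, freq_bins)

def rate_encode(frames: torch.Tensor, num_steps: int = 16) -> torch.Tensor:
    """Toy rate coding: normalized magnitudes become Bernoulli spike probabilities,
    yielding one spike train per time-frequency bin."""
    p = frames / (frames.max() + 1e-8)           # probabilities in [0, 1]
    return torch.bernoulli(p.unsqueeze(0).expand(num_steps, *p.shape))

wave = torch.randn(16000)                        # 1 s of audio at 16 kHz (placeholder)
spikes = rate_encode(stft_frontend(wave))
print(spikes.shape)                              # (num_steps, frames, freq_bins)
```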

Equipping Pretrained Unconditional Music Transformers with Instrument and Genre Controls

  • paper_url: http://arxiv.org/abs/2311.12257
  • repo_url: None
  • paper_authors: Weihan Xu, Julian McAuley, Shlomo Dubnov, Hao-Wen Dong
  • for: This work examines the effectiveness of the "pretraining-and-finetuning" paradigm for symbolic music generation.
  • methods: Using 1.5 million songs sourced from the MuseScore forum, the authors first pretrain a large unconditional transformer model, then propose a simple technique that equips it with instrument and genre control tokens through finetuning, yielding improved high-level controllability and expressiveness (a conditioning sketch follows this entry).
  • results: Experiments show that the proposed model can generate symbolic music with user-specified instruments and genre, and a subjective listening test finds it outperforms the pretrained baseline in coherence, harmony, arrangement, and overall quality.
    Abstract The ''pretraining-and-finetuning'' paradigm has become a norm for training domain-specific models in natural language processing and computer vision. In this work, we aim to examine this paradigm for symbolic music generation through leveraging the largest ever symbolic music dataset sourced from the MuseScore forum. We first pretrain a large unconditional transformer model using 1.5 million songs. We then propose a simple technique to equip this pretrained unconditional music transformer model with instrument and genre controls by finetuning the model with additional control tokens. Our proposed representation offers improved high-level controllability and expressiveness against two existing representations. The experimental results show that the proposed model can successfully generate music with user-specified instruments and genre. In a subjective listening test, the proposed model outperforms the pretrained baseline model in terms of coherence, harmony, arrangement and overall quality.
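To make the control mechanism concrete, the sketch below builds a decoding prompt by prepending genre and instrument control tokens to an event sequence, mirroring the finetuning recipe at a high level; the vocabulary indices, token names, and generate call are hypothetical, not the authors' actual representation.

```python
# Hypothetical vocabulary: the actual event representation and token ids are the paper's own.
BOS = 1
GENRE_TOKENS = {"pop": 3, "jazz": 4}
INSTRUMENT_TOKENS = {"piano": 10, "violin": 11, "drums": 12}

def build_prompt(genre: str, instruments: list[str]) -> list[int]:
    """Prepend genre and instrument control tokens so an autoregressive music
    transformer decodes a piece conditioned on the requested instrumentation."""
    return [BOS, GENRE_TOKENS[genre]] + [INSTRUMENT_TOKENS[i] for i in instruments]

prompt = build_prompt("jazz", ["piano", "drums"])
print(prompt)                        # [1, 4, 10, 12]
# events = model.generate(prompt)    # hypothetical decoding call; model not defined here
```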