results: The system ranked first place in the MSS track of the Sound Demixing Challenge (SDX23), and a smaller version achieves state-of-the-art results on MUSDB18HQ without extra training data, with an average SDR of 9.80 dB.
Abstract
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, and drums. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first place in the MSS track of the Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve state-of-the-art results without extra training data, with an average SDR of 9.80 dB.
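For intuition, here is a minimal NumPy sketch of the two ingredients named above: a band-split projection of a complex spectrogram into per-band tokens, and rotary position embedding applied along the time axis. The band edges, token dimension, and random projection matrices are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of band-split tokenization and RoPE; band edges, dimensions, and
# the random "projection" weights are illustrative assumptions.
import numpy as np

def band_split(spec, band_edges, dim=64, rng=np.random.default_rng(0)):
    """Project each frequency band of a complex spectrogram (time x freq)
    into a fixed-size real token: returns (time, n_bands, dim)."""
    outs = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = spec[:, lo:hi]                                   # (T, hi-lo) complex
        feats = np.concatenate([band.real, band.imag], axis=1)  # (T, 2*(hi-lo))
        W = rng.standard_normal((feats.shape[1], dim)) * 0.02   # per-band projection
        outs.append(feats @ W)
    return np.stack(outs, axis=1)                               # (T, n_bands, dim)

def rope(x):
    """Rotary position embedding: rotate consecutive feature pairs of x
    (seq_len, dim) by position-dependent angles (Su et al., 2021)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # per-pair rotation frequency
    ang = np.outer(np.arange(seq), freqs)              # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

T, F = 100, 1025
rng = np.random.default_rng(1)
spec = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
tokens = band_split(spec, band_edges=[0, 64, 128, 256, 512, 1025])
print(tokens.shape)              # (100, 5, 64)
print(rope(tokens[:, 0]).shape)  # RoPE along the time (inner-band) axis
```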
BWSNet: Automatic Perceptual Assessment of Audio Signals
for: This paper proposes a model that can be trained from raw human judgments obtained through a Best-Worst scaling (BWS) experiment, which maps sound samples into an embedded space that represents the perception of a studied attribute.
methods: The model uses a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task.
results: The results show that the structure of the latent space is faithful to human judgments, based on tests on two BWS studies investigating the perception of speech social attitudes and timbral qualities.
Abstract
This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.
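As one concrete and hedged reading of "trial-wise ordinal relations as distance comparisons": if the best and worst items of a trial should be the most separated pair in the embedding space, each trial yields a set of hinge constraints. The sketch below encodes that reading; the interpretation, the margin, and the toy 1-D embeddings are assumptions, not BWSNet's actual cost functions.

```python
# Hedged sketch: one BWS trial turned into distance comparisons, assuming
# the best/worst pair must be the most separated pair in the trial.
import numpy as np

def bws_trial_loss(emb, trial, best, worst, margin=0.1):
    """Hinge loss: ||f(best)-f(worst)|| should exceed every other
    pairwise distance within the trial by `margin`."""
    d_bw = np.linalg.norm(emb[best] - emb[worst])
    loss = 0.0
    for i in trial:
        for j in trial:
            if i < j and {i, j} != {best, worst}:
                d_ij = np.linalg.norm(emb[i] - emb[j])
                loss += max(0.0, d_ij + margin - d_bw)
    return loss

emb = np.array([[0.0], [0.4], [0.9], [1.0]])  # toy 1-D embeddings of 4 sounds
print(bws_trial_loss(emb, trial=[0, 1, 2, 3], best=3, worst=0))  # 0.0: order respected
```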
Symbolic Music Representations for Classification Tasks: A Systematic Evaluation
results: Our systematic evaluation shows that the graph representation performs strongly across three piece-level classification tasks while being cheaper to train.
Abstract
Music Information Retrieval (MIR) has seen a recent surge in deep learning-based approaches, which often involve encoding symbolic music (i.e., music represented in terms of discrete note events) in an image-like or language-like fashion. However, symbolic music is neither an image nor a sentence, and research in the symbolic domain lacks a comprehensive overview of the different available representations. In this paper, we investigate matrix (piano roll), sequence, and graph representations and their corresponding neural architectures, in combination with symbolic scores and performances, on three piece-level classification tasks. We also introduce a novel graph representation for symbolic performances and explore the capability of graph representations in global classification tasks. Our systematic evaluation shows advantages and limitations of each input representation. Our results suggest that the graph representation, as the newest and least explored among the three approaches, exhibits promising performance while being more lightweight to train.
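To ground the matrix representation mentioned above, here is a minimal sketch that rasterizes discrete note events onto a pitch-by-time piano-roll grid; the time resolution and event format are illustrative assumptions (the paper's novel graph representation is not reproduced here).

```python
# Minimal piano-roll construction from note events; fs and the
# (pitch, onset, duration) event format are illustrative choices.
import numpy as np

def piano_roll(notes, fs=50, n_pitches=128):
    """notes: list of (midi_pitch, onset_sec, duration_sec) tuples."""
    end = max(on + dur for _, on, dur in notes)
    roll = np.zeros((n_pitches, int(np.ceil(end * fs)) + 1), dtype=np.uint8)
    for pitch, on, dur in notes:
        roll[pitch, int(on * fs):int((on + dur) * fs)] = 1
    return roll

# C-major triad followed by a single melody note
roll = piano_roll([(60, 0.0, 1.0), (64, 0.0, 1.0), (67, 0.0, 1.0), (72, 1.0, 0.5)])
print(roll.shape, roll[60, :5])  # (128, 76) [1 1 1 1 1]
```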
Employing Real Training Data for Deep Noise Suppression
results: Trained on real data with PESQ-DNN, the DNS model outperforms reference methods trained only on synthetic data, improving over the baseline by 0.32 PESQ points on synthetic test data and also beating the baseline by 0.05 DNSMOS points on real test data.
Abstract
Most deep noise suppression (DNS) models are trained with reference-based losses requiring access to clean speech. However, sometimes an additive microphone model is insufficient for real-world applications. Accordingly, ways to use real training data in supervised learning for DNS models promise to reduce a potential training/inference mismatch. Employing real data for DNS training requires either generative approaches or a reference-free loss without access to the corresponding clean speech. In this work, we propose to employ an end-to-end non-intrusive deep neural network (DNN), named PESQ-DNN, to estimate perceptual evaluation of speech quality (PESQ) scores of enhanced real data. It provides a reference-free perceptual loss for employing real data during DNS training, maximizing the PESQ scores. Furthermore, we use an epoch-wise alternating training protocol, updating the DNS model on real data, followed by a PESQ-DNN update on synthetic data. The DNS model trained with the PESQ-DNN employing real data outperforms all reference methods employing only synthetic training data. On synthetic test data, our proposed method outperforms the Interspeech 2021 DNS Challenge baseline by a significant 0.32 PESQ points. Both on synthetic and real test data, the proposed method beats the baseline by 0.05 DNSMOS points, although PESQ-DNN optimizes for a different perceptual metric.
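The alternating protocol can be summarized in a short schematic. The stub models and placeholder losses below are assumptions made so the script runs; only the alternation pattern (DNS update on real data under the frozen PESQ-DNN loss, then PESQ-DNN update on synthetic pairs where true PESQ is computable) follows the text.

```python
# Schematic of epoch-wise alternating training; Stub and true_pesq are
# placeholders, not the paper's networks or the actual PESQ algorithm.
import numpy as np

class Stub:
    """Placeholder for a trainable DNN (no real learning happens here)."""
    def enhance(self, noisy): return noisy * 0.9
    def predict(self, speech): return float(np.clip(4.5 - speech.std(), 1.0, 4.5))
    def step(self, loss): pass  # a gradient step would go here

def true_pesq(enhanced, clean):
    # Placeholder for intrusive PESQ, which needs the clean reference.
    return float(np.clip(4.5 - np.abs(enhanced - clean).mean(), 1.0, 4.5))

def train_alternating(dns, pesq_dnn, real_noisy, synth_pairs, epochs=2):
    for _ in range(epochs):
        for noisy in real_noisy:                    # phase 1: real data, DNS update
            enhanced = dns.enhance(noisy)
            dns.step(-pesq_dnn.predict(enhanced))   # maximize predicted PESQ
        for noisy, clean in synth_pairs:            # phase 2: synthetic, PESQ-DNN update
            enhanced = dns.enhance(noisy)
            err = pesq_dnn.predict(enhanced) - true_pesq(enhanced, clean)
            pesq_dnn.step(err ** 2)

rng = np.random.default_rng(0)
real = [rng.standard_normal(16000) for _ in range(3)]
synth = [(rng.standard_normal(16000), rng.standard_normal(16000)) for _ in range(3)]
train_alternating(Stub(), Stub(), real, synth)
```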
Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
results: Extensive experiments on the MSP-Podcast corpus show that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
Abstract
There are individual differences in expressive behaviors driven by cultural norms and personality. This between-person variation can result in reduced emotion recognition performance. Therefore, personalization is an important step in improving the generalization and robustness of speech emotion recognition. In this paper, to achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method to compensate for the label distribution shifts by finding similar speakers and leveraging their label distributions from the training set. Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
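One hedged sketch of the compensation step: locate the training speakers whose learned embeddings are nearest to the test speaker and re-anchor raw predictions to their label statistics. The distance metric, the choice of k, and the mean/std correction below are my assumptions, not necessarily the paper's exact procedure.

```python
# Hedged sketch of label-distribution compensation via similar speakers;
# metric, k, and the z-score re-anchoring are illustrative assumptions.
import numpy as np

def compensate(test_emb, train_embs, train_label_stats, preds, k=3):
    """Shift/scale raw predictions toward the label statistics of the k
    training speakers most similar to the test speaker."""
    d = np.linalg.norm(train_embs - test_emb, axis=1)
    nearest = np.argsort(d)[:k]
    mu = np.mean([train_label_stats[i][0] for i in nearest])
    sigma = np.mean([train_label_stats[i][1] for i in nearest])
    z = (preds - preds.mean()) / (preds.std() + 1e-8)
    return mu + sigma * z  # predictions re-anchored to similar speakers

rng = np.random.default_rng(0)
train_embs = rng.standard_normal((10, 8))  # learned speaker embeddings
stats = [(rng.uniform(-1, 1), rng.uniform(0.2, 1)) for _ in range(10)]  # (mean, std) of valence labels
print(compensate(rng.standard_normal(8), train_embs, stats, rng.standard_normal(20)))
```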
The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations
methods: The paper uses recordings by professional pianist Roland Batik and aligns them, note by note, with a current standard edition of the corresponding Mozart scores (the New Mozart Edition).
results: The paper creates a high-precision piano performance dataset for studying the relationship between expressive performance and musical structure, and presents two exploratory experiments demonstrating its usefulness.
Abstract
We present the Batik-plays-Mozart Corpus, a piano performance dataset combining professional Mozart piano sonata performances with expert-labelled scores at a note-precise level. The performances originate from a recording by Viennese pianist Roland Batik on a computer-monitored Bösendorfer grand piano, and are available both as MIDI files and audio recordings. They have been precisely aligned, note by note, with a current standard edition of the corresponding scores (the New Mozart Edition) in such a way that they can further be connected to the musicological annotations (harmony, cadences, phrases) on these scores that were recently published by Hentschel et al. (2021). The result is a high-quality, high-precision corpus mapping scores and musical structure annotations to precise note-level professional performance information. As the first of its kind, it can serve as a valuable resource for studying various facets of expressive performance and their relationship with structural aspects. In the paper, we outline the curation process of the alignment and conduct two exploratory experiments to demonstrate its usefulness in analyzing expressive performance.
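To illustrate what the note-precise linking could look like as data, here is a hypothetical record type connecting a score note, its performed counterpart, and an inherited annotation; the field names and schema are illustrative, not the corpus's actual format.

```python
# Hypothetical aligned-note record; all field names are illustrative.
from dataclasses import dataclass

@dataclass
class AlignedNote:
    score_note_id: str        # note in the New Mozart Edition score
    pitch: int                # MIDI pitch, shared by score and performance
    score_onset_beats: float  # symbolic onset from the score
    perf_onset_sec: float     # measured onset from the Batik MIDI recording
    perf_velocity: int        # measured MIDI velocity
    harmony_label: str        # annotation inherited from Hentschel et al. (2021)

note = AlignedNote("n0001", 60, 0.0, 0.512, 64, "I")
print(note)
```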
PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
results: We evaluate the model on singing voice and musical instrument pitch estimation and show that it generalizes across tasks and datasets while remaining lightweight and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods.
Abstract
In this paper, we address the problem of pitch estimation using Self-Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight (< 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
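The property the objective exploits can be checked in a few lines: transposing by k semitones approximately shifts the CQT along its bin axis, so an equivariant estimator's output distribution should shift by the same k. The toy softmax "model" and one-bin-per-semitone resolution below are assumptions; the real model is the Siamese network with Toeplitz layers described above.

```python
# Minimal check of transposition equivariance on a CQT; the toy model and
# bins-per-semitone choice are illustrative assumptions.
import numpy as np

BINS_PER_SEMITONE = 1

def transpose_cqt(cqt, k):
    """Approximate a k-semitone pitch shift as a roll along the CQT bin axis
    (wrap-around at the edges is an artifact of the toy setup)."""
    return np.roll(cqt, k * BINS_PER_SEMITONE, axis=0)

def toy_pitch_model(cqt):
    """Stand-in estimator: softmax over per-bin energy."""
    e = cqt.sum(axis=1)
    e = np.exp(e - e.max())
    return e / e.sum()

rng = np.random.default_rng(0)
cqt = rng.random((84, 50))  # (bins, frames)
p = toy_pitch_model(cqt)
p_shift = toy_pitch_model(transpose_cqt(cqt, 3))
# Equivariance: the two output distributions agree up to a 3-bin shift.
print(np.allclose(np.roll(p, 3), p_shift))  # True
```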