cs.SD - 2023-08-21

LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices

  • paper_url: http://arxiv.org/abs/2308.10682
  • repo_url: None
  • paper_authors: Joerg Schmalenstroeer, Tobias Gburrek, Reinhold Haeb-Umbach
  • for: A test set for evaluating clock synchronization algorithms as well as meeting separation, diarization, and transcription systems on ad-hoc wireless acoustic sensor networks.
  • methods: Nine devices (five smartphones and four microphone arrays) with unsynchronized sampling clocks record a total of 29 channels.
  • results: The LibriWASN data set, recorded in two different rooms and complemented with ground-truth diarization information of who speaks when; due to its similarity to LibriCSS, meeting transcription systems developed for LibriCSS can readily be tested on it.
    Abstract We present LibriWASN, a data set whose design follows closely the LibriCSS meeting recognition data set, with the marked difference that the data is recorded with devices that are randomly positioned on a meeting table and whose sampling clocks are not synchronized. Nine different devices, five smartphones with a single recording channel and four microphone arrays, are used to record a total of 29 channels. Other than that, the data set follows closely the LibriCSS design: the same LibriSpeech sentences are played back from eight loudspeakers arranged around a meeting table and the data is organized in subsets with different percentages of speech overlap. LibriWASN is meant as a test set for clock synchronization algorithms, meeting separation, diarization and transcription systems on ad-hoc wireless acoustic sensor networks. Due to its similarity to LibriCSS, meeting transcription systems developed for the former can readily be tested on LibriWASN. The data set is recorded in two different rooms and is complemented with ground-truth diarization information of who speaks when.
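    Since the recordings are not sample-synchronous, a natural first experiment on LibriWASN is sampling rate offset (SRO) estimation between devices. The following is a minimal, hypothetical sketch (not a method from the paper) of how one might estimate the SRO between two channels from the linear drift of the cross-correlation lag over time; the segment length, the FFT-based correlation, and the plain least-squares fit are illustrative choices.

```python
# Hypothetical sketch: estimate the sampling rate offset (SRO) between two
# asynchronously recorded channels by tracking how the lag of the
# cross-correlation maximum drifts over successive segments.
import numpy as np

def estimate_sro_ppm(ref, sig, seg_len=2**14, hop=2**14):
    """Estimate the SRO of `sig` relative to `ref` in parts per million."""
    lags, centers = [], []
    n_segs = (min(len(ref), len(sig)) - seg_len) // hop
    for i in range(n_segs):
        a = ref[i * hop: i * hop + seg_len]
        b = sig[i * hop: i * hop + seg_len]
        # Circular cross-correlation via the FFT (plain, not PHAT-weighted).
        xcorr = np.fft.irfft(np.fft.rfft(a) * np.conj(np.fft.rfft(b)))
        lag = int(np.argmax(np.abs(xcorr)))
        if lag > seg_len // 2:          # map to a signed lag
            lag -= seg_len
        lags.append(lag)
        centers.append(i * hop + seg_len / 2)
    # A constant clock-rate mismatch makes the lag drift linearly with time;
    # the slope of a least-squares line fit is the SRO (samples per sample).
    slope = np.polyfit(centers, lags, 1)[0]
    return slope * 1e6
```

    In practice one would run this per device pair and compensate the estimated drift before feeding the channels to separation or diarization systems.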

An Anchor-Point Based Image-Model for Room Impulse Response Simulation with Directional Source Radiation and Sensor Directivity Patterns

  • paper_url: http://arxiv.org/abs/2308.10543
  • repo_url: None
  • paper_authors: Chao Pan, Lei Zhang, Yilong Lu, Jilu Jin, Lin Qiu, Jingdong Chen, Jacob Benesty
  • for: Extending the image model method so that it can be used in a wider range of applications.
  • methods: Builds on the image model method and develops an anchor-point image model (APIM) for simulating impulse responses; APIM accounts for both the source radiation and the sensor directivity patterns, and introduces anchor points on the real sources to determine the orientations of the virtual sources.
  • results: An algorithm for generating room impulse responses with APIM that takes the directional pattern functions, fractional time delays, and the computational complexity into account; the model and algorithm can be used in various acoustic problems to simulate room acoustics and to improve and evaluate processing algorithms.
    Abstract The image model method has been widely used to simulate room impulse responses and the endeavor to adapt this method to different applications has also piqued great interest over the last few decades. This paper attempts to extend the image model method and develops an anchor-point-image-model (APIM) approach as a solution for simulating impulse responses by including both the source radiation and sensor directivity patterns. To determine the orientations of all the virtual sources, anchor points are introduced to real sources, which subsequently lead to the determination of the orientations of the virtual sources. An algorithm is developed to generate room impulse responses with APIM by taking into account the directional pattern functions, fractional time delays, as well as the computational complexity. The developed model and algorithms can be used in various acoustic problems to simulate room acoustics and improve and evaluate processing algorithms.
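    As background, the following is a minimal shoebox image-method sketch in the spirit of APIM: each virtual source's contribution is weighted by a source directivity pattern evaluated for the mirrored source orientation (along each axis, the orientation component flips when the reflection parity along that axis is odd). This is a simplified illustration rather than the paper's algorithm: it assumes a single frequency-independent reflection coefficient, a first-order cardioid source, an omnidirectional sensor, and nearest-sample delays in place of the fractional delays the paper handles.

```python
import numpy as np
from itertools import product

def cardioid(cos_theta):
    # First-order cardioid gain for the cosine of the angle between the
    # source orientation and the direction of radiation.
    return 0.5 * (1.0 + cos_theta)

def image_rir(room, src, rcv, src_orient, beta=0.8, order=2,
              fs=16000, c=343.0, rir_len=4096):
    room, src, rcv = (np.asarray(v, dtype=float) for v in (room, src, rcv))
    o = np.asarray(src_orient, dtype=float)
    o /= np.linalg.norm(o)
    h = np.zeros(rir_len)
    for n in product(range(-order, order + 1), repeat=3):
        for u in product((0, 1), repeat=3):
            n_a, u_a = np.array(n), np.array(u)
            img = (1 - 2 * u_a) * src + 2 * n_a * room  # virtual source position
            n_refl = np.abs(n_a - u_a).sum() + np.abs(n_a).sum()
            d = np.linalg.norm(rcv - img)
            # Along each axis the orientation flips iff the reflection count
            # along that axis is odd, which for this indexing is exactly u.
            o_img = (1 - 2 * u_a) * o
            g = cardioid(np.dot(o_img, (rcv - img) / d))
            k = int(round(d / c * fs))  # nearest-sample delay (no fractional delay)
            if k < rir_len:
                h[k] += g * beta ** n_refl / (4 * np.pi * d)
    return h

h = image_rir(room=[6.0, 5.0, 3.0], src=[2.0, 3.0, 1.5],
              rcv=[4.0, 2.0, 1.2], src_orient=[1.0, 0.0, 0.0])
```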

Implicit Self-supervised Language Representation for Spoken Language Diarization

  • paper_url: http://arxiv.org/abs/2308.10470
  • repo_url: None
  • paper_authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna
  • for: Developing a spoken language diarization (LD) system, to be used as a pre-processing step in code-switched scenarios, in order to improve speech recognition and language conversion performance.
  • methods: Three frameworks, based on (1) fixed segmentation, (2) change-point-based segmentation and (3) an E2E approach, are used to perform LD; all of them are implicit frameworks, which makes them easier to adapt to low/zero-resource languages.
  • results: Using x-vectors as the implicit language representation achieves performance on par with an explicit LD system. The best implicit LD performance with the E2E framework is a JER of 6.38, which degrades to 60.4 on the practical Microsoft CS (MSCS) dataset, mainly because the monolingual segment durations of the secondary language are distributed differently in MSCS than in TTSF-LD. Moreover, to avoid segment smoothing, a small value of N is helpful for LD performance.
    Abstract In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, the use of implicit frameworks is preferable to the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by the speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation and (3) E2E are proposed to perform LD. The initial exploration with the synthetic TTSF-LD dataset shows that using x-vectors as the implicit language representation with an appropriate analysis window length ($N$) can achieve on-par performance with explicit LD. The best implicit LD performance of $6.38$ in terms of Jaccard error rate (JER) is achieved by using the E2E framework. However, with the E2E framework the performance of implicit LD degrades to $60.4$ when using the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the distributional difference between the monolingual segment durations of the secondary language in the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the smaller duration of the monolingual segments suggests the use of a small value of $N$. At the same time, with small $N$, the x-vector representation is unable to capture the required language discrimination due to the acoustic similarity, as the same speaker is speaking both languages. Therefore, to resolve the issue, a self-supervised implicit language representation is proposed in this study. In comparison with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieves a JER of $21.8$ using the E2E framework.
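    As a rough illustration of the fixed-segmentation framework, the sketch below mean-pools frame-level features over analysis windows as a stand-in for x-vector (or self-supervised) embeddings and clusters the windows into two language labels. The pooling, the plain 2-means clustering, and the majority-vote smoothing are all illustrative simplifications, not the paper's components.

```python
# Illustrative fixed-segmentation LD: window embeddings (a mean-pooled
# stand-in for x-vectors / self-supervised representations) clustered into
# two language labels with plain 2-means, then mapped back to frames.
import numpy as np

def fixed_segmentation_ld(features, win=100, hop=50, n_iter=20, seed=0):
    """features: (T, D) frame-level features -> (T,) language labels in {0, 1}."""
    embs, spans = [], []
    for start in range(0, len(features) - win + 1, hop):
        embs.append(features[start:start + win].mean(axis=0))
        spans.append((start, start + win))
    embs = np.stack(embs)
    rng = np.random.default_rng(seed)
    centers = embs[rng.choice(len(embs), 2, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((embs[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if (assign == k).any():
                centers[k] = embs[assign == k].mean(axis=0)
    votes = np.zeros((len(features), 2))
    for (start, end), a in zip(spans, assign):
        votes[start:end, a] += 1          # majority vote over overlapping windows
    return votes.argmax(axis=1)
```

    The window length `win` plays the role of $N$: shorter windows track short monolingual segments better but make each embedding less reliable, which is exactly the trade-off described above.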

Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.10428
  • repo_url: None
  • paper_authors: Heyang Xue, Shuai Guo, Pengcheng Zhu, Mengxiao Bi
  • for: Improving diffusion-based acoustic modeling in multi-speaker Text-to-Speech (TTS) scenarios.
  • methods: Uses the Consistent Diffusion Model (CDM) as the generative model and enforces the consistency property of CDM during training, in order to alleviate the drift between the training and sampling distributions of diffusion models.
  • results: Improves multi-speaker TTS performance for the different speakers involved compared to Grad-TTS, even outperforming a fine-tuning approach; audio samples are available at https://welkinyang.github.io/multi-gradspeech/
    Abstract Recent advancements in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example. However, diffusion models suffer from drift between the training and sampling distributions due to imperfect score-matching. The sampling drift problem leads to these approaches struggling in multi-speaker scenarios in practice. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model which introduces the Consistent Diffusion Model (CDM) as a generative modeling approach. We enforce the consistency property of CDM during the training process to alleviate the sampling drift problem in the inference stage, resulting in significant improvements in multi-speaker TTS performance. Our experimental results corroborate that our proposed approach can improve the performance of the different speakers involved in multi-speaker TTS compared to Grad-TTS, even outperforming the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/
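    The consistency property that the paper enforces can be sketched generically: a network f(x_t, t) should produce the same clean-signal estimate from two noise levels on the same trajectory. The snippet below follows the standard consistency-training recipe with an EMA teacher and stop-gradient; the model signature model(x, t), the squared-error distance, and the decay value are assumptions, since the abstract does not spell out Multi-GradSpeech's exact loss, noise schedule, or speaker conditioning.

```python
# Generic consistency-training sketch (EMA teacher + stop-gradient); a
# hedged illustration of the consistency property, not the paper's loss.
import copy
import torch

def consistency_loss(model, ema_model, x0, t_hi, t_lo):
    """x0: clean targets (B, ...); t_hi > t_lo: noise scales, shape (B,)."""
    z = torch.randn_like(x0)
    shape = (-1,) + (1,) * (x0.dim() - 1)
    x_hi = x0 + t_hi.view(shape) * z      # two points on the same trajectory
    x_lo = x0 + t_lo.view(shape) * z
    with torch.no_grad():                 # teacher output is a fixed target
        target = ema_model(x_lo, t_lo)
    return torch.mean((model(x_hi, t_hi) - target) ** 2)

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

class ToyNet(torch.nn.Module):            # stand-in for the acoustic model
    def __init__(self, dim=80):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)
    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

model = ToyNet()
ema_model = copy.deepcopy(model)
x0 = torch.randn(4, 80)                   # e.g. a batch of mel frames
loss = consistency_loss(model, ema_model, x0,
                        t_hi=torch.full((4,), 1.0), t_lo=torch.full((4,), 0.5))
loss.backward()
ema_update(ema_model, model)
```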

Neural Architectures Learning Fourier Transforms, Signal Processing and Much More….

  • paper_url: http://arxiv.org/abs/2308.10388
  • repo_url: None
  • paper_authors: Prateek Verma
  • for: The paper explores the use of neural architectures for learning kernels in the Fourier Transform, specifically for audio signal processing applications.
  • methods: The paper proposes using a neural architecture to learn sinusoidal kernels and discovers various signal-processing properties such as windowing functions, onset detectors, high-pass filters, low-pass filters, modulations, etc. The neural architecture has a comb-filter-like structure on top of the learned kernels.
  • results: The paper shows that the proposed method can learn kernels that are adapted to the problem at hand and discovers various signal-processing properties. The learned kernels can be used for tasks such as audio signal processing, and the method has the potential to be used for other signal processing applications as well.
    Abstract This report will explore and answer fundamental questions about taking Fourier Transforms and tying it with recent advances in AI and neural architecture. One interpretation of the Fourier Transform is decomposing a signal into its constituent components by projecting them onto complex exponentials. Variants exist, such as the discrete cosine transform, which does not operate on the complex domain and projects an input signal onto only cosine functions oscillating at different frequencies. However, this is a fundamental limitation, and it can be suboptimal. The first one is that all kernels are sinusoidal: What if we could have some kernels adapted or learned according to the problem? What if we can use neural architectures for this? We show how one can learn these kernels from scratch for audio signal processing applications. We find that the neural architecture not only learns sinusoidal kernel shapes but discovers all kinds of incredible signal-processing properties, e.g., windowing functions, onset detectors, high-pass filters, low-pass filters, modulations, etc. Further, upon analysis of the filters, we find that the neural architecture has a comb filter-like structure on top of the learned kernels. Comb filters that allow harmonic frequencies to pass through are one of the core building blocks/types of filters, similar to the high-pass, low-pass, and band-pass filters of various traditional signal processing algorithms. Further, the convolution operation with a signal can also be learned from scratch, and we will explore papers in the literature that use this with robust Transformer architectures. Further, we would also explore making the learned kernels content adaptive, i.e., learning different kernels for different inputs.
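    The report's core move, replacing the fixed sinusoidal analysis kernels of a Fourier-style front end with learned ones, can be sketched as a strided 1-D convolution trained end to end. In the sketch below, the kernel count, kernel length, hop size, and log-magnitude readout are illustrative assumptions, not the report's exact configuration.

```python
# A minimal sketch of learned analysis kernels: a strided Conv1d front end
# whose kernels play the role of the (windowed) Fourier basis and are
# trained end to end with whatever task head sits on top.
import torch
import torch.nn as nn

class LearnedFrontEnd(nn.Module):
    def __init__(self, n_kernels=256, kernel_len=400, hop=160):
        super().__init__()
        # Each output channel is one learned kernel; with a DFT this would be
        # a fixed sine/cosine, here it is free to become a windowed sinusoid,
        # an onset detector, a low-pass filter, etc.
        self.analysis = nn.Conv1d(1, n_kernels, kernel_len,
                                  stride=hop, bias=False)

    def forward(self, wav):                       # wav: (batch, samples)
        feats = self.analysis(wav.unsqueeze(1))   # (batch, kernels, frames)
        return torch.log1p(feats.abs())           # magnitude-like features

frontend = LearnedFrontEnd()
spec = frontend(torch.randn(2, 16000))   # a "spectrogram" from learned kernels
print(spec.shape)                        # torch.Size([2, 256, 98])
```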

Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music

  • paper_url: http://arxiv.org/abs/2308.10355
  • repo_url: https://github.com/sunnycyc/plpdp4beat
  • paper_authors: Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang
  • for: Improving beat tracking in Western classical piano music, specifically addressing the limitations of existing post-processing trackers (PPTs) in handling local tempo changes.
  • methods: A new local periodicity-based PPT called predominant local pulse-based dynamic programming (PLPDP) tracking, which combines "predominant local pulses" (PLP) with a dynamic programming (DP) component to jointly consider the locally detected periodicity and the beat activation strength.
  • results: Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of the F1-score for beat tracking on two large datasets of Western classical piano music: from 0.473 to 0.493 on ASAP and from 0.595 to 0.838 on Maz-5.
    Abstract To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called "predominant local pulses" (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).
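    A hypothetical sketch of the PLPDP idea follows: dynamic programming over a beat activation curve in which the transition target at each time step comes from a locally estimated period (as PLP would supply) rather than from one global tempo. The deviation penalty and the search-window tolerance below are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical local-periodicity DP beat tracker: the allowed gap to the
# previous beat is governed by a per-frame period estimate, not a global
# tempo assumption. Assumes local_period[t] >= 1 frame.
import numpy as np

def plpdp_track(activation, local_period, tol=0.4, penalty=2.0):
    """activation: (T,) beat salience; local_period: (T,) period in frames."""
    T = len(activation)
    score = activation.copy()
    prev = np.full(T, -1)
    for t in range(T):
        p = max(1, int(local_period[t]))
        lo = max(0, t - int((1 + tol) * p))
        hi = max(0, t - int((1 - tol) * p))
        if hi <= lo:
            continue
        cand = np.arange(lo, hi)
        # Penalize deviation of the actual inter-beat gap from the local period.
        dev = np.abs((t - cand) - p) / p
        trans = score[cand] - penalty * dev ** 2
        best = np.argmax(trans)
        if trans[best] > 0:
            score[t] = activation[t] + trans[best]
            prev[t] = cand[best]
    # Backtrack from the best-scoring frame to recover the beat sequence.
    beats = [int(np.argmax(score))]
    while prev[beats[-1]] >= 0:
        beats.append(int(prev[beats[-1]]))
    return np.array(beats[::-1])
```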