cs.SD - 2023-11-30

Subspace Hybrid MVDR Beamforming for Augmented Hearing

  • paper_url: http://arxiv.org/abs/2311.18689
  • repo_url: None
  • paper_authors: Sina Hafezi, Alastair H. Moore, Pierre H. Guiraud, Patrick A. Naylor, Jacob Donley, Vladimir Tourbabin, Thomas Lunner
  • for: Improving the performance of augmented reality audio captured with head-worn microphone arrays.
  • methods: A multi-channel speech enhancement algorithm that combines the adaptability of signal-dependent beamformers with the computational efficiency and robust performance of signal-independent super-directive beamformers.
  • results: Evaluated on real-world recordings and simulations of a cocktail-party scenario; compared with the baseline super-directive beamformer, the proposed algorithm shows significant improvements in noise suppression, speech intelligibility and speech quality.
    Abstract Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario - be it real-world or simulated - is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an on-going challenge. This is due to the violation of the typically required assumptions on the noise field caused by, for example, rapid variations resulting from complex acoustic environments, and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter to remove any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement by the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, and similar speech intelligibility and quality, compared to a parametric dictionary.
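
As a rough illustration of stage (i), the dictionary-based hybrid beamformer, here is a minimal NumPy sketch that precomputes one MVDR weight vector per candidate noise-field covariance model and selects, per frequency bin, the dictionary entry whose output power is lowest. The minimum-output-power selection rule, the dictionary contents, and all names are assumptions rather than the paper's implementation, and the stage (ii) subspace post-filter is omitted.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for one frequency bin."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (np.conj(d) @ Rinv_d)

def hybrid_beamform(X, steering, noise_models):
    """
    X:            (n_mics, n_frames) STFT frames of one frequency bin.
    steering:     (n_mics,) steering vector toward the look direction.
    noise_models: list of (n_mics, n_mics) noise covariance models (the
                  'dictionary'; e.g. a diffuse-field model plus directional
                  interferers at candidate angles -- an assumption here).
    Returns the output of the dictionary entry with the least power.
    """
    # One signal-independent weight vector per noise model, precomputable.
    W = np.stack([mvdr_weights(R, steering) for R in noise_models])
    outputs = np.conj(W) @ X                       # (n_models, n_frames)
    best = np.argmin(np.mean(np.abs(outputs) ** 2, axis=1))
    return outputs[best]
```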

Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm

  • paper_url: http://arxiv.org/abs/2311.18604
  • repo_url: None
  • paper_authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot
  • for: Improving automatic analysis methods for the Music Structure Analysis (MSA) field.
  • methods: Extends the Correlation Block-Matching (CBM) algorithm, a dynamic programming method that segments self-similarity matrices computed from the feature representation of an audio signal, with time sampled at the bar scale.
  • results: In optimal conditions, the proposed algorithm is competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions; in addition, the algorithm is open-source and highly customizable.
    Abstract Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to "chorus", "verse", "solo", etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm introduced by Marmoret et al. (2020, 2022b). The CBM algorithm is a dynamic programming algorithm that segments self-similarity matrices, which are a standard description used in MSA and in numerous other applications. In this work, self-similarity matrices are computed from the feature representation of an audio signal and time is sampled at the bar-scale. This study examines three different standard similarity functions for the computation of self-similarity matrices. Results show that, in optimal conditions, the proposed algorithm achieves a level of performance which is competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions. In addition, the algorithm is made open-source and is highly customizable.
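
Since the CBM algorithm is described as dynamic programming over a bar-scale self-similarity matrix, the sketch below shows that general recipe: cosine self-similarity over bar-wise features, then a DP that maximizes the sum of per-segment block scores. The length-weighted off-diagonal mean used as the block score is a stand-in for CBM's correlation-based criterion, and `max_len` is an assumed constraint, not a parameter from the paper.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity of bar-wise features, shape (n_bars, dim)."""
    normed = features / np.maximum(
        np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    return normed @ normed.T

def block_score(block):
    """Length-weighted mean of the off-diagonal similarities inside a
    diagonal block; singletons score 0, and merging dissimilar bars
    pulls the score down through low cross-similarities."""
    size = block.shape[0]
    if size < 2:
        return 0.0
    return (block.sum() - np.trace(block)) / (size - 1)

def segment(ssm, max_len=32):
    """DP over boundaries maximizing the total block score."""
    n = ssm.shape[0]
    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = best[i] + block_score(ssm[i:j, i:j])
            if s > best[j]:
                best[j], prev[j] = s, i
    bounds, j = [], n
    while j > 0:                      # backtrack segment boundaries
        bounds.append(j)
        j = prev[j]
    return sorted(bounds)
```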

String Sound Synthesizer on GPU-accelerated Finite Difference Scheme

  • paper_url: http://arxiv.org/abs/2311.18505
  • repo_url: None
  • paper_authors: Jin Woo Lee, Min Jun Choi, Kyogu Lee
  • for: Introducing a nonlinear string sound synthesizer based on a finite difference simulation of the dynamic behavior of strings under various excitations.
  • methods: A versatile string simulation engine capable of stochastic parameterization, covering fundamental frequency modulation, stiffness, tension, frequency-dependent loss, and excitation control.
  • results: The open-source physical-model simulator benefits the audio signal processing community and serves as a novel dataset construction tool for neural-network-based audio synthesis; the PyTorch implementation runs on both CPU and GPU, and GPU utilization parallelizes operations across spatial and batch dimensions, further enhancing its utility as a data generator.
    Abstract This paper introduces a nonlinear string sound synthesizer, based on a finite difference simulation of the dynamic behavior of strings under various excitations. The presented synthesizer features a versatile string simulation engine capable of stochastic parameterization, encompassing fundamental frequency modulation, stiffness, tension, frequency-dependent loss, and excitation control. This open-source physical model simulator not only benefits the audio signal processing community but also contributes to the burgeoning field of neural network-based audio synthesis by serving as a novel dataset construction tool. Implemented in PyTorch, this synthesizer offers flexibility, facilitating both CPU and GPU utilization, thereby enhancing its applicability as a simulator. GPU utilization expedites computation by parallelizing operations across spatial and batch dimensions, further enhancing its utility as a data generator.
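
As a hedged illustration of the finite-difference approach, the sketch below simulates a plucked ideal string with simple damping (u_tt = c^2 u_xx - 2*sigma*u_t) in PyTorch. The paper's model additionally covers stiffness, tension and frequency-dependent loss, none of which appear here; the pluck shape, loss value and pickup position are illustrative choices.

```python
import torch

def simulate_string(f0=110.0, sr=48000, dur=1.0, length=1.0, sigma=1.5):
    """Explicit finite-difference scheme for the 1D damped wave
    equation u_tt = c^2 u_xx - 2 sigma u_t with fixed ends."""
    k = 1.0 / sr                      # time step
    c = 2.0 * length * f0             # wave speed from the fundamental
    h = c * k                         # grid spacing at the CFL limit
    n_pts = int(length / h) + 1
    lam2 = (c * k / h) ** 2           # Courant number squared (= 1 here)

    u_prev = torch.zeros(n_pts)
    u = torch.zeros(n_pts)
    # Pluck: triangular initial displacement, peak near 30% of length.
    peak = int(0.3 * n_pts)
    u[:peak] = torch.linspace(0, 1, peak)
    u[peak:] = torch.linspace(1, 0, n_pts - peak)
    u_prev.copy_(u)                   # zero initial velocity

    out = torch.zeros(int(dur * sr))
    read = int(0.8 * n_pts)           # virtual pickup position
    a = 1.0 / (1.0 + sigma * k)
    for n in range(out.numel()):
        lap = u[2:] - 2 * u[1:-1] + u[:-2]
        u_next = torch.zeros_like(u)  # ends stay clamped at zero
        u_next[1:-1] = a * (2 * u[1:-1]
                            - (1 - sigma * k) * u_prev[1:-1]
                            + lam2 * lap)
        u_prev, u = u, u_next
        out[n] = u[read]
    return out
```

Moving the state tensors to a GPU device and adding a leading batch dimension to `u` would parallelize the update across batch and spatial dimensions, which is the direction the abstract describes.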

Sound Terminology Describing Production and Perception of Sonification

  • paper_url: http://arxiv.org/abs/2312.00091
  • repo_url: None
  • paper_authors: Tim Ziemer
  • for: Addressing terminology discrepancies in sonification research, in order to facilitate communication and collaboration between researchers from different disciplines.
  • methods: Consults the literature on interdisciplinary research and discourse, identifies the problems that occur in sonification, and applies the recommended solutions.
  • results: Recommends considering three aspects of sonification design individually, namely Sound Design Concept, Objective and Method, and provides concrete terminology and explanations for each aspect to support interdisciplinary discourse.
    Abstract Sonification research is intrinsically interdisciplinary. Consequently, a proper documentation of, and interdisciplinary discourse about a sonification is often hindered by terminology discrepancies between involved disciplines, i.e., the lack of a common sound terminology in sonification research. Without a common ground, a researcher from one discipline may have troubles understanding the implementation and imagining the resulting sound perception of a sonification, if the sonification is described by a researcher from another discipline. To find a common ground, I consulted literature on interdisciplinary research and discourse, identified problems that occur in sonification, and applied the recommended solutions. As a result, I recommend considering three aspects of sonification individually, namely 1.) Sound Design Concept, 2.) Objective and 3.) Method, clarifying which discipline is involved in which aspect, and sticking to this discipline's terminology. As two requirements of sonifications are that they are a) reproducible and b) interpretable, I recommend documenting and discussing every sonification design once using audio engineering terminology, and once using psychoacoustic terminology. The appendix provides comprehensive lists of sound terms from both disciplines, together with relevant literature and a clarification of often misunderstood and misused terms.

Audio Prompt Tuning for Universal Sound Separation

  • paper_url: http://arxiv.org/abs/2311.18399
  • repo_url: https://github.com/redrabbit94/apt-uss
  • paper_authors: Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang
  • for: Improving the separation performance and robustness of existing universal sound separation systems.
  • methods: Audio prompt tuning (APT): training a small number of prompt parameters with limited audio samples, while keeping the parameters of the sound separation model frozen to preserve its generalization.
  • results: On the MUSDB18 and ESC-50 datasets, APT improves signal-to-distortion ratio over the baseline model by 0.67 dB and 2.06 dB respectively; with only 5 audio samples, APT even outperforms the baseline system trained on the full ESC-50 training data.
    Abstract Universal sound separation (USS) is a task to separate arbitrary sounds from an audio mixture. Existing USS systems are capable of separating arbitrary sources, given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and the robustness is not always guaranteed. In this work, we propose audio prompt tuning (APT), a simple yet effective approach to enhance existing USS systems. Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen. We evaluate the proposed method on MUSDB18 and ESC-50 datasets. Compared with the baseline model, APT can improve the signal-to-distortion ratio performance by 0.67 dB and 2.06 dB using the full training set of two datasets. Moreover, APT with only 5 audio samples even outperforms the baseline systems utilizing full training data on the ESC-50 dataset, indicating the great potential of few-shot APT.
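
The core recipe, training a handful of prompt parameters while the separation model stays frozen, can be sketched as follows. The query-conditioned `uss_model` interface, the prepend-style prompt injection and all hyperparameters are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AudioPromptTuning(nn.Module):
    """Wraps a frozen query-based separation model with a small set of
    trainable prompt vectors (a hypothetical interface; the paper's
    exact parameterization is not reproduced here)."""
    def __init__(self, uss_model, embed_dim, n_prompts=4):
        super().__init__()
        self.uss = uss_model
        for p in self.uss.parameters():   # freeze the USS backbone
            p.requires_grad_(False)
        # The only trainable parameters: a few prompt embeddings.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, embed_dim))

    def forward(self, mixture, query_embed):
        # query_embed: (batch, seq, embed_dim); prepend the prompts.
        batch = query_embed.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.uss(mixture, torch.cat([prompts, query_embed], dim=1))

# Only the prompt parameters receive gradient updates:
# model = AudioPromptTuning(pretrained_uss, embed_dim=512)
# optim = torch.optim.Adam([model.prompts], lr=1e-3)
```

Because the backbone is frozen, a handful of labeled examples of the target source is enough to fit the prompts, which is consistent with the few-shot results the abstract reports.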