eess.AS - 2023-07-27

Audio Inputs for Active Speaker Detection and Localization via Microphone Array

  • paper_url: http://arxiv.org/abs/2307.14739
  • repo_url: None
  • paper_authors: Davide Berghi, Philip J. B. Jackson
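  • for: This study addresses detecting an active talker and locating their horizontal position from multichannel audio captured by a microphone array, a task the authors call active speaker detection and localization (ASDL).
  • methods: A convolutional recurrent neural network (CRNN) takes spatial acoustic features extracted from the multichannel audio as input; the features are compared across different numbers of channels and levels of additive noise.
  • results: Experiments compare GCC-PHAT, SALSA features, and a recently proposed beamforming method for robustness at various noise intensities; the results and statistical significance tests on the TragicTalkers dataset show how the microphones contribute to performance.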
    Abstract This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently-proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance demonstrate the microphones' contribution to performance on the TragicTalkers dataset, which offers opportunities to investigate audio-visual approaches in the future.
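For readers unfamiliar with GCC-PHAT, one of the input features compared in the abstract above, the following is a minimal NumPy sketch of the standard two-microphone formulation. It is not the authors' feature-extraction code: the function name `gcc_phat`, the `max_tau` parameter, and the small constant `1e-12` are illustrative assumptions.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    Returns (tau, cc), where cc is the cross-correlation over the searched lags.
    Illustrative sketch only; names and defaults are assumptions.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = sig.shape[0] + ref.shape[0]
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum; its phase encodes the inter-channel delay.
    cross = sig_f * np.conj(ref_f)

    # PHAT weighting: discard the magnitude and keep only the phase.
    cross /= np.abs(cross) + 1e-12

    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so the lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))

    # The peak position gives the delay in samples.
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc
```

With this sign convention, a positive `tau` means `sig` arrives later than `ref`. In a multi-microphone setting, features of this kind are typically computed for each microphone pair, so the number of pairs grows with the number of channels used.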

Physics Informed Neural Network for Head-Related Transfer Function Upsampling

  • paper_url: http://arxiv.org/abs/2307.14650
  • repo_url: https://github.com/feima0011/physics-informed-neural-network-for-head-related-transfer-function-upsampling
  • paper_authors: Fei Ma, Thushara D. Abhayapala, Prasanga N. Samarasinghe, Xingyu Chen
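  • for: Improve the realism of virtual acoustic experiences by upsampling HRTFs with a physics-informed neural network (PINN).
  • methods: A PINN constrained by the Helmholtz equation exploits the physics of HRTFs during upsampling, rather than relying on the measured HRTFs alone.
  • results: Across several datasets, the PINN method outperforms spherical harmonic (SH) methods in both interpolation and extrapolation scenarios, and the network is sized so that it does not suffer from under-fitting or over-fitting.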
    Abstract Head-related transfer functions (HRTFs) capture the spatial and spectral features that a person uses to localize sound sources in space and thus are vital for creating an authentic virtual acoustic experience. However, practical HRTF measurement systems can only provide an incomplete measurement of a person's HRTFs, and this necessitates HRTF upsampling. This paper proposes a physics-informed neural network (PINN) method for HRTF upsampling. Unlike other upsampling methods which are based on the measured HRTFs only, the PINN method exploits the Helmholtz equation as additional information for constraining the upsampling process. This helps the PINN method to generate physically amiable upsamplings which generalize beyond the measured HRTFs. Furthermore, the width and the depth of the PINN are set according to the dimensionality of HRTFs under spherical harmonic (SH) decomposition and the Helmholtz equation. This makes the PINN have an appropriate level of expressiveness and thus does not suffer from under-fitting and over-fitting problems. Numerical experiments confirm the superior performance of the PINN method for HRTF upsampling in both interpolation and extrapolation scenarios over several datasets in comparison with the SH methods.
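    Summary Head-related transfer functions (HRTFs) capture the spatial and spectral features of sound sources and are therefore key to creating a realistic virtual acoustic scene. Practical HRTF measurement systems, however, provide only incomplete HRTF measurements, which makes HRTF upsampling necessary. This paper proposes a physics-informed neural network (PINN) method for HRTF upsampling. Unlike other upsampling methods, the PINN method uses the Helmholtz equation as additional information to constrain the upsampling process. This helps the PINN method generate physically amiable upsamplings that generalize beyond the measured HRTFs. In addition, the width and depth of the PINN are set according to the dimensionality of the HRTFs under spherical harmonic (SH) decomposition and the Helmholtz equation, which gives the PINN an appropriate level of expressiveness and keeps it from over-fitting or under-fitting. Numerical experiments show that the PINN method upsamples HRTFs better than the SH methods on several datasets, in both interpolation and extrapolation scenarios.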
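As a reminder of the constraint the abstract refers to, the Helmholtz equation and a generic PINN objective can be written as below. This is a sketch of the usual physics-informed formulation, not necessarily the loss used in the paper: the weight \lambda, the collocation points \mathbf{x}_m, and the network output \hat{p} are assumed notation.

```latex
% Homogeneous Helmholtz equation for the sound pressure p at position x,
% with wavenumber k = 2\pi f / c (f: frequency, c: speed of sound)
\nabla^2 p(\mathbf{x}, k) + k^2 p(\mathbf{x}, k) = 0

% Generic PINN objective (assumed form): data misfit at the N measured
% positions plus the Helmholtz residual at M collocation points
\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{p}(\mathbf{x}_n, k) - p(\mathbf{x}_n, k) \right|^2
            + \lambda \, \frac{1}{M} \sum_{m=1}^{M} \left| \nabla^2 \hat{p}(\mathbf{x}_m, k) + k^2 \hat{p}(\mathbf{x}_m, k) \right|^2
```

The second term penalizes outputs that violate the wave physics at positions where no measurement exists, which is how a PINN can generalize beyond the measured HRTFs.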

NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals

  • paper_url: http://arxiv.org/abs/2307.14303
  • repo_url: None
  • paper_authors: Zexu Pan, Marvin Borsdorf, Siqi Cai, Tanja Schultz, Haizhou Li
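  • for: Develop an EEG-based selective-listening model that extracts the attended speaker's signal from multi-talker speech with interfering voices.
  • methods: The model uses EEG signals to build a neuronal attractor that is temporally associated with the speech stimulus, and comes in offline and online variants, the latter designed for real-time inference. The online NeuroHeed adds an autoregressive speaker encoder that accumulates previously extracted speech to help extract the attended speaker in the current window.
  • results: Experiments show that NeuroHeed effectively extracts the attended speaker's signal, achieving high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.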
    Abstract Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities, which the latter can be measured using affordable and non-intrusive electroencephalography (EEG) devices. In this study, we present NeuroHeed, a speaker extraction model that leverages EEG signals to establish a neuronal attractor which is temporally associated with the speech stimulus, facilitating the extraction of the attended speech signal in a cocktail party scenario. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results demonstrate that NeuroHeed effectively extracts brain-attended speech signals, achieving high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.
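    Summary Humans can attend selectively, picking out one voice among competing voices and background noise; this ability is known as selective auditory attention. Recent studies in auditory neuroscience show a strong correlation between the attended speech signal and the neural activity it elicits in the brain, which can be measured with affordable, non-intrusive electroencephalography (EEG) devices. In this study, we present NeuroHeed, an EEG-based speaker extraction model that extracts the attended speech signal in a multi-talker listening scenario. We propose two versions of NeuroHeed, one offline and one online. For the online version, we also propose an autoregressive speaker encoder that accumulates previously extracted speech signals into a self-enrolled auditory attractor, which retains attentional momentum over time. Online NeuroHeed extracts the speech signal in the current window under the guidance of both attractors. Experimental results show that NeuroHeed effectively extracts brain-attended speech signals, achieving high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.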