for: solve more challenging scenarios of multi-channel recordings with multiple simultaneous talkers
methods: universal encoder designed for multiple tasks, compatible with any microphone array, and trained without labeled multi-channel data
results: consistently outperformed combinations like the WavLM model with the BeamformIt frontend in speech recognition and speaker diarization tasks

Abstract
The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the wide variety of microphone array setups in use, we present the UniX-Encoder: a universal encoder designed for multiple tasks, compatible with any microphone array, in both single- and multi-talker environments. Our research enhances previous multi-channel speech processing efforts in four key areas: 1) Adaptability: In contrast to traditional models constrained to specific microphone array configurations, our encoder is universally compatible. 2) Multi-Task Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts as a robust upstream model, adeptly extracting features for diverse tasks including ASR and speaker recognition. 3) Self-Supervised Training: The encoder is trained without requiring labeled multi-channel data. 4) End-to-End Integration: In contrast to models that first beamform and then process single channels, our encoder offers an end-to-end solution, bypassing explicit beamforming or separation. To validate its effectiveness, we tested the UniX-Encoder on a synthetic multi-channel dataset derived from the LibriSpeech corpus. Across tasks like speech recognition and speaker diarization, our encoder consistently outperformed combinations like the WavLM model with the BeamformIt frontend.
Covariance Blocking and Whitening Method for Successive Relative Transfer Function Vector Estimation in Multi-Speaker Scenarios
results: when using the estimated RTF vectors of both speakers in a linearly constrained minimum variance beamformer, simulation results using real-world recordings for multiple speaker positions demonstrate that the proposed CBW method outperforms the conventional BOP and covariance whitening methods in terms of signal-to-interferer-and-noise ratio improvement

Abstract
This paper addresses the challenge of estimating the relative transfer function (RTF) vectors of multiple speakers in a noisy and reverberant environment. More specifically, we consider a scenario where two speakers activate successively. In this scenario, the RTF vector of the first speaker can be estimated in a straightforward way, and the main challenge lies in estimating the RTF vector of the second speaker during segments where both speakers are simultaneously active. To estimate the RTF vector of the second speaker, the so-called blind oblique projection (BOP) method determines the oblique projection operator that optimally blocks the second speaker. Instead of blocking the second speaker, in this paper we propose a covariance blocking and whitening (CBW) method, which first blocks the first speaker, applies whitening using the estimated noise covariance matrix, and then estimates the RTF vector of the second speaker based on a singular value decomposition. When using the estimated RTF vectors of both speakers in a linearly constrained minimum variance beamformer, simulation results using real-world recordings for multiple speaker positions demonstrate that the proposed CBW method outperforms the conventional BOP and covariance whitening methods in terms of signal-to-interferer-and-noise ratio improvement.
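To make the whitening step concrete, the conventional covariance whitening baseline mentioned above can be sketched as follows. This is a minimal NumPy illustration of RTF estimation for a single active speaker (whiten the noisy covariance with the noise covariance, take the principal eigenvector, de-whiten, and normalize to a reference microphone); it is not the paper's full CBW pipeline, and the function name, matrix shapes, and synthetic test setup are assumptions for illustration only.

```python
import numpy as np

def estimate_rtf_cw(Ry, Rn, ref=0):
    """Covariance-whitening RTF estimate (single-speaker sketch).

    Ry : (M, M) Hermitian covariance of the noisy speech segment
    Rn : (M, M) Hermitian covariance of the noise-only segment
    ref: index of the reference microphone
    """
    L = np.linalg.cholesky(Rn)            # Rn = L @ L^H
    Linv = np.linalg.inv(L)
    Rw = Linv @ Ry @ Linv.conj().T        # whitened covariance
    _, eigvecs = np.linalg.eigh(Rw)       # eigenvalues in ascending order
    u = eigvecs[:, -1]                    # principal eigenvector
    a = L @ u                             # de-whiten back to sensor space
    return a / a[ref]                     # RTF: unit gain at reference mic
```

On a synthetic rank-one speech covariance plus diagonal noise, this recovers the true steering direction up to the reference-microphone normalization; the CBW method of the paper additionally applies a blocking matrix for the first speaker before this whitening-and-decomposition step.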