results: Matches the performance of an oracle binaural LCMV beamformer in a non-low-latency configuration while requiring only 2 ms of latency.
Abstract
Speech enhancement in hearing aids is a challenging task since the hardware limits the number of possible operations and the latency must be in the range of only a few milliseconds. We propose a deep-learning model compatible with these limitations, which we refer to as Group-Communication Filter-and-Sum Network (GCFSnet). GCFSnet is a causal multiple-input single-output enhancement model that uses filter-and-sum processing in the time-frequency domain and a multi-frame deep post filter. All filters are complex-valued and are estimated by a deep-learning model that uses weight sharing through Group Communication and quantization-aware training to reduce model size and computational footprint. To further increase performance, a low-bit-rate binaural link transmitting delayed binaural features is proposed, which exploits binaural information while retaining a latency of 2 ms. In terms of objective metrics, even a unilateral configuration of the GCFSnet matches the performance of an oracle binaural LCMV beamformer in a non-low-latency configuration.
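The abstract describes two processing stages: complex-valued filter-and-sum processing in the time-frequency domain followed by a causal multi-frame post filter. Below is a minimal NumPy sketch of these two stages under assumed array shapes and with the filters given as inputs; in GCFSnet the filters would be predicted by the quantized deep-learning model, and the function names here are purely illustrative.

```python
import numpy as np

def filter_and_sum(stft_mix, filters):
    """Apply per-channel complex filters in the T-F domain and sum over channels.

    stft_mix: complex array, shape (channels, frames, freq_bins)
    filters:  complex array of the same shape (assumed to come from a DNN)
    returns:  enhanced single-channel spectrogram, shape (frames, freq_bins)
    """
    return np.sum(filters * stft_mix, axis=0)

def multi_frame_post_filter(spec, pf_filters):
    """Causal multi-frame post filter: combine the current and past frames
    of the beamformed spectrogram with complex per-frequency weights.

    spec:       complex array, shape (frames, freq_bins)
    pf_filters: complex array, shape (frames, n_taps, freq_bins)
    """
    frames, n_taps, _ = pf_filters.shape
    out = np.zeros_like(spec)
    for t in range(frames):
        for k in range(n_taps):
            idx = t - k
            if idx >= 0:  # causal: only current and past frames contribute
                out[t] += pf_filters[t, k] * spec[idx]
    return out

# Toy example with hypothetical sizes: 2 channels, 10 frames, 257 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10, 257)) + 1j * rng.standard_normal((2, 10, 257))
W = rng.standard_normal((2, 10, 257)) + 1j * rng.standard_normal((2, 10, 257))
Y = filter_and_sum(X, W)  # (10, 257)
```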
Semi-supervised multi-channel speaker diarization with cross-channel attention
results: On the CHiME-7 Mixer6 dataset, our system achieves a 57.01% relative DER reduction over the clustering-based model on the development set. On the CHiME-6 dataset, when using 80% and 50% labeled training data, our system performs comparably to training with 100% labeled data.
Abstract
Most neural speaker diarization systems rely on sufficient amounts of manually labeled training data, which are hard to collect in real-world scenarios. This paper proposes a semi-supervised speaker diarization system that utilizes large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to better learn the channel-contextual information of speaker embeddings. Experimental results on the CHiME-7 Mixer6 dataset, whose training set contains labels for only part of the speakers, show that our system achieves a 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of partially missing training-set labels. When using 80% and 50% labeled training data, our system performs comparably to the results obtained with 100% labeled data.
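The abstract introduces cross-channel attention so that speaker-embedding features from different microphone channels can exchange contextual information. Below is a minimal, hypothetical PyTorch sketch of attention applied across the channel axis; the class name, embedding dimension, head count, and tensor layout are assumptions for illustration and are not taken from the NSD-MA-MSE paper.

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Illustrative sketch: self-attention over the microphone-channel axis,
    so each channel's frame embedding can attend to the other channels."""

    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (batch, channels, frames, embed_dim)
        b, c, t, d = x.shape
        # Fold frames into the batch so attention runs across channels only.
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)
        return x.reshape(b, t, c, d).permute(0, 2, 1, 3)  # back to (b, c, t, d)

# Toy usage: batch of 2, 4 channels, 50 frames, 256-dim embeddings.
x = torch.randn(2, 4, 50, 256)
y = CrossChannelAttention()(x)  # same shape as x
```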