results: Matches the performance of an oracle binaural LCMV beamformer in a non-low-latency configuration while requiring only 2 ms of latency.
Abstract
Speech enhancement in hearing aids is a challenging task since the hardware limits the number of possible operations and the latency must be in the range of only a few milliseconds. We propose a deep-learning model compatible with these limitations, which we refer to as Group-Communication Filter-and-Sum Network (GCFSnet). GCFSnet is a causal multiple-input single-output enhancement model that uses filter-and-sum processing in the time-frequency domain and a multi-frame deep post filter. All filters are complex-valued and are estimated by a deep-learning model that uses weight sharing through Group Communication and quantization-aware training to reduce model size and computational footprint. To further increase performance, a low-bit-rate binaural link transmitting delayed binaural features is proposed, which exploits binaural information while retaining a latency of 2 ms. In terms of objective metrics, even a unilateral configuration of the GCFSnet matches the performance of an oracle binaural LCMV beamformer in a non-low-latency configuration.
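The abstract describes two processing stages: complex-valued filter-and-sum processing in the time-frequency domain followed by a causal multi-frame post filter. Below is a minimal NumPy sketch of these two stages under assumed array shapes and with the filters given as inputs; in GCFSnet the filters would be predicted by the quantized deep-learning model, and the function names here are purely illustrative.

```python
import numpy as np

def filter_and_sum(stft_mix, filters):
    """Apply per-channel complex filters in the T-F domain and sum over channels.

    stft_mix: complex array, shape (channels, frames, freq_bins)
    filters:  complex array of the same shape (assumed to come from a DNN)
    returns:  enhanced single-channel spectrogram, shape (frames, freq_bins)
    """
    return np.sum(filters * stft_mix, axis=0)

def multi_frame_post_filter(spec, pf_filters):
    """Causal multi-frame post filter: combine the current and past frames
    of the beamformed spectrogram with complex per-frequency weights.

    spec:       complex array, shape (frames, freq_bins)
    pf_filters: complex array, shape (frames, n_taps, freq_bins)
    """
    frames, n_taps, _ = pf_filters.shape
    out = np.zeros_like(spec)
    for t in range(frames):
        for k in range(n_taps):
            idx = t - k
            if idx >= 0:  # causal: only current and past frames contribute
                out[t] += pf_filters[t, k] * spec[idx]
    return out

# Toy example with hypothetical sizes: 2 channels, 10 frames, 257 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10, 257)) + 1j * rng.standard_normal((2, 10, 257))
W = rng.standard_normal((2, 10, 257)) + 1j * rng.standard_normal((2, 10, 257))
Y = filter_and_sum(X, W)  # (10, 257)
```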
Semi-supervised multi-channel speaker diarization with cross-channel attention
results: On the CHiME-7 Mixer6 dataset, our system achieves a 57.01% relative DER reduction over the clustering-based model on the development set. On the CHiME-6 dataset, when using 80% and 50% labeled training data, our system performs comparably to training with 100% labeled data.
Abstract
Most neural speaker diarization systems rely on sufficient amounts of manually labeled training data, which are hard to collect in real-world scenarios. This paper proposes a semi-supervised speaker diarization system that utilizes large-scale multi-channel training data by generating pseudo-labels for unlabeled data. Furthermore, we introduce cross-channel attention into Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding (NSD-MA-MSE) to better learn the channel-contextual information of speaker embeddings. Experimental results on the CHiME-7 Mixer6 dataset, whose training set contains labels for only part of the speakers, show that our system achieves a 57.01% relative DER reduction compared to the clustering-based model on the development set. We further conducted experiments on the CHiME-6 dataset to simulate the scenario of partially missing training-set labels. When using 80% and 50% labeled training data, our system performs comparably to the results obtained with 100% labeled data.
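The abstract introduces cross-channel attention so that speaker-embedding features from different microphone channels can exchange contextual information. Below is a minimal, hypothetical PyTorch sketch of attention applied across the channel axis; the class name, embedding dimension, head count, and tensor layout are assumptions for illustration and are not taken from the NSD-MA-MSE paper.

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Illustrative sketch: self-attention over the microphone-channel axis,
    so each channel's frame embedding can attend to the other channels."""

    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (batch, channels, frames, embed_dim)
        b, c, t, d = x.shape
        # Fold frames into the batch so attention runs across channels only.
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)
        return x.reshape(b, t, c, d).permute(0, 2, 1, 3)  # back to (b, c, t, d)

# Toy usage: batch of 2, 4 channels, 50 frames, 256-dim embeddings.
x = torch.randn(2, 4, 50, 256)
y = CrossChannelAttention()(x)  # same shape as x
```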