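results: Single-channel speech enhancement improves keyword spotting accuracy on noisy speech when the backend model is trained on clean speech, but it is difficult to improve accuracy with SE when the backend is trained on noisy speech.
Abstract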
Noise robustness is a key aspect of successful speech applications. Speech enhancement (SE) has been investigated to improve automatic speech recognition accuracy; however, its effectiveness for keyword spotting (KWS) is still under-investigated. In this paper, we conduct a comprehensive study on single-channel speech enhancement for keyword spotting on the Google Speech Command (GSC) dataset. To investigate robustness to noise, the GSC dataset is augmented with noise signals from the WSJ0 Hipster Ambient Mixtures (WHAM!) noise dataset. Our investigation includes not only applying SE before KWS but also performing joint training of the SE frontend and KWS backend models. Moreover, we explore audio injection, a common approach to reduce distortions by using a weighted average of the enhanced and original signals. Audio injection is then further optimized by using another model that predicts the weight for each utterance. Our investigation reveals that SE can improve KWS accuracy on noisy speech when the backend model is trained on clean speech; however, despite our extensive exploration, it is difficult to improve the KWS accuracy with SE when the backend is trained on noisy speech.
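The audio injection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name inject_audio, the convention that the weight multiplies the enhanced signal, and the use of plain Python lists are all assumptions made here. In the paper the weight is either fixed or predicted per utterance by a separate model, which is not reproduced.

```python
def inject_audio(enhanced, noisy, weight):
    """Blend enhanced and original (noisy) samples with one utterance-level weight.

    Assumption: `weight` scales the enhanced signal and (1 - weight) scales
    the original signal; the abstract only says "weighted average".
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(enhanced) != len(noisy):
        raise ValueError("signals must have the same length")
    # Sample-wise convex combination of the two signals.
    return [weight * e + (1.0 - weight) * n for e, n in zip(enhanced, noisy)]


if __name__ == "__main__":
    # weight = 1.0 keeps only the enhanced signal, weight = 0.0 bypasses SE.
    print(inject_audio([1.0, 0.0], [0.0, 1.0], 0.25))
```

A weight near 0 passes the original signal through almost unchanged, which limits the distortion that SE can introduce at the cost of leaving more noise in the input to the KWS backend.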
Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression
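results: Compared with standalone NN and Kalman filter methods, the proposed method achieves better AHS performance; experimental evaluations validate its effectiveness.
Abstract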
Acoustic howling suppression (AHS) is a critical challenge in audio communication systems. In this paper, we propose a novel approach that leverages the power of neural networks (NN) to enhance the performance of traditional Kalman filter algorithms for AHS. Specifically, our method integrates NN modules into the Kalman filter to refine the reference signal, a key factor in effective adaptive filtering, and to estimate covariance metrics for the filter, which are crucial for adaptability in dynamic conditions. As a result, the proposed method achieves improved AHS performance compared to both standalone NN and Kalman filter methods. Experimental evaluations validate the effectiveness of our approach.
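To make the roles of the reference signal and the covariances concrete, here is a toy scalar Kalman update for a single-tap feedback path. It is a sketch under strong assumptions and not the filter used in the paper, which the abstract does not specify: the state is one real coefficient, and the quantities an NN would supply (the refined reference x_ref and the noise variances q_proc and r_obs) are passed in as plain numbers. All names are hypothetical.

```python
def kalman_step(w, p, x_ref, d_mic, q_proc, r_obs):
    """One scalar Kalman update of a single-tap feedback path estimate.

    Assumed model (illustrative only):
        w_k = w_{k-1} + process noise with variance q_proc
        d_mic = w_k * x_ref + observation noise with variance r_obs
    In an NN-augmented filter, x_ref, q_proc and r_obs would come from
    network modules; here they are ordinary arguments.
    """
    p_pred = p + q_proc                       # predicted state variance
    err = d_mic - w * x_ref                   # prior error (feedback removed)
    gain = p_pred * x_ref / (x_ref * x_ref * p_pred + r_obs)
    w_new = w + gain * err                    # corrected path estimate
    p_new = (1.0 - gain * x_ref) * p_pred     # corrected state variance
    return w_new, p_new, err


if __name__ == "__main__":
    w, p = 0.0, 1.0
    for x in [1.0, -0.5, 0.8] * 20:
        # Noise-free toy data generated with a true coefficient of 0.5.
        w, p, _ = kalman_step(w, p, x, 0.5 * x, q_proc=0.0, r_obs=1e-3)
    print(w, p)
```

The two variances control how quickly the estimate adapts: a larger q_proc keeps the gain high so the filter can track a changing path, while a larger r_obs makes it trust each observation less.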
Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks
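for: This study addresses the acoustic howling problem by proposing a training framework, built around a neural network (NN) module, that examines the fundamental formation process of howling.
methods: The framework integrates the NN module into the closed-loop system during training, using signals generated recursively on the fly to mimic the streaming process of acoustic howling suppression (AHS). Two approaches are explored within the framework: one relying exclusively on the NN, the other combining the NN with the traditional Kalman filter.
results: Experiments show that the framework suppresses acoustic howling effectively and offers a substantial improvement over previous methods.
Abstract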
In this paper, we introduce a novel training framework designed to comprehensively address the acoustic howling issue by examining its fundamental formation process. This framework integrates a neural network (NN) module into the closed-loop system during training with signals generated recursively on the fly to closely mimic the streaming process of acoustic howling suppression (AHS). The proposed recursive training strategy bridges the gap between training and real-world inference scenarios, marking a departure from previous NN-based methods that typically approach AHS as either noise suppression or acoustic echo cancellation. Within this framework, we explore two methodologies: one exclusively relying on NN and the other combining NN with the traditional Kalman filter. Additionally, we propose strategies, including howling detection and initialization using pre-trained offline models, to bolster trainability and expedite the training process. Experimental results validate that this framework offers a substantial improvement over previous methodologies for acoustic howling suppression.
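The idea of generating signals recursively inside the closed loop can be illustrated with a toy simulation. This is only a sketch under assumed conditions: the feedback path is a single delayed tap, the amplifier is a constant gain, and the suppress callable stands in for the NN module. The names simulate_closed_loop, amp_gain, path_gain and delay are hypothetical and do not come from the paper.

```python
def simulate_closed_loop(source, suppress, amp_gain=2.0, path_gain=0.6, delay=4):
    """Generate microphone and loudspeaker signals one sample at a time.

    Each microphone sample contains the loudspeaker output from `delay`
    samples earlier, so the input seen by `suppress` depends on its own
    past outputs. A fixed, pre-computed dataset cannot capture this.
    """
    speaker = [0.0] * len(source)
    mic = [0.0] * len(source)
    for t in range(len(source)):
        # Acoustic feedback: delayed, attenuated loudspeaker signal.
        feedback = path_gain * speaker[t - delay] if t >= delay else 0.0
        mic[t] = source[t] + feedback
        # The suppression module sits inside the loop, before amplification.
        speaker[t] = amp_gain * suppress(mic[t])
    return mic, speaker


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 39
    raw, _ = simulate_closed_loop(impulse, lambda m: m)          # loop gain 1.2
    damped, _ = simulate_closed_loop(impulse, lambda m: 0.5 * m)  # loop gain 0.6
    print(abs(raw[36]), abs(damped[36]))
```

With the identity module the loop gain exceeds one and the impulse keeps growing, which is the howling condition. Halving the signal inside the loop brings the gain below one and the recirculating signal decays.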
Multichannel Voice Trigger Detection Based on Transform-average-concatenate
results: Compared with the baseline channel selection approach, the proposed method reduces the false rejection rate (FRR) by up to 30%.
Abstract
Voice triggering (VT) enables users to activate their devices by just speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and produces multiple enhanced and/or separated signals. Since conventional VT systems take only single-channel audio as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if the discarded channels could contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output from the front-end is fed directly into a VT model. We adopt a transform-average-concatenate (TAC) block and modify the TAC block by incorporating the channel from the conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to 30% reduction in the false rejection rate compared to the baseline channel selection approach.
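The three steps named by transform-average-concatenate can be sketched as follows. This is a schematic of the generic TAC pattern only, written with plain lists and caller-supplied functions in place of learned layers. The function name tac_block and its arguments are assumptions, and the paper's modification that incorporates the channel from channel selection is not reproduced.

```python
def tac_block(channels, transform, avg_transform, out_transform):
    """Apply transform, average, and concatenate to per-channel feature vectors.

    channels: list of feature vectors (lists of floats), one per channel.
    The three callables stand in for learned layers (hypothetical here).
    """
    # 1) Transform: the same function is applied to every channel.
    transformed = [transform(ch) for ch in channels]
    # 2) Average: element-wise mean across channels, then a second function.
    n, dim = len(transformed), len(transformed[0])
    mean = [sum(ch[i] for ch in transformed) / n for i in range(dim)]
    shared = avg_transform(mean)
    # 3) Concatenate: each channel is joined with the shared summary.
    return [out_transform(ch + shared) for ch in transformed]


if __name__ == "__main__":
    identity = lambda v: list(v)
    print(tac_block([[1.0, 2.0], [3.0, 4.0]], identity, identity, identity))
```

Because the average does not depend on channel order, every channel receives the same summary of all the others, so no channel has to be discarded before the model sees it.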
DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion
results: Compared with DualVC and other baseline systems, DualVC 2 performs better on both subjective and objective metrics, with a latency of only 186.4 ms.
Abstract
Voice conversion is becoming increasingly popular, and a growing number of application scenarios require models with streaming inference capabilities. The recently proposed DualVC attempts to achieve this objective through streaming model architecture design and intra-model knowledge distillation along with hybrid predictive coding to compensate for the lack of future information. However, DualVC encounters several problems that limit its performance. First, the autoregressive decoder has error accumulation in its nature and limits the inference speed as well. Second, the causal convolution enables streaming capability but cannot sufficiently use future information within chunks. Third, the model is unable to effectively address the noise in the unvoiced segments, lowering the sound quality. In this paper, we propose DualVC 2 to address these issues. Specifically, the model backbone is migrated to a Conformer-based architecture, empowering parallel inference. Causal convolution is replaced by non-causal convolution with dynamic chunk mask to make better use of within-chunk future information. Also, quiet attention is introduced to enhance the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC and other baseline systems in both subjective and objective metrics, with only 186.4 ms latency. Our audio samples are made publicly available.
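One plausible reading of "non-causal convolution with dynamic chunk mask" is a centered convolution whose taps are zeroed whenever they would reach past the end of the current chunk. The sketch below implements that reading only. The abstract does not give the exact masking rule, so the function name chunk_masked_conv, the unrestricted left context, and the odd-length kernel are assumptions.

```python
def chunk_masked_conv(frames, kernel, chunk_size):
    """Centered 1-D convolution that cannot see beyond the current chunk.

    Frame t belongs to chunk t // chunk_size. Taps that point to a later
    chunk, or outside the sequence, contribute nothing. Past frames are
    always visible (an assumption of this sketch).
    """
    half = len(kernel) // 2  # assumes an odd-length kernel
    out = []
    for t in range(len(frames)):
        # Exclusive index of the first frame after the current chunk.
        chunk_end = min((t // chunk_size + 1) * chunk_size, len(frames))
        acc = 0.0
        for k, w in enumerate(kernel):
            j = t + k - half
            if 0 <= j < chunk_end:  # mask future chunks and padding
                acc += w * frames[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(chunk_masked_conv(x, [1.0, 1.0, 1.0], chunk_size=3))
    print(chunk_masked_conv(x, [1.0, 1.0, 1.0], chunk_size=6))
```

Under this reading, a chunk size of 1 reduces to a causal convolution and a chunk size equal to the sequence length gives an ordinary non-causal one. Varying the chunk size during training would let a single model cover both streaming and non-streaming use.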