results: The results show that ASR-level training of the SE front-end reduces artifact errors but increases noise errors. Moreover, simply interpolating the enhanced and observed signals achieves a similar effect of reducing artifacts while increasing noise, without modifying the SE and ASR modules. These results provide a better understanding of the effect of joint training and a new idea for designing ASR-agnostic SE front-ends.
Abstract
Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. The experimental analyses provide two novel findings: 1) ASR-level training of the SE front-end reduces the artifact errors while increasing the noise errors, and 2) simply interpolating the enhanced and observed signals, which achieves a similar effect of reducing artifacts and increasing noise, improves ASR performance without jointly modifying the SE and ASR modules, even for a strong ASR back-end using a WavLM feature extractor. Our findings provide a better understanding of the effect of joint training and a novel insight for designing an ASR-agnostic SE front-end.
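As a minimal sketch of the signal interpolation described above (the mixing weight alpha and the function name are illustrative assumptions, not values from the paper):

```python
import numpy as np

def interpolate_signals(enhanced: np.ndarray,
                        observed: np.ndarray,
                        alpha: float = 0.8) -> np.ndarray:
    """Blend the SE output with the unprocessed observation.

    A larger alpha keeps more of the enhanced signal; a smaller alpha
    reinjects the observation, trading artifact errors for noise errors.
    alpha = 0.8 is a placeholder value, not the paper's setting.
    """
    assert enhanced.shape == observed.shape, "signals must be time-aligned"
    return alpha * enhanced + (1.0 - alpha) * observed
```

Because the interpolation operates purely on waveforms, it can be applied to any SE/ASR pair without retraining either module, which is what makes the front-end ASR-agnostic.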
Neural network-based virtual microphone estimation with virtual microphone and beamformer-level multi-task loss
results: Under multi-talker underdetermined conditions, the proposed multi-task NN-VME achieves a 33.1% relative WER improvement over using only real microphones and a 10.8% relative improvement over a prior NN-VME approach.
Abstract
Array processing performance depends on the number of microphones available. Virtual microphone estimation (VME) has been proposed to increase the number of microphone signals artificially. Neural network-based VME (NN-VME) trains an NN with a VM-level loss to predict a signal at a microphone location that is available during training but not at inference. However, this training objective may not be optimal for a specific array processing back-end, such as beamforming. An alternative approach is to use a training objective considering the array-processing back-end, such as a loss on the beamformer output. This approach may generate signals optimal for beamforming but not physically grounded. To combine the advantages of both approaches, this paper proposes a multi-task loss for NN-VME that combines both VM-level and beamformer-level losses. We evaluate the proposed multi-task NN-VME on multi-talker underdetermined conditions and show that it achieves a 33.1 % relative WER improvement compared to using only real microphones and 10.8 % compared to using a prior NN-VME approach.
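A rough sketch of how such a multi-task objective could be composed (the L1 losses, the weight `lam`, and all tensor names are assumptions for illustration, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def multitask_vme_loss(vm_pred: torch.Tensor,
                       vm_target: torch.Tensor,
                       bf_pred: torch.Tensor,
                       bf_target: torch.Tensor,
                       lam: float = 0.5) -> torch.Tensor:
    """Weighted sum of a VM-level and a beamformer-level loss.

    vm_pred / vm_target: estimated vs. recorded virtual-microphone signals.
    bf_pred / bf_target: beamformer outputs computed with the estimated
    vs. true virtual microphone included in the array.
    lam trades physical grounding (VM term) against back-end optimality
    (beamformer term); 0.5 is an arbitrary placeholder.
    """
    loss_vm = F.l1_loss(vm_pred, vm_target)
    loss_bf = F.l1_loss(bf_pred, bf_target)
    return lam * loss_vm + (1.0 - lam) * loss_bf
```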
APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra
paper_authors: Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
for: Improving the practicality of high-quality speech synthesis
methods: Adopt ConvNeXt v2 as the backbone network for amplitude and phase prediction, and introduce a multi-resolution discriminator (MRD) into the GAN-based losses
results: At a common configuration (a 22.05 kHz sampling rate and a 256-point spectral frame shift, i.e., about 11.6 ms), the proposed APNet2 vocoder achieves synthesized speech quality comparable to HiFi-GAN and iSTFTNet while offering a significantly faster inference speed.
Abstract
In our previous work, we proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). APNet demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed. However, the performance of the APNet vocoder is constrained by the waveform sampling rate and spectral frame shift, limiting its practicality for high-quality speech synthesis. Therefore, this paper proposes an improved iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt v2 as the backbone network for amplitude and phase predictions, expecting to enhance the modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. At a common configuration with a waveform sampling rate of 22.05 kHz and spectral frame shift of 256 points (i.e., approximately 11.6 ms), our proposed APNet2 vocoder outperformed the original APNet and Vocos vocoders in terms of synthesized speech quality. The synthesized speech quality of APNet2 is also comparable to that of HiFi-GAN and iSTFTNet, while offering a significantly faster inference speed.
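To make the reconstruction step concrete, the following sketch assembles a complex spectrum from predicted amplitude and phase and inverts it with ISTFT; the 1024-point FFT size and the use of log-amplitude inputs are assumptions, while hop_length = 256 matches the configuration quoted above:

```python
import torch

def reconstruct_waveform(log_amp: torch.Tensor,
                         phase: torch.Tensor,
                         n_fft: int = 1024,
                         hop_length: int = 256) -> torch.Tensor:
    """Rebuild a waveform from predicted amplitude and phase spectra.

    log_amp, phase: (batch, n_fft // 2 + 1, frames) tensors, e.g. the
    outputs of the amplitude and phase prediction branches.
    hop_length = 256 gives the ~11.6 ms frame shift at 22.05 kHz.
    """
    spec = torch.exp(log_amp) * torch.exp(1j * phase)  # complex STFT
    window = torch.hann_window(n_fft, device=phase.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                       window=window)
```

Predicting all frames' amplitude and phase in parallel and paying a single ISTFT per utterance is what gives this family of vocoders its speed advantage over sample-by-sample autoregressive generation.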