eess.AS - 2023-07-13

Personalization for BERT-based Discriminative Speech Recognition Rescoring

  • paper_url: http://arxiv.org/abs/2307.06832
  • repo_url: None
  • paper_authors: Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
  • for: Improving recognition of personalized content, a persistent challenge in end-to-end speech recognition
  • methods: Uses personalized content in a neural rescoring step via three approaches: gazetteers, prompting, and a cross-attention based encoder-decoder model (a sketch of the prompting approach follows this entry)
  • results: On a test set with personalized named entities, each approach improves word error rate (WER) by over 10% against a neural rescoring baseline; natural language prompts improve WER by 7% without any training, at a marginal cost in generalization; gazetteers perform best, with a 10% WER improvement on the personalized set and a 1% WER improvement on a general test set
    Abstract Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized named entities to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10% against a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with a marginal loss in generalization. Overall, gazetteers were found to perform the best with a 10% improvement in word error rate (WER), while also improving WER on a general test set by 1%.
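The prompting approach can be illustrated with a short sketch: score each first-pass hypothesis with a BERT-style masked LM, prepending the user's personalized entities as a natural-language prompt, and interpolate with the first-pass score. This is a minimal illustration under stated assumptions, not the paper's implementation: the model choice (`bert-base-uncased`), the prompt wording, and the interpolation weight are all placeholders.

```python
# Minimal sketch of prompt-based personalized rescoring with a BERT-style
# masked LM. Model, prompt wording, and weight are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(text: str, prompt: str = "") -> float:
    """Sum of token log-probs with each token masked in turn (MLM scoring).
    Prompt tokens condition the score but are not themselves scored."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"] if prompt else []
    hyp_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    ids = [tokenizer.cls_token_id] + prompt_ids + hyp_ids + [tokenizer.sep_token_id]
    total, offset = 0.0, 1 + len(prompt_ids)  # position of first hypothesis token
    for i in range(offset, offset + len(hyp_ids)):
        masked = torch.tensor([ids])
        target = masked[0, i].item()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        total += torch.log_softmax(logits[0, i], dim=-1)[target].item()
    return total

def rescore(nbest, entities, weight=0.5):
    """nbest: list of (hypothesis, first_pass_score) pairs. Returns the best
    hypothesis after interpolating with the prompted MLM score."""
    prompt = "the user's contacts are: " + ", ".join(entities) + "."
    return max(nbest, key=lambda h: h[1] + weight * pseudo_log_likelihood(h[0], prompt))

# Usage: the prompt nudges the rescorer toward the in-contact-list spelling.
print(rescore([("call anya kovac", -4.2), ("call ana kovach", -4.0)],
              ["Anya Kovac", "Marek Duda"]))
```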

LACE: A light-weight, causal model for enhancing coded speech through adaptive convolutions

  • paper_url: http://arxiv.org/abs/2307.06610
  • repo_url: None
  • paper_authors: Jan Büthe, Jean-Marc Valin, Ahmed Mustafa
  • for: Enhancing the quality of coded speech with a low-complexity, zero-lookahead postfilter
  • methods: A deep neural network generates classical filter kernels on a per-frame basis, light enough for real-time execution on desktop or mobile device CPUs (a sketch of the per-frame adaptive filtering follows this entry)
  • results: Integrates into the Opus codec without added delay and enables effective wideband coding at bitrates down to 6 kb/s
    Abstract Classical speech coding uses low-complexity postfilters with zero lookahead to enhance the quality of coded speech, but their effectiveness is limited by their simplicity. Deep Neural Networks (DNNs) can be much more effective, but require high complexity and model size, or added delay. We propose a DNN model that generates classical filter kernels on a per-frame basis with a model of just 300 k parameters and 100 MFLOPS complexity, which is a practical complexity for desktop or mobile device CPUs. The lack of added delay allows it to be integrated into the Opus codec, and we demonstrate that it enables effective wideband encoding for bitrates down to 6 kb/s.
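A minimal sketch of the per-frame adaptive-convolution idea: a small causal network predicts short FIR kernels from frame features and applies them to the coded speech with no lookahead. Everything here (GRU feature network, frame length, kernel order, feature dimension) is an illustrative assumption, not the LACE architecture.

```python
# Sketch of per-frame adaptive filtering in the spirit of LACE: a causal
# network predicts short FIR kernels per frame and convolves them with the
# coded speech. Sizes below are assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME, TAPS = 160, 16  # 10 ms frames at 16 kHz; 16-tap kernels (assumed)

class AdaptiveConvEnhancer(nn.Module):
    def __init__(self, feat_dim=20, hidden=96):
        super().__init__()
        # Causal feature network: a GRU carries state frame to frame.
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_kernel = nn.Linear(hidden, TAPS)

    def forward(self, speech, feats):
        """speech: (B, n_frames * FRAME) coded samples; feats: (B, n_frames, feat_dim)."""
        h, _ = self.rnn(feats)                    # (B, n_frames, hidden)
        kernels = self.to_kernel(h)               # (B, n_frames, TAPS)
        B, n_frames, _ = kernels.shape
        x = F.pad(speech, (TAPS - 1, 0))          # left padding only: zero lookahead
        out = []
        for t in range(n_frames):
            seg = x[:, t * FRAME : t * FRAME + FRAME + TAPS - 1]
            # Apply this frame's predicted kernel to each batch item
            # in one call via a grouped 1-D convolution.
            y = F.conv1d(seg.unsqueeze(0), kernels[:, t].unsqueeze(1), groups=B)
            out.append(y.squeeze(0))
        return torch.cat(out, dim=-1)             # enhanced speech, same length

# Usage with random tensors, shapes only: one second of audio in two streams.
model = AdaptiveConvEnhancer()
enhanced = model(torch.randn(2, 100 * FRAME), torch.randn(2, 100, 20))
print(enhanced.shape)  # torch.Size([2, 16000])
```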