eess.AS - 2023-11-19

Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

  • paper_url: http://arxiv.org/abs/2311.11353
  • repo_url: None
  • paper_authors: Keqi Deng, Philip C. Woodland
  • for: Improving the generalisation of end-to-end automatic speech recognition (ASR), which tends to be biased towards its training data distribution, so that it performs well on out-of-domain speech.
  • methods: Proposes a label-synchronous neural transducer (LS-Transducer) that enables domain adaptation with text-only data. The LS-Transducer extracts a label-level encoder representation and combines it with the prediction network output; because blank tokens are no longer needed, the prediction network behaves as a standard language model and can be adapted with text-only data, without requiring large amounts of labelled speech. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation with low latency, so the model can be used for streaming (illustrative sketches follow the abstract below).
  • results: Compared to a standard neural transducer, the proposed LS-Transducer achieved a 12.9% relative WER reduction (WERR) on intra-domain LibriSpeech data, and 21.4% and 24.6% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network, while the streaming joint decoding method retains synchronisation with AIF.
    Abstract Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art recognition accuracy, it tends to be implicitly biased towards the training data distribution which can degrade generalisation. This paper proposes a label-synchronous neural transducer (LS-Transducer), which provides a natural approach to domain adaptation based on text-only data. The LS-Transducer extracts a label-level encoder representation before combining it with the prediction network output. Since blank tokens are no longer needed, the prediction network performs as a standard language model, which can be easily adapted using text-only data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining low latency operation that can be used for streaming. In addition, a streaming joint decoding method is designed to improve ASR accuracy while retaining synchronisation with AIF. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 12.9% relative WER reduction (WERR) for intra-domain LibriSpeech data, as well as 21.4% and 24.6% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
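The AIF mechanism itself is defined in the paper; as rough intuition, integrate-and-fire style mechanisms accumulate weighted frame-level encoder outputs until a threshold is reached and then "fire" one label-level vector. The sketch below illustrates only that general idea as a simplified CIF-style accumulation, not the paper's AIF; the names `integrate_and_fire`, `frame_weights`, and `fire_threshold` are illustrative assumptions.

```python
# Minimal sketch (not the paper's AIF): a simplified integrate-and-fire
# reduction that turns frame-level encoder outputs into label-level vectors.
import torch

def integrate_and_fire(encoder_frames: torch.Tensor,
                       frame_weights: torch.Tensor,
                       fire_threshold: float = 1.0) -> torch.Tensor:
    """encoder_frames: (T, D) frame-level encoder outputs.
    frame_weights:  (T,) per-frame weights in [0, 1], e.g. from a sigmoid.
    Returns a (U, D) tensor of label-level representations."""
    label_reprs = []
    acc_weight = 0.0
    acc_vector = torch.zeros(encoder_frames.size(1))
    for h_t, a_t in zip(encoder_frames, frame_weights):
        acc_weight += float(a_t)
        acc_vector = acc_vector + a_t * h_t
        if acc_weight >= fire_threshold:
            label_reprs.append(acc_vector)    # one label-level vector emitted
            acc_weight -= fire_threshold      # carry only the scalar residue
            acc_vector = torch.zeros(encoder_frames.size(1))
    if label_reprs:
        return torch.stack(label_reprs)
    return torch.zeros(0, encoder_frames.size(1))

# Example: 6 frames of 4-dim features whose weights sum to ~2.2
frames = torch.randn(6, 4)
weights = torch.tensor([0.3, 0.4, 0.4, 0.2, 0.5, 0.4])
print(integrate_and_fire(frames, weights).shape)  # torch.Size([2, 4])
```

Note that published integrate-and-fire variants split the weight of the firing frame between the current and the next label; the sketch keeps only the scalar residue for brevity.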
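Because the prediction network no longer has to emit blank tokens, it behaves like a standard language model, which is why text-only adaptation is straightforward. The sketch below shows one plausible way to fine-tune such a prediction network on target-domain text with next-token cross-entropy; the `prediction_network` interface and the hyper-parameters are assumptions for illustration, not the paper's adaptation recipe.

```python
# Minimal sketch: adapt a blank-free prediction network on text-only data.
import torch
import torch.nn.functional as F

def adapt_prediction_network(prediction_network, text_batches, vocab_size,
                             lr=1e-4, epochs=1):
    """Fine-tune with next-token cross-entropy on target-domain text
    (no audio required). Each batch is a (B, L) tensor of label ids."""
    optimiser = torch.optim.Adam(prediction_network.parameters(), lr=lr)
    prediction_network.train()
    for _ in range(epochs):
        for tokens in text_batches:
            inputs, targets = tokens[:, :-1], tokens[:, 1:]
            logits = prediction_network(inputs)  # assumed shape (B, L-1, V)
            loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return prediction_network
```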