Results: Compared to standard neural transducers, the proposed LS-Transducer achieved a 12.9% relative WER reduction (WERR) on intra-domain LibriSpeech data, as well as 21.4% and 24.6% relative WERRs on cross-domain data (TED-LIUM 2 and AESRC2020), while retaining synchronisation for streaming decoding.

Abstract
Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art recognition accuracy, it tends to be implicitly biased towards the training data distribution which can degrade generalisation. This paper proposes a label-synchronous neural transducer (LS-Transducer), which provides a natural approach to domain adaptation based on text-only data. The LS-Transducer extracts a label-level encoder representation before combining it with the prediction network output. Since blank tokens are no longer needed, the prediction network performs as a standard language model, which can be easily adapted using text-only data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining low latency operation that can be used for streaming. In addition, a streaming joint decoding method is designed to improve ASR accuracy while retaining synchronisation with AIF. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 12.9% relative WER reduction (WERR) for intra-domain LibriSpeech data, as well as 21.4% and 24.6% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
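To make the label-synchronous idea concrete, the following is a minimal PyTorch sketch of the two properties the abstract highlights: frame-level encoder states are converted into label-level states by an integrate-and-fire step, and these are then combined with a blank-free prediction network (an ordinary language model) in the joint network. The integrate-and-fire routine here follows the generic CIF recipe (accumulate per-frame weights until a threshold fires); the paper's Auto-regressive Integrate-and-Fire (AIF) mechanism and streaming joint decoding are not reproduced, so every class name, dimension, the sigmoid weight predictor and the 1.0 threshold are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a label-synchronous joiner; assumes a CIF-style
# integrate-and-fire step as a stand-in for the paper's AIF mechanism.
import torch
import torch.nn as nn


def integrate_and_fire(enc_out, weights, threshold=1.0):
    """Convert frame-level encoder states into label-level vectors.

    enc_out:  (T, D) acoustic encoder output for one utterance.
    weights:  (T,)   per-frame firing weights in (0, 1).
    Returns a (U, D) tensor with one vector per emitted label.
    """
    fired, acc = [], 0.0
    frame_acc = torch.zeros_like(enc_out[0])
    for t in range(enc_out.size(0)):
        w = weights[t]
        if acc + w < threshold:                  # keep integrating
            acc = acc + w
            frame_acc = frame_acc + w * enc_out[t]
        else:                                    # fire: emit one label-level vector
            remainder = threshold - acc
            fired.append(frame_acc + remainder * enc_out[t])
            acc = w - remainder                  # leftover weight starts the next label
            frame_acc = acc * enc_out[t]
    if not fired:
        return enc_out.new_zeros(0, enc_out.size(1))
    return torch.stack(fired)


class LabelSyncJoiner(nn.Module):
    """Blank-free joint network: label-level encoder states + LM states."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.weight_proj = nn.Linear(d_model, 1)     # per-frame firing weights
        self.joint = nn.Linear(d_model, vocab_size)  # no blank token needed

    def forward(self, enc_out, pred_out):
        # enc_out:  (T, D) frame-level encoder output.
        # pred_out: (U, D) prediction-network (LM) output, one state per label.
        weights = torch.sigmoid(self.weight_proj(enc_out)).squeeze(-1)
        label_enc = integrate_and_fire(enc_out, weights)   # (U', D)
        u = min(label_enc.size(0), pred_out.size(0))
        # Label-synchronous combination: both streams are indexed by label.
        return self.joint(torch.tanh(label_enc[:u] + pred_out[:u]))
```

Because both streams are indexed by label, the joint output is a U x V matrix rather than the T x U x V lattice of a standard neural transducer, which is why no blank token is required and why the prediction network can be adapted on text-only data like an ordinary language model.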
Note:
* "WER" stands for "Word Error Rate"
* "WERR" stands for "Word Error Rate Reduction"
* "LibriSpeech" is a dataset of speech recordings
* "TED-LIUM 2" and "AESRC2020" are other datasets of speech recordings