eess.AS - 2023-10-17

Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.11010
  • repo_url: None
  • paper_authors: Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix
  • for: This paper proposes a new shallow fusion (SF) method that exploits an external backward language model (BLM) for end-to-end automatic speech recognition (ASR).
  • methods: (1) during decoding, the BLM is iteratively applied to partial ASR hypotheses in the backward direction, substituting the newly calculated BLM scores for those computed at the previous iteration; (2) a partial sentence-aware BLM (PBLM) is trained on reversed text that includes partial sentences to enhance the effectiveness of this iterative SF (ISF).
  • results: Experiments show that ISF improves ASR performance and prevents early pruning of prospective hypotheses during decoding; combining SF and ISF yields a further performance gain.
    Abstract We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR hypotheses in the backward direction (i.e., from the possible next token to the start symbol) during decoding, substituting the newly calculated BLM scores for the scores calculated at the last iteration. To enhance the effectiveness of this iterative SF (ISF), we train a partial sentence-aware BLM (PBLM) using reversed text data including partial sentences, considering the framework of ISF. In experiments using an attention-based encoder-decoder ASR system, we confirmed that ISF using the PBLM shows comparable performance with SF using the FLM. By performing ISF, early pruning of prospective hypotheses can be prevented during decoding, and we can obtain a performance improvement compared to applying the PBLM as post-processing. Finally, we confirmed that, by combining SF and ISF, further performance improvement can be obtained thanks to the complementarity of the FLM and PBLM.
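    Below is a minimal Python sketch of how the ISF score substitution could be wired into beam-search decoding. The `blm` object with a `log_prob(history, token)` method and the interpolation weight `lam` are illustrative assumptions, not the authors' implementation.

```python
def blm_log_prob(tokens_backward, blm):
    """Score a partial hypothesis with the backward LM, reading it from the
    newest token back toward the start symbol. `blm.log_prob(history, token)`
    is a hypothetical interface standing in for a real LM."""
    history, total = [], 0.0
    for tok in tokens_backward:
        total += blm.log_prob(history, tok)
        history.append(tok)
    return total


def isf_step(partial_hyp, candidate, asr_score, blm, lam=0.3):
    """One ISF update for a beam-search candidate: unlike forward shallow
    fusion, the BLM score of the whole extended hypothesis is recomputed and
    replaces (is substituted for) the BLM score from the previous iteration."""
    extended = partial_hyp + [candidate]
    new_blm_score = blm_log_prob(list(reversed(extended)), blm)
    combined = asr_score + lam * new_blm_score  # asr_score excludes any LM term
    return combined, new_blm_score
```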
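    The PBLM is trained on reversed text that includes partial sentences; the following is one way such training examples could be generated (prefix enumeration is an assumption about the setup, not the paper's exact recipe).

```python
def make_pblm_examples(sentence_tokens):
    """Emit reversed token sequences for the full sentence and for every
    prefix, so the backward LM also sees the kind of incomplete hypotheses
    it must score during ISF. (Illustrative sketch only.)"""
    return [list(reversed(sentence_tokens[:end]))
            for end in range(1, len(sentence_tokens) + 1)]


print(make_pblm_examples(["the", "cat", "sat"]))
# [['the'], ['cat', 'the'], ['sat', 'cat', 'the']]
```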

Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition

  • paper_url: http://arxiv.org/abs/2310.11004
  • repo_url: None
  • paper_authors: Shahram Ghorbani, John H. L. Hansen
  • for: This study aims to improve the accuracy of accent/dialect classification and foreign accentedness assessment for non-native speech.
  • methods: Embeddings from advanced pre-trained language identification (LID) and speaker identification (SID) models are leveraged to improve accent classification and non-native accentedness assessment.
  • results: Pre-trained LID and SID embeddings effectively encode accent/dialect information in speech; combining them with an end-to-end accent identification (AID) model trained from scratch further improves accent identification accuracy.
    Abstract Accurately classifying accents and assessing accentedness in non-native speakers are both challenging tasks due to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pre-trained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pre-trained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information complement an end-to-end accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in accent identification. Next, we investigate leveraging automatic speech recognition (ASR) and accent identification models to explore accentedness estimation. The ASR model is an end-to-end connectionist temporal classification (CTC) model trained exclusively with en-US utterances. The ASR error rate and en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models. Additionally, a robust correlation between the objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of utilizing AID-based and ASR-based systems for accentedness assessment in non-native speech.
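    A minimal PyTorch sketch of the multi-embedding idea, assuming fusion by concatenation of the LID, SID, and scratch-trained AID embeddings; the embedding dimensions, head size, and number of accent classes are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn


class MultiEmbeddingAID(nn.Module):
    """Concatenate pre-trained LID and SID embeddings with an embedding from
    an AID encoder trained from scratch, then classify the accent/dialect.
    Dimensions and the concatenation-based fusion are illustrative choices."""

    def __init__(self, lid_dim=256, sid_dim=192, aid_dim=256, n_accents=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(lid_dim + sid_dim + aid_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_accents),
        )

    def forward(self, lid_emb, sid_emb, aid_emb):
        fused = torch.cat([lid_emb, sid_emb, aid_emb], dim=-1)
        return self.head(fused)  # accent/dialect logits


# Toy usage with random tensors standing in for real model embeddings.
model = MultiEmbeddingAID()
logits = model(torch.randn(4, 256), torch.randn(4, 192), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 8])
```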
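    And a short sketch of the accentedness-scoring side: the en-US-only CTC model's error rate and the AID model's en-US posterior serve as objective accentedness scores, which can then be correlated with each other and with subjective human ratings. The numbers below are toy values, not results from the paper.

```python
import numpy as np
from scipy.stats import pearsonr


def objective_accentedness(asr_wer, aid_en_us_prob):
    """Two objective accentedness proxies per utterance (hypothetical framing):
    - the word error rate of an en-US-only ASR model (higher = more accented)
    - one minus the AID model's en-US posterior (higher = more accented)."""
    return asr_wer, 1.0 - aid_en_us_prob


asr_wer = np.array([0.05, 0.22, 0.40, 0.12, 0.31])    # per-speaker WER (toy values)
aid_en_us = np.array([0.92, 0.55, 0.20, 0.78, 0.35])  # en-US posterior (toy values)
human = np.array([1.0, 3.0, 4.5, 2.0, 4.0])           # subjective ratings (toy values)

asr_score, aid_score = objective_accentedness(asr_wer, aid_en_us)
print("ASR vs AID:", pearsonr(asr_score, aid_score))
print("ASR vs human:", pearsonr(asr_score, human))
print("AID vs human:", pearsonr(aid_score, human))
```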