paper_authors: Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan, Sravan Bodapati, Katrin Kirchhoff
for: Improving performance on automatic speech recognition (ASR) tasks, particularly for atypical, non-native accented speaker populations.
methods: Speech representations are learned from massive unlabeled speech corpora via self-supervised learning, and accent-specific residual adapters are trained to adapt them to non-native accented speakers in a parameter-efficient way.
results: Strong word error rate reductions (WERR) over HuBERT-large across all 4 accents, with a mean WERR of 22.7% using accent-specific adapters (25.1% when the entire encoder is accent-adapted). The proposed approach is also model- and task-agnostic.

Abstract
Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. We experiment with 4 accents and choose automatic speech recognition (ASR) as the downstream task of interest. We obtain strong word error rate reductions (WERR) over HuBERT-large for all 4 accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted. While our experiments utilize HuBERT and ASR as the downstream task, our proposed approach is both model and task-agnostic.
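To make the adapter idea concrete, below is a minimal PyTorch sketch of a residual bottleneck adapter attached to a frozen pretrained encoder layer. The `Adapter` and `AdaptedLayer` classes, the bottleneck dimension of 64, and the use of `nn.TransformerEncoderLayer` as a stand-in for a HuBERT block are illustrative assumptions; the abstract does not specify the paper's exact adapter configuration or placement.

```python
# Sketch of parameter-efficient, accent-specific adaptation via residual
# adapters. Assumption: a standard bottleneck adapter with a skip connection,
# trained while the pretrained backbone stays frozen.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project,
    then add the input back via a skip connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedLayer(nn.Module):
    """Wraps a frozen pretrained encoder layer with a trainable adapter,
    so only the small adapter is updated per accent."""

    def __init__(self, layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.adapter = Adapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


if __name__ == "__main__":
    hidden_dim = 1024  # HuBERT-large hidden size
    # Stand-in for one pretrained encoder block; in practice this would be a
    # transformer layer taken from a loaded HuBERT checkpoint.
    pretrained_layer = nn.TransformerEncoderLayer(
        d_model=hidden_dim, nhead=16, batch_first=True
    )
    adapted = AdaptedLayer(pretrained_layer, hidden_dim)

    x = torch.randn(2, 50, hidden_dim)  # (batch, frames, features)
    out = adapted(x)
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    total = sum(p.numel() for p in adapted.parameters())
    print(out.shape, f"trainable params: {trainable}/{total}")
```

Because only the two small linear projections per layer are trainable, one such adapter set can be stored per accent and swapped in at inference time, which is what makes the adaptation parameter-efficient.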