Abstract
This paper explores the use of ASR-pretrained Conformers for speaker verification, leveraging their strengths in modeling speech signals. We introduce three strategies: (1) Transfer learning to initialize the speaker embedding network, improving generalization and reducing overfitting. (2) Knowledge distillation to train a more flexible speaker verification model, incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight speaker adaptor for efficient feature conversion without altering the original ASR Conformer, allowing parallel ASR and speaker verification. Experiments on VoxCeleb show significant improvements: transfer learning yields a 0.48% EER, knowledge distillation results in a 0.43% EER, and the speaker adaptor approach, adding just 4.92M parameters to a 130.94M-parameter model, achieves a 0.57% EER. Overall, our methods effectively transfer ASR capabilities to speaker verification tasks.
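The auxiliary-task idea in strategy (2) can be sketched as a weighted multi-task objective: a speaker-classification loss plus a frame-level ASR loss. This is a minimal illustrative sketch, not the paper's implementation; the weight `LAMBDA` and all function names are assumptions.

```python
import numpy as np

# Hypothetical weight on the auxiliary frame-level ASR loss
# (the paper does not specify this value here).
LAMBDA = 0.1

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Numerically stable softmax cross-entropy for one example."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])

def joint_loss(spk_logits, spk_label, asr_frame_logits, asr_frame_labels):
    """Speaker loss plus frame-averaged auxiliary ASR loss."""
    spk = cross_entropy(spk_logits, spk_label)
    asr = np.mean([cross_entropy(f, t)
                   for f, t in zip(asr_frame_logits, asr_frame_labels)])
    return spk + LAMBDA * asr

# Toy example: 4 speaker classes, 3 frames over a 5-token ASR vocabulary.
rng = np.random.default_rng(0)
loss = joint_loss(rng.normal(size=4), 2,
                  rng.normal(size=(3, 5)), [1, 0, 3])
print(loss)
```

With uniform logits the two terms reduce to log of the class count, which makes the weighting easy to sanity-check by hand.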