eess.AS - 2023-11-07

Fine-tuning convergence model in Bengali speech recognition

  • paper_url: http://arxiv.org/abs/2311.04122
  • repo_url: None
  • paper_authors: Zhu Ruiying, Shen Meng
  • for: Improving the performance of automatic speech recognition models, particularly for Bengali.
  • methods: Fine-tuning the wave2vec 2.0 pre-trained model, adjusting the learning rate and dropout parameters.
  • results: On the test set, the WER dropped from 0.508 to 0.437; merging the training and validation sets into a single training set further achieved a WER of 0.436.
    Abstract Research on speech recognition has attracted considerable interest due to the difficult task of segmenting uninterrupted speech. Among various languages, Bengali features distinct rhythmic patterns and tones, making it particularly difficult to recognize, and it lacks an efficient commercial recognition method. To improve automatic speech recognition for Bengali, our team chose the wave2vec 2.0 pre-trained model, which had already converged, for fine-tuning. Guided by Word Error Rate (WER), the learning rate and dropout parameters were tuned, and once training was stable, the training set ratio was enlarged, which improved the model's performance. Consequently, the WER improved from 0.508 to 0.437 on the test set of the publicly listed official dataset. Afterwards, the training and validation sets were merged into a comprehensive dataset used as the training set, achieving a remarkable WER of 0.436.
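    The paper's central metric is Word Error Rate, the word-level edit distance between the reference transcript and the model's hypothesis, divided by the reference length. The paper does not show its evaluation code; the following is a minimal, self-contained sketch of the standard WER computation (in practice a library such as jiwer is typically used).

    ```python
    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate: word-level Levenshtein distance / number of reference words."""
        ref = reference.split()
        hyp = hypothesis.split()
        # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                deletion = d[i - 1][j] + 1
                insertion = d[i][j - 1] + 1
                d[i][j] = min(substitution, deletion, insertion)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substitution and one deletion against a 4-word reference: WER = 2/4
    print(wer("a b c d", "a x c"))  # → 0.5
    ```

    A reported drop from 0.508 to 0.437 means roughly 7 fewer word errors per 100 reference words, which is a substantial gain for a low-resource language.
    
    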