Abstract
Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, and it is often used for private communication. The acoustic characteristics of whispered speech differ substantially from those of normally phonated speech, and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To address the data scarcity issue, we use a signal processing-based technique that transforms the spectral characteristics of normal speech into those of pseudo-whispered speech. We augment the training of an end-to-end ASR system with pseudo-whispered speech and achieve an 18.2% relative reduction in word error rate on whispered speech compared to the baseline. Results for the individual speaker groups in the wTIMIT database show the best results for US English speakers. Further investigation showed that the lack of glottal information in whispered speech has the largest impact on whispered speech ASR performance.
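For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative reduction of the kind reported above are typically computed. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the example sentences and values are illustrative only; they are not taken from the paper, which does not state absolute error rates in this abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    wer = word_error_rate("a b c d", "a x c")
    print(f"WER: {wer:.2f}")
    print(f"Relative reduction: {relative_wer_reduction(0.5, 0.25):.1f}%")
```

A relative reduction is measured against the baseline error rate, so it is not the same as an absolute drop in percentage points.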