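results: The study creates a spontaneous-speech ASR benchmark by generating and validating 23 hours of transcripts, and demonstrates the effect of image stimuli on spontaneous speech.
Abstract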
In this paper, we present a 170.83-hour Indian English spontaneous speech dataset. The lack of Indian English speech data is one of the major hindrances to developing robust speech systems adapted to the Indian speech style, and this scarcity is even more pronounced for spontaneous speech. The corpus is crowdsourced across diverse Indian nativities, genders and age groups. Traditional strategies for collecting spontaneous speech capture it during interviews or conversations. In this study, we instead use images as stimuli to induce spontaneity in speech. Transcripts for 23 hours of the corpus are generated and validated, and can serve as a spontaneous speech ASR benchmark. The quality of the corpus is validated with voice activity detection based segmentation, gender verification and image semantic correlation. The last of these determines the relationship between an image stimulus and the recorded speech, using caption keywords derived from an Image2Text model and frequently occurring words derived from Whisper ASR generated transcripts.
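As a rough illustration of the image semantic correlation check, the sketch below scores how many caption keywords also appear among the most frequent words of a transcript. It is a minimal sketch under stated assumptions, not the procedure used for this corpus: the stop-word list, the function names (`tokenize`, `top_words`, `keyword_overlap`), the top-k cutoff and the overlap ratio are all hypothetical, and the caption and transcript strings stand in for the outputs of an Image2Text model and Whisper.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "of", "in", "on", "and", "to",
    "with", "there", "i", "can", "see", "it", "this",
}


def tokenize(text):
    """Lowercase the text and return word tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def top_words(transcript, k=10):
    """Return the set of the k most frequent non-stop words in a transcript."""
    counts = Counter(tokenize(transcript))
    return {word for word, _ in counts.most_common(k)}


def keyword_overlap(caption, transcript, k=10):
    """Fraction of caption keywords found among the top-k transcript words."""
    caption_keywords = set(tokenize(caption))
    if not caption_keywords:
        return 0.0
    matched = caption_keywords & top_words(transcript, k)
    return len(matched) / len(caption_keywords)


# Placeholder strings standing in for an image caption and an ASR transcript.
caption = "a dog playing with a ball in the park"
transcript = (
    "I can see a dog in the park. The dog is running after a red ball. "
    "The park looks green."
)
score = keyword_overlap(caption, transcript)
print(f"caption keyword overlap: {score:.2f}")
```

A score near 1 would suggest that the speaker described the image shown, while a score near 0 would flag a recording for manual review. Exact word matching is a simplification here, since it misses synonyms and inflected forms.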