eess.AS - 2023-07-15

Audio-Visual Speech Enhancement Using Self-supervised Learning to Improve Speech Intelligibility in Cochlear Implant Simulations

  • paper_url: http://arxiv.org/abs/2307.07748
  • repo_url: None
  • paper_authors: Richard Lee Lai, Jen-Cheng Hou, Mandar Gogate, Kia Dashtipour, Amir Hussain, Yu Tsao
  • for: Helping hearing-impaired listeners better understand speech, particularly in noisy environments.
  • methods: Proposes a deep-learning-based audio-visual speech enhancement framework that combines video and audio signals, extracts features with the Transformer-based SSL AV-HuBERT model, and further processes them with a BLSTM-based SE model (a minimal architecture sketch follows the abstract below).
  • results: Experiments show that the proposed method successfully overcomes the limited-training-data problem and improves speech intelligibility across a range of noisy conditions; specifically, PESQ rises from 1.43 to 1.67 and STOI from 0.70 to 0.74. An additional evaluation with CI vocoded speech shows that, under the dynamic noises encountered in human conversation, SSL-AVSE yields NCM improvements of 26.5% to 87.2% over the noisy baseline (a metric-computation sketch closes this entry).
    Abstract Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in enhancing the intelligibility of vocoded speech in cochlear implant (CI) simulations. Notably, the study focuses on a challenging scenario in which training data for the AVSE task is limited. To address this problem, we propose a novel deep neural network framework termed Self-Supervised Learning-based AVSE (SSL-AVSE). The proposed SSL-AVSE combines visual cues, such as lip and mouth movements, from the target speakers with corresponding audio signals. The contextually combined audio and visual data are then fed into a Transformer-based SSL AV-HuBERT model to extract features, which are further processed using a BLSTM-based SE model. The results demonstrate several key findings. Firstly, SSL-AVSE successfully overcomes the issue of limited data by leveraging the AV-HuBERT model. Secondly, fine-tuning the AV-HuBERT model parameters for the target SE task yields significant performance improvements: PESQ (Perceptual Evaluation of Speech Quality) rises from 1.43 to 1.67 and STOI (Short-Time Objective Intelligibility) from 0.70 to 0.74. Furthermore, the performance of SSL-AVSE was evaluated using CI vocoded speech to assess its intelligibility for CI users. Comparative experimental outcomes reveal that, in the presence of dynamic noises encountered during human conversations, SSL-AVSE exhibits a substantial improvement: the NCM (Normalized Covariance Measure) values indicate increases of 26.5% to 87.2% compared to the noisy baseline.
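
    The pipeline described in the abstract lends itself to a compact sketch. The PyTorch code below is a minimal, illustrative reconstruction, assuming a pretrained AV-HuBERT encoder handle and hypothetical module and dimension names (`BLSTMEnhancer`, `feat_dim=768`, mask-based enhancement); it is not the authors' implementation, and no repository is released.

```python
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    """BLSTM-based SE head that maps SSL features to a spectral mask."""
    def __init__(self, feat_dim=768, hidden_dim=256, n_freq_bins=257):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Sigmoid mask over STFT magnitude bins (a common SE formulation;
        # assumed here, not confirmed by the abstract).
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden_dim, n_freq_bins),
                                       nn.Sigmoid())

    def forward(self, av_features):
        # av_features: (batch, frames, feat_dim) from the SSL front end
        h, _ = self.blstm(av_features)
        return self.mask_head(h)          # (batch, frames, n_freq_bins)

class SSLAVSE(nn.Module):
    """Pretrained AV-HuBERT front end + BLSTM SE head (illustrative)."""
    def __init__(self, av_hubert, fine_tune=True):
        super().__init__()
        self.av_hubert = av_hubert        # pretrained Transformer SSL model
        # The paper reports that fine-tuning the SSL parameters on the SE
        # task (rather than freezing them) drives the PESQ/STOI gains.
        for p in self.av_hubert.parameters():
            p.requires_grad = fine_tune
        self.enhancer = BLSTMEnhancer()

    def forward(self, noisy_audio, lip_video, noisy_mag):
        # AV-HuBERT fuses the lip/mouth video with the noisy audio; the
        # exact call signature depends on the checkpoint wrapper used.
        feats = self.av_hubert(audio=noisy_audio, video=lip_video)
        mask = self.enhancer(feats)
        return mask * noisy_mag           # enhanced magnitude spectrogram
```

    The `fine_tune` flag marks the comparison the abstract highlights: leveraging the pretrained AV-HuBERT addresses the limited-data problem, and adapting its weights to the SE task yields the further improvements reported.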
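    For reference, the reported PESQ and STOI figures can be computed with the widely used `pesq` and `pystoi` Python packages. This is a generic evaluation sketch with placeholder file names, not the paper's evaluation script; NCM is omitted because no standard package provides it.

```python
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

# Placeholder file names; the clean reference and enhanced output must
# share the same sampling rate (PESQ wideband mode expects 16 kHz).
ref, fs = sf.read("clean.wav")
deg, _ = sf.read("enhanced.wav")

pesq_score = pesq(fs, ref, deg, "wb")            # paper: 1.43 -> 1.67
stoi_score = stoi(ref, deg, fs, extended=False)  # paper: 0.70 -> 0.74

print(f"PESQ: {pesq_score:.2f}, STOI: {stoi_score:.2f}")
```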