results: The results show that the joint minimum processing framework can improve speech intelligibility while limiting the amount of noise processing, so that speech quality is not unduly degraded in favorable noise conditions.
Abstract
We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions, where intelligibility is already at ceiling level. Recently, [1, 2] proposed to remedy this with a minimum processing framework that reduces noise or enhances listening by only a minimum amount, given that a certain intelligibility criterion is still satisfied. Additionally, it has been shown that jointly considering both environments improves speech enhancement performance. In this paper, we formulate a joint far- and near-end minimum processing framework that improves intelligibility while limiting speech distortions in favorable noise conditions. We provide closed-form solutions for specific boundary scenarios and investigate performance in the general case using numerical optimization. We also show that concatenating existing minimum processing far- and near-end enhancement methods preserves the effects of the individual methods. Results show that the joint optimization can further improve performance compared to the concatenated approach.
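As a rough illustration of the constrained formulation the abstract describes, the sketch below applies a minimum amount of processing subject to an intelligibility surrogate. The per-band gain parameterization, the output-SNR constraint, and all numbers are assumptions made for the example; this is not the authors' formulation.

```python
# Toy "minimum processing" problem: choose per-band far-end gains w and
# near-end gains v that stay as close as possible to "no processing" (gain 1),
# subject to a surrogate intelligibility constraint (a minimum output SNR at
# the near-end listener). All quantities here are invented for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_bands = 8
speech_psd = rng.uniform(0.5, 2.0, n_bands)      # clean speech power per band
far_noise_psd = rng.uniform(0.1, 1.0, n_bands)   # noise picked up at the far end
near_noise_psd = rng.uniform(0.1, 1.0, n_bands)  # noise at the listener's ear
snr_target_db = 5.0                              # assumed intelligibility proxy

def processing_cost(x):
    """Deviation from the unprocessed signal (the minimum-processing objective)."""
    return np.sum((x - 1.0) ** 2)

def output_snr_db(x):
    """SNR at the near end when far-end gains w and near-end gains v are applied."""
    w, v = x[:n_bands], x[n_bands:]
    sig = np.sum((w * v) ** 2 * speech_psd)
    noise = np.sum((w * v) ** 2 * far_noise_psd) + np.sum(near_noise_psd)
    return 10.0 * np.log10(sig / noise + 1e-12)

res = minimize(
    processing_cost,
    x0=np.ones(2 * n_bands),                     # start from "no processing"
    method="SLSQP",
    constraints=[{"type": "ineq",
                  "fun": lambda x: output_snr_db(x) - snr_target_db}],
    bounds=[(0.0, 4.0)] * (2 * n_bands),
)
print(f"achieved SNR: {output_snr_db(res.x):.2f} dB (target {snr_target_db} dB)")
print(f"processing cost (deviation from unity gains): {processing_cost(res.x):.3f}")
```

If the unprocessed SNR already exceeds the target (a favorable noise condition), the solver returns gains close to one, i.e. essentially no processing, which is the behavior the minimum processing idea is after.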
Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
results: Outperforms the baseline model of the COG-MHEAR AVSE Challenge 2023 by a margin of 0.14 in PESQ, performs comparably to a state-of-the-art model, and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.
Abstract
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate the effectiveness of DCUC-Net, as it outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by a notable margin of 0.14 in terms of PESQ. Additionally, the proposed DCUC-Net performs comparably to a state-of-the-art model and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.
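To make the described pipeline easier to picture, here is a heavily simplified, self-contained PyTorch sketch of the same audio-visual pattern: a spectrogram encoder over stacked real/imaginary channels, a ResNet-18 visual branch, a conformer-style fusion block (self-attention plus depthwise convolution), and a mask-producing decoder. It is only a sketch under these assumptions and not the authors' DCUC-Net; in particular, the complex U-Net is replaced by plain convolutions and a single simplified fusion block.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SimpleConformerStyleBlock(nn.Module):
    """Self-attention followed by a depthwise/pointwise convolution, the two
    ingredients the abstract highlights for global and local dependencies."""
    def __init__(self, dim: int, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),  # depthwise
            nn.Conv1d(dim, dim, 1),                                        # pointwise
            nn.SiLU(),
        )

    def forward(self, x):                       # x: (batch, time, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x).transpose(1, 2)       # (batch, dim, time) for Conv1d
        return x + self.conv(h).transpose(1, 2)

class AudioVisualFusionNet(nn.Module):
    """Hypothetical audio-visual enhancement net in the spirit of the abstract."""
    def __init__(self, freq_bins: int = 257, dim: int = 256):
        super().__init__()
        # Complex input handled as 2 channels (real, imag) -- a simplification.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )
        self.audio_proj = nn.Linear(2 * freq_bins, dim)
        vis = resnet18(weights=None)
        vis.fc = nn.Identity()                  # 512-dim per-frame embeddings
        self.visual_enc = vis
        self.visual_proj = nn.Linear(512, dim)
        self.fusion = SimpleConformerStyleBlock(dim)
        # Decoder predicts a 2-channel multiplicative mask over the input
        # spectrogram (a simplification of a complex-valued decoder).
        self.decoder = nn.Linear(dim, 2 * freq_bins)

    def forward(self, spec, frames):
        # spec: (B, 2, F, T) real/imag spectrogram; frames: (B, T, 3, 112, 112)
        b, _, f, t = spec.shape
        a = self.audio_enc(spec).permute(0, 3, 1, 2).reshape(b, t, 2 * f)
        a = self.audio_proj(a)
        v = self.visual_enc(frames.reshape(b * t, 3, 112, 112)).reshape(b, t, 512)
        v = self.visual_proj(v)
        fused = self.fusion(a + v)              # additive audio-visual fusion
        mask = self.decoder(fused).reshape(b, t, 2, f).permute(0, 2, 3, 1)
        return mask * spec                      # masked spectrogram (B, 2, F, T)

model = AudioVisualFusionNet()
out = model(torch.randn(1, 2, 257, 10), torch.randn(1, 10, 3, 112, 112))
print(out.shape)  # torch.Size([1, 2, 257, 10])
```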
Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech
results: A Spearman rank correlation coefficient of 0.537 on the test set and 0.524 on the development set, both higher than the previous fusion method based on monolingual data (0.476 on the test set and 0.470 on the development set).
Abstract
Speech emotion recognition has evolved from research to practical applications. Previous studies of emotion recognition from speech have focused on developing models for particular datasets such as IEMOCAP. The scarcity of data in the domain of emotion modeling makes it challenging to evaluate models on other datasets, as well as to evaluate speech emotion recognition models that work in a multilingual setting. This paper proposes an ensemble learning approach that fuses the results of pre-trained models for emotion share recognition from speech. The models were chosen to accommodate multilingual data from English and Spanish. The results show that ensemble learning improves on both the single-model baseline and the previous best late-fusion model. Performance is measured using the Spearman rank correlation coefficient, since the task is a regression problem over ranked values. A Spearman rank correlation coefficient of 0.537 is reported for the test set and 0.524 for the development set. These scores are higher than those of a previous study using a fusion method on monolingual data, which achieved 0.476 on the test set and 0.470 on the development set.
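A minimal sketch of the late-fusion idea and the Spearman-based scoring follows, with synthetic data standing in for real model outputs; the number of models, the nine emotion-share targets, and the simple mean fusion are assumptions for illustration only, and the printed scores are unrelated to the ones reported above.

```python
# Averaging-ensemble of per-model predictions and Spearman rank evaluation.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_shares = rng.dirichlet(np.ones(9), size=100)        # 100 utterances, 9 emotions
# Stand-ins for predictions of three pre-trained models (noisy copies of truth).
model_preds = [true_shares + 0.1 * rng.normal(size=true_shares.shape)
               for _ in range(3)]

ensemble = np.mean(model_preds, axis=0)                   # simple mean late fusion

def mean_spearman(pred, target):
    """Average Spearman rank correlation over utterances, ranking the emotion
    shares within each utterance (one plausible way to score such a task)."""
    rhos = []
    for p, t in zip(pred, target):
        rho, _ = spearmanr(p, t)
        rhos.append(rho)
    return float(np.mean(rhos))

for k, pred in enumerate(model_preds):
    print(f"model {k}: rho = {mean_spearman(pred, true_shares):.3f}")
print(f"ensemble: rho = {mean_spearman(ensemble, true_shares):.3f}")
```

With this kind of synthetic noise the averaged predictions typically correlate better with the targets than any single model, which is the effect the ensemble is meant to exploit.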
Directional Source Separation for Robust Speech Recognition on Smart Glasses
results: Directional source separation improves speech recognition and speaker change detection accuracy for the wearer, but does not help the conversation partner. Jointly training the directional source separation and ASR models achieves the best overall ASR performance.
Abstract
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcription and captioning services, considerably enriching human experience in daily communication. However, such systems frequently face challenges from environmental noise, which degrades speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using a multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of the speech signals. Beyond relying on predetermined beamformers, we investigate neural beamforming for multi-channel source separation, demonstrating that automatically learning directional characteristics effectively improves separation quality. We further compare ASR performance on the separated outputs against the noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we jointly train the directional source separation and ASR models, achieving the best overall ASR performance.
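For intuition about the predetermined beamformers mentioned as a front-end, here is a basic frequency-domain delay-and-sum beamformer; the four-microphone geometry, look direction, and random input signals are invented for the example and do not reflect the paper's hardware or its neural beamformer.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(mic_signals, mic_positions, look_direction, fs):
    """Align the microphones toward `look_direction` (unit vector from the array
    toward the source) and average, so the target direction adds coherently."""
    n_samples = mic_signals.shape[1]
    advances = mic_positions @ look_direction / SPEED_OF_SOUND  # seconds per mic
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(mic_signals, axis=1)
    # Compensate each microphone's time advance with a phase shift.
    steering = np.exp(-2j * np.pi * np.outer(advances, freqs))
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)

# Toy usage: a 4-microphone array laid out along a glasses frame (positions in
# meters, assumed), steered toward a conversation partner in front of and
# slightly to the right of the wearer (direction assumed).
fs = 16000
mic_positions = np.array([[-0.07, 0.0, 0.0], [-0.02, 0.0, 0.0],
                          [0.02, 0.0, 0.0], [0.07, 0.0, 0.0]])
look_direction = np.array([0.3, 0.9, 0.1])
look_direction = look_direction / np.linalg.norm(look_direction)
mic_signals = np.random.default_rng(0).normal(size=(4, fs))     # 1 s of noise
enhanced = delay_and_sum(mic_signals, mic_positions, look_direction, fs)
print(enhanced.shape)  # (16000,)
```

A neural beamformer, as investigated in the paper, would learn these directional weights from multi-channel data instead of fixing them from the array geometry.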