results: Compared with the equivalent audio-only baseline, the proposed systems reduce the word error rate (WER) by 9.1% and 6.2% absolute, while also improving PESQ, STOI and SRMR scores. Abstract
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of the visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end joint fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on mixtures of overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, yielding 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
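To make the separation front-end concrete, the following is a minimal sketch of mask-based MVDR beamforming of the kind referred to above, not the authors' exact implementation: neural-network-estimated time-frequency masks weight the multi-channel STFT to accumulate speech and noise spatial covariance matrices, and the MVDR filter is derived from them per frequency bin. The array shapes, reference-channel formulation and variable names are illustrative assumptions.

```python
import numpy as np

def mask_based_mvdr(stft_mc, speech_mask, noise_mask, ref_ch=0, eps=1e-8):
    """Illustrative mask-based MVDR beamformer.

    stft_mc:     (C, T, F) complex multi-channel STFT
    speech_mask: (T, F) speech time-frequency mask in [0, 1]
    noise_mask:  (T, F) noise time-frequency mask in [0, 1]
    Returns the beamformed single-channel STFT of shape (T, F).
    """
    C, T, F = stft_mc.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft_mc[:, :, f]                      # (C, T) for this frequency bin
        # Mask-weighted spatial covariance matrices (C x C)
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / (speech_mask[:, f].sum() + eps)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / (noise_mask[:, f].sum() + eps)
        phi_n += eps * np.eye(C)                  # diagonal loading for stability
        # MVDR solution in the reference-channel formulation
        num = np.linalg.solve(phi_n, phi_s)       # phi_n^{-1} phi_s
        w = num[:, ref_ch] / (np.trace(num) + eps)
        out[:, f] = w.conj() @ X                  # apply the beamformer
    return out
```

In the audio-visual systems described above, the masks themselves would be predicted by a network that also consumes the video stream; here they are simply function arguments.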
The Relationship Between Speech Features Changes When You Get Depressed: Feature Correlations for Improving Speed and Performance of Depression Detection
methods: The paper uses two models, SVMs and LSTMs, evaluated on the Androids Corpus dataset of 112 speakers, 58 of whom were diagnosed with depression by professional psychiatrists.
results: The experiments show that feeding the models feature correlation matrices rather than feature vectors improves both training speed and performance, with relative error rate reductions of 23.1%-26.6%. A likely explanation is that feature correlations are more variable for depressed speakers. Abstract
This work shows that depression changes the correlation between features extracted from speech. Furthermore, it shows that using such an insight can improve the training speed and performance of depression detectors based on SVMs and LSTMs. The experiments were performed over the Androids Corpus, a publicly available dataset involving 112 speakers, including 58 people diagnosed with depression by professional psychiatrists. The results show that the models used in the experiments improve in terms of training speed and performance when fed with feature correlation matrices rather than with feature vectors. The relative reduction of the error rate ranges between 23.1% and 26.6% depending on the model. The probable explanation is that feature correlation matrices appear to be more variable in the case of depressed speakers. Correspondingly, such a phenomenon can be thought of as a depression marker.
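As a rough illustration of the core idea (not the authors' exact pipeline), the sketch below turns a sequence of frame-level speech features into a feature correlation matrix and feeds its flattened upper triangle to an SVM; the feature dimensionality, synthetic data and classifier settings are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def correlation_features(frames):
    """frames: (T, D) frame-level features (e.g., MFCCs) for one recording.
    Returns the upper triangle of the D x D Pearson correlation matrix."""
    corr = np.corrcoef(frames, rowvar=False)      # (D, D) feature correlations
    iu = np.triu_indices_from(corr, k=1)          # drop the redundant lower half
    return corr[iu]

# Toy example: 40 recordings, each 200 frames of 20-dim features,
# with binary depression labels (purely synthetic data).
rng = np.random.default_rng(0)
X = np.stack([correlation_features(rng.standard_normal((200, 20))) for _ in range(40)])
y = rng.integers(0, 2, size=40)

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```

The fixed-size correlation matrix replaces a variable-length sequence of feature vectors, which is also why training tends to be faster.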
Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
paper_authors: Zeynep Hilal Kilimci, Ulku Bayraktar, Ayhan Kucukmanisa
for: The paper focuses on speech emotion recognition using deep learning techniques, specifically on the contribution of feeding raw audio files directly into deep neural networks without any feature extraction stage.
methods: The proposed deep models (convolutional neural networks, long short-term memory networks, and a hybrid CNN-LSTM) are compared against traditional feature extraction pipelines combined with machine learning algorithms (support vector machine, decision tree, naive Bayes, random forests) and ensemble learning methods (majority voting and stacking).
results: The proposed approach achieves state-of-the-art performance on six data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. In particular, the CNN model reaches 95.86% accuracy on TESS+RAVDESS, outperforming existing approaches; accuracies on the other data sets range from 69.72% (CREMA) to 99.48% (TESS), depending on the data set and model used. Abstract
Speech emotion recognition is a challenging task in the speech processing field. For this reason, the feature extraction process is of crucial importance for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks without any feature extraction stage for the recognition of emotions, using six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machine, decision tree, naive Bayes, and random forest models are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks, long short-term memory networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, a comparison with state-of-the-art studies is carried out. Based on the experimental results, the CNN model surpasses existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. The proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model in speaker-independent audio categorization problems.
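The sketch below shows one minimal way to feed raw waveforms directly into a 1D convolutional network for emotion classification, in the spirit of the approach described; the layer sizes, the 16 kHz / 3-second input and the number of emotion classes are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Minimal 1D CNN that maps raw waveforms to emotion logits."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                    # pool over time
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wav):                             # wav: (batch, samples)
        x = self.features(wav.unsqueeze(1))             # add channel dim -> (B, 1, N)
        return self.classifier(x.squeeze(-1))           # (B, n_classes)

# 3 seconds of 16 kHz audio for a batch of 2 utterances (random placeholder data).
logits = RawWaveformCNN()(torch.randn(2, 48000))
print(logits.shape)  # torch.Size([2, 7])
```

The point of contrast with the baseline pipelines above is that no spectrogram or MFCC computation happens before the first convolution.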
DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition
results: Experimental results show that the proposed method achieves better performance than the state-of-the-art method. Abstract
Speaker recognition is a biometric modality that utilizes a speaker's speech segments to recognize their identity, determining whether the test speaker belongs to one of the enrolled speakers. In order to improve the robustness of the i-vector framework under cross-channel conditions and explore a novel way of applying deep learning to speaker recognition, stacked auto-encoders are used to obtain an abstract representation of the i-vector instead of applying PLDA. After pre-processing and feature extraction, speaker- and channel-independent speech is employed for UBM training. The UBM is then used to extract the i-vectors of the enrollment and test speech. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimensionality and increase the discrimination between speaker subspaces, this research uses stacked auto-encoders to reconstruct the i-vector with a lower dimension, and different classifiers can be chosen to perform the final classification. The experimental results show that the proposed method achieves better performance than the state-of-the-art method.
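To make the back-end replacement concrete, here is a hedged sketch of the general pattern described: a stacked auto-encoder compresses i-vectors into a lower-dimensional code, and scoring is then performed on the codes. The 400-dimensional i-vectors, layer widths, toy training loop and cosine-similarity scoring are assumptions for illustration only, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Compresses i-vectors to a low-dimensional code and reconstructs them."""
    def __init__(self, ivec_dim=400, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(ivec_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, ivec_dim),
        )

    def forward(self, ivec):
        code = self.encoder(ivec)
        return self.decoder(code), code

# Train to reconstruct i-vectors (random placeholders here), then score a trial
# by cosine similarity between the low-dimensional codes.
model = StackedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ivectors = torch.randn(512, 400)
for _ in range(5):                                # a few toy epochs
    recon, _ = model(ivectors)
    loss = nn.functional.mse_loss(recon, ivectors)
    opt.zero_grad(); loss.backward(); opt.step()

_, enroll = model(torch.randn(1, 400))
_, test = model(torch.randn(1, 400))
print(nn.functional.cosine_similarity(enroll, test).item())
```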
On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation
results: Using our S3RL architecture on an Alexa keyword spotting detection task, the model performed well under both normal and noisy conditions, demonstrating the effectiveness of knowledge distillation for building self-supervised keyword spotting models while operating under on-device memory constraints and biased data collection. Abstract
Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
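The abstract does not spell out the exact form of the dual-view cross-correlation distillation objective, so the sketch below shows one plausible reading: a Barlow Twins-style cross-correlation loss that aligns batch-normalized student and teacher embeddings by pushing their cross-correlation matrix toward the identity. The dimensions, weighting and names are assumptions, and the teacher-codebook term is omitted.

```python
import torch

def cross_correlation_distillation(student_emb, teacher_emb, off_diag_weight=5e-3):
    """Cross-correlation loss between student and teacher embeddings of shape
    (batch, dim): the diagonal of the cross-correlation matrix is pushed toward 1
    and the off-diagonal entries toward 0."""
    s = (student_emb - student_emb.mean(0)) / (student_emb.std(0) + 1e-6)
    t = (teacher_emb - teacher_emb.mean(0)) / (teacher_emb.std(0) + 1e-6)
    c = (s.T @ t) / s.shape[0]                                   # (dim, dim)
    on_diag = ((torch.diagonal(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + off_diag_weight * off_diag

# Toy usage: a frozen teacher and a small student both produce 128-dim embeddings.
student_emb = torch.randn(32, 128, requires_grad=True)
teacher_emb = torch.randn(32, 128)
loss = cross_correlation_distillation(student_emb, teacher_emb)
loss.backward()
print(loss.item())
```

In a real setup the student would be the small on-device encoder and gradients would flow only into its parameters, with the teacher kept frozen.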