eess.AS - 2023-11-22

End-to-end Transfer Learning for Speaker-independent Cross-language Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2311.13678
  • repo_url: None
  • paper_authors: Duowei Tang, Peter Kuppens, Luc Geurts, Toon van Waterschoot
  • for: This work aims to improve the performance of cross-language speech emotion recognition (SER) by proposing a deep neural network (DNN) model based on transfer learning.
  • methods: The wav2vec 2.0 pre-trained model transforms audio time-domain waveforms from different languages, speakers and recording conditions into a feature space shared across languages, reducing language variability. A new Deep Within-Class Co-variance Normalisation (Deep-WCCN) layer, which can be inserted into the DNN model, further reduces speaker, channel and other variabilities (a sketch of the WCCN idea follows this entry). The whole model is fine-tuned end-to-end on a combined loss and validated on datasets in three languages (English, German, Chinese).
  • results: The proposed method not only outperforms a baseline model built on common acoustic feature sets for SER in the within-language setting, but also significantly outperforms it in the cross-language setting. Experiments further validate the effectiveness of Deep-WCCN, which yields an additional performance gain. Finally, compared with recent results reported on the same test sets, the proposed model performs markedly better in cross-language SER.
    Abstract Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are based on general acoustic features or end-to-end approaches, show poor performance when the testing set has a different language than the training set (i.e., the cross-language setting) or comes from a different dataset (i.e., the cross-corpus setting). To alleviate this problem, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing language variability in the speech features. Next, we propose a new Deep Within-Class Co-variance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other variabilities, including speaker variability, channel variability and so on. The whole model is fine-tuned in an end-to-end manner on a combined loss and is validated on datasets from three languages (i.e., English, German, Chinese). Experimental results show that our proposed method not only outperforms the baseline model based on common acoustic feature sets for SER in the within-language setting, but also significantly outperforms the baseline model in the cross-language setting. In addition, we experimentally validate the effectiveness of Deep-WCCN, which can further improve the model performance. Finally, comparing with results in the recent literature that use the same testing datasets, our proposed model shows significantly better performance than other state-of-the-art models in cross-language SER.
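
The paper describes Deep-WCCN only at the level above. As a rough illustration, the following is a minimal PyTorch sketch of batch-wise within-class covariance normalisation (the classical WCCN recipe applied to utterance-level embeddings); the class name, the diagonal-loading constant, and the use of per-batch statistics are assumptions for illustration, not the authors' implementation.

```python
# Rough sketch (not the authors' code): batch-wise within-class covariance
# normalisation applied to utterance-level emotion embeddings.
import torch
import torch.nn as nn


class WithinClassCovNorm(nn.Module):
    """Whitens features with the inverse within-class covariance (classical WCCN)."""

    def __init__(self, dim, eps=1e-3):
        super().__init__()
        self.dim = dim
        self.eps = eps  # diagonal loading for numerical stability (assumed value)

    def forward(self, x, labels):
        # x: (batch, dim) embeddings, labels: (batch,) emotion-class indices
        cov = torch.zeros(self.dim, self.dim, device=x.device)
        classes = labels.unique()
        for c in classes:
            xc = x[labels == c]
            xc = xc - xc.mean(dim=0, keepdim=True)          # centre within the class
            cov = cov + xc.t() @ xc / max(len(xc), 1)
        cov = cov / len(classes) + self.eps * torch.eye(self.dim, device=x.device)
        # B satisfies B @ B.T = cov^{-1}; projecting with B equalises the
        # within-class scatter across feature directions.
        B = torch.linalg.cholesky(torch.linalg.inv(cov))
        return x @ B
```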

Sparsity-Driven EEG Channel Selection for Brain-Assisted Speech Enhancement

  • paper_url: http://arxiv.org/abs/2311.13436
  • repo_url: None
  • paper_authors: Jie Zhang, Qing-Tian Xu, Zhen-Hua Ling
  • for: Improving speech quality for the attended talker in multi-talker conditions, using a brain-assisted speech enhancement network (BASEN) that exploits the listener's electroencephalogram (EEG) signals to implicitly indicate auditory attention.
  • methods: A temporal convolutional network combined with a convolutional multi-layer cross-attention module fuses EEG-audio features; two channel selection methods (residual Gumbel selection and convolutional regularization selection) are proposed to tackle training instability and duplicated channel selections, respectively (see the sketch after this entry).
  • results: Experiments on a public dataset show that the proposed BASEN baseline outperforms existing approaches, and the two channel selection methods greatly reduce the number of informative EEG channels with negligible impact on performance.
    Abstract Speech enhancement is widely used as a front-end to improve speech quality in many audio systems, yet it is still hard to extract the target speech in multi-talker conditions without prior information on the speaker identity. Auditory attention decoding has shown that the attended speaker can be revealed implicitly by the listener's electroencephalogram (EEG). In this work, we therefore propose a novel end-to-end brain-assisted speech enhancement network (BASEN), which incorporates the listener's EEG signals and adopts a temporal convolutional network together with a convolutional multi-layer cross-attention module to fuse EEG-audio features. Considering that an EEG cap with sparse channels exhibits multiple benefits and that in practice many electrodes contribute only marginally, we further propose two channel selection methods, called residual Gumbel selection and convolutional regularization selection, dedicated to tackling the issues of training instability and duplicated channel selections, respectively. Experimental results on a public dataset show the superiority of the proposed BASEN baseline over existing approaches. The proposed channel selection methods can significantly reduce the number of informative EEG channels with a negligible impact on performance.
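
The two channel selection methods are only named above. As a point of reference, the sketch below shows a plain Gumbel-softmax channel selection layer in PyTorch; the class name, shapes and toy usage are illustrative assumptions, and the paper's residual and convolutional-regularization variants (which address training instability and duplicated selections) are not reproduced here.

```python
# Rough sketch (not the authors' code): plain Gumbel-softmax selection of
# K EEG channels out of C, with a straight-through hard sample.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelChannelSelector(nn.Module):
    def __init__(self, num_channels, num_selected, tau=1.0):
        super().__init__()
        # One row of logits per selected "slot", over all EEG channels.
        self.logits = nn.Parameter(torch.zeros(num_selected, num_channels))
        self.tau = tau

    def forward(self, eeg):
        # eeg: (batch, num_channels, time)
        sel = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)  # (K, C) one-hot rows
        return torch.einsum('kc,bct->bkt', sel, eeg)                  # (batch, K, time)


selector = GumbelChannelSelector(num_channels=64, num_selected=8)
reduced = selector(torch.randn(2, 64, 1000))  # -> (2, 8, 1000)
```

Note that nothing in this plain version prevents two slots from picking the same channel, which is exactly the duplication issue the paper's convolutional regularization selection is designed to avoid.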

Performance Analysis Of Binaural Signal Matching (BSM) in the Time-Frequency Domain

  • paper_url: http://arxiv.org/abs/2311.13390
  • repo_url: None
  • paper_authors: Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely
  • for: Studies a binaural reproduction method suited to wearable and mobile arrays, to improve audio quality in virtual reality, teleconferencing and entertainment applications.
  • methods: Uses binaural signal matching (BSM) and analyzes it with a parameterized sound field in the time-frequency domain, making the method more adaptive to dynamic environments (a least-squares sketch of the filter design follows this entry).
  • results: For reverberant speech in a simulated environment, BSM improves the quality of binaural reproduction, and the different parameterization schemes (direct/reverberant decomposition versus the entire sound field) are compared.
    Abstract The capture and reproduction of spatial audio is becoming increasingly popular, with the mushrooming of applications in teleconferencing, entertainment and virtual reality. Many binaural reproduction methods have been developed and studied extensively for spherical and other specially designed arrays. However, the recent increase in popularity of wearable and mobile arrays requires the development of binaural reproduction methods for these arrays. One such method is binaural signal matching (BSM). However, to date this method has only been investigated with fixed matched filters designed for long audio recordings. With the aim of making the BSM method more adaptive to dynamic environments, this paper analyzes BSM with a parameterized sound field in the time-frequency domain. The paper presents results of implementing the BSM method on a sound field decomposed into its direct and reverberant components, and compares this implementation with BSM computed for the entire sound field, evaluating the performance of binaural reproduction of reverberant speech in a simulated environment.
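
For context, BSM designs array filters so that the filtered microphone signals approximate the desired binaural (HRTF-weighted) signals over a set of directions, which reduces to a regularized least-squares problem per frequency bin. The NumPy sketch below shows only that core step; the variable names, Tikhonov regularisation value and toy data are assumptions, and the paper's time-frequency parameterisation and direct/reverberant decomposition are not reproduced.

```python
# Rough sketch (not the authors' code): per-frequency regularised least-squares
# filter design at the heart of binaural signal matching (BSM).
import numpy as np


def bsm_filters(steering, hrtf, lam=1e-3):
    """steering: (num_mics, num_dirs) array transfer functions at one frequency bin.
    hrtf: (num_dirs,) HRTF of one ear at that bin.
    Returns (num_mics,) filters c minimising ||steering.conj().T @ c - hrtf||^2 + lam * ||c||^2."""
    A = steering.conj().T                                  # (num_dirs, num_mics)
    gram = A.conj().T @ A + lam * np.eye(A.shape[1])       # regularised normal equations
    return np.linalg.solve(gram, A.conj().T @ hrtf)


# Toy usage: 5-microphone wearable array, 36 candidate directions, one frequency bin.
rng = np.random.default_rng(0)
V = rng.standard_normal((5, 36)) + 1j * rng.standard_normal((5, 36))
h = rng.standard_normal(36) + 1j * rng.standard_normal(36)
c = bsm_filters(V, h)                                      # complex filters for one ear
```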

Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer

  • paper_url: http://arxiv.org/abs/2311.13075
  • repo_url: None
  • paper_authors: Meng Yu, Dong Yu
  • for: Proposes a simple yet effective field-of-view (FOV) feature for selectively enhancing sound signals from a user-defined angular region.
  • methods: Introduces an FOV feature that amalgamates all directional attributes within the user-defined field, together with a counter-FOV feature capturing directional aspects outside it (see the sketch after this entry).
  • results: Experiments demonstrate the efficacy of the proposed angular FOV feature and its seamless integration into a low-power subband model suited for real-time applications.
    Abstract Audio zooming, a signal processing technique, enables selective focusing on and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV), which defines an angular area. In this paper, we propose a simple yet effective FOV feature, amalgamating all directional attributes within the user-defined field. In conjunction, we introduce a counter-FOV feature capturing directional aspects outside the desired field. Such advancements ensure refined sound capture, particularly emphasizing the FOV's boundaries, and guarantee the enhanced capture of all desired sound sources inside the user-defined field. The results from the experiment demonstrate the efficacy of the introduced angular FOV feature and its seamless incorporation into a low-power subband model suited for real-time applications.
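
As a loose illustration of what an angular FOV feature could look like, the sketch below pools a simple per-direction beam power inside versus outside a user-defined angular field, yielding one in-field (FOV) feature and one counter-FOV feature per frame. The delay-and-sum beam power, far-field steering model and function signature are assumptions for illustration, not the paper's actual feature or subband model.

```python
# Rough sketch (not the paper's feature): pool per-direction beam powers
# inside and outside a user-defined angular field of view (FOV).
import numpy as np


def fov_features(stft_bin, mic_pos, angles_deg, fov_deg, freq_hz, c=343.0):
    """stft_bin: (num_mics, num_frames) complex STFT values at one frequency bin.
    mic_pos: (num_mics, 2) microphone xy positions in metres.
    angles_deg: candidate look directions; fov_deg: (lo, hi) user-defined field."""
    angles_deg = np.asarray(angles_deg, dtype=float)
    angles = np.deg2rad(angles_deg)
    doa = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # (num_dirs, 2) unit vectors
    delays = mic_pos @ doa.T / c                                  # (num_mics, num_dirs) seconds
    steer = np.exp(-2j * np.pi * freq_hz * delays)                # far-field steering vectors
    beams = np.abs(steer.conj().T @ stft_bin) ** 2                # (num_dirs, num_frames) powers
    inside = (angles_deg >= fov_deg[0]) & (angles_deg <= fov_deg[1])
    fov_feat = beams[inside].mean(axis=0)        # amalgamated attributes inside the field
    counter_feat = beams[~inside].mean(axis=0)   # directional aspects outside the field
    return fov_feat, counter_feat
```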