eess.AS - 2023-11-24

Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

  • paper_url: http://arxiv.org/abs/2311.14816
  • repo_url: https://github.com/ETZET/SpeechEmotionAVLearning
  • paper_authors: Enting Zhou, You Zhang, Zhiyao Duan
  • for: The paper aims to predict a continuous, fine-grained arousal-valence (AV) representation of emotion from speech, learned only from categorical emotion labels rather than scarce dimensional annotations.
  • methods: The method first learns a rich, emotion-relevant, high-dimensional speech feature representation via self-supervised pre-training and emotion classification fine-tuning, then maps this representation to the 2D AV space with anchored dimensionality reduction guided by psychological findings (see the sketches after this entry).
  • results: Without using any ground-truth AV annotations during training, the method achieves Concordance Correlation Coefficient (CCC) performance on IEMOCAP comparable to state-of-the-art supervised regression methods; visualizations of AV predictions on the MEAD and EmoDB datasets further show that the learned AV representation is interpretable.
    Abstract Dimensional representations of speech emotions such as the arousal-valence (AV) representation provide a more continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as dynamic emotion understanding and expressive text-to-speech synthesis. Existing methods that predict the dimensional emotion representation from speech cast it as a supervised regression task. These methods face data scarcity issues, as dimensional annotations are much harder to acquire than categorical labels. In this work, we propose to learn the AV representation from categorical emotion labels of speech. We start by learning a rich and emotion-relevant high-dimensional speech feature representation using self-supervised pre-training and emotion classification fine-tuning. This representation is then mapped to the 2D AV space according to psychological findings through anchored dimensionality reduction. Experiments show that our method achieves a Concordance Correlation Coefficient (CCC) performance comparable to state-of-the-art supervised regression methods on IEMOCAP without leveraging ground-truth AV annotations during training. This validates our proposed approach on AV prediction. Furthermore, visualization of AV predictions on MEAD and EmoDB datasets shows the interpretability of the learned AV representations.
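
The digest does not include code, so the following is a minimal sketch of the first stage, assuming a wav2vec 2.0 backbone from HuggingFace transformers with mean pooling over frame-level features; the checkpoint name and pooling choice are illustrative assumptions, not necessarily what the authors use in their repository.

```python
# Sketch: extract an utterance-level embedding from a self-supervised speech model.
# The paper fine-tunes such a backbone on emotion classification first; here we
# only show the feature-extraction step with an assumed pretrained checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return one embedding vector per utterance by mean-pooling frame features."""
    inputs = extractor(waveform_16khz.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, D)
    return hidden.mean(dim=1).squeeze(0)            # shape (D,)
```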
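For the second stage, the digest only names "anchored dimensionality reduction"; one simple reading, sketched below, is to place a 2D AV anchor for each categorical label (following psychological findings such as Russell's circumplex) and fit a least-squares linear map that sends class centroids of the fine-tuned embeddings to their anchors. The anchor coordinates and the least-squares formulation are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: map high-dimensional emotion embeddings to the 2D arousal-valence plane
# using per-class AV anchors. Anchor values below are placeholders.
import numpy as np

AV_ANCHORS = {  # hypothetical (valence, arousal) coordinates per label
    "neutral": (0.0, 0.0),
    "happy":   (0.8, 0.5),
    "sad":     (-0.7, -0.4),
    "angry":   (-0.6, 0.7),
}

def fit_anchored_projection(embeddings: np.ndarray, labels: list[str]) -> np.ndarray:
    """Fit a linear map W (D x 2) so each class centroid lands near its AV anchor."""
    classes = sorted(set(labels))
    labels_arr = np.array(labels)
    centroids = np.stack([embeddings[labels_arr == c].mean(axis=0) for c in classes])
    targets = np.array([AV_ANCHORS[c] for c in classes])       # shape (C, 2)
    W, *_ = np.linalg.lstsq(centroids, targets, rcond=None)    # shape (D, 2)
    return W

def predict_av(embeddings: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project utterance embeddings to (valence, arousal) pairs."""
    return embeddings @ W
```

Under this reading, W would be applied to embeddings of held-out utterances to obtain continuous AV predictions, which could then be scored with CCC against the IEMOCAP annotations as in the experiments above.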