cs.SD - 2023-08-04

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

  • paper_url: http://arxiv.org/abs/2308.02263
  • repo_url: None
  • paper_authors: Jinyu Long, Jetic Gū, Binhao Bai, Zhibo Yang, Ping Wei, Junli Li
  • for: Improving the extraction of clean speech in automated speech processing pipelines, where Transformer models perform well but carry high computational cost and training-data requirements.
  • methods: An improvement termed Spectrum Attention Fusion, which replaces several self-attention layers in a speech Transformer with a carefully constructed convolutional module that fuses spectral features more efficiently (see the sketch after the abstract).
  • results: On the Voice Bank + DEMAND dataset, the proposed model achieves results comparable to or better than SOTA models with a significantly smaller parameter count (0.58M).
    Abstract Speech enhancement is a demanding task in automated speech processing pipelines, focusing on separating clean speech from noisy channels. Transformer-based models have recently bested RNN and CNN models in speech enhancement; however, they are much more computationally expensive and require much more high-quality training data, which is hard to come by. In this paper, we present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity, which we have termed Spectrum Attention Fusion. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features. Our proposed model is able to achieve comparable or better results than SOTA models with significantly fewer parameters (0.58M) on the Voice Bank + DEMAND dataset.
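
Below is a minimal, hypothetical sketch of what a convolutional spectrum-fusion block could look like in place of a self-attention layer. The layer sizes, kernel width, and class name are assumptions for illustration and not the authors' released implementation.

```python
# Hypothetical sketch of a convolutional "spectrum fusion" block standing in for a
# self-attention layer; structure and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrumFusionBlock(nn.Module):
    """Fuses spectral features across time with convolutions instead of self-attention."""

    def __init__(self, channels: int, kernel_size: int = 15):
        super().__init__()
        # Depthwise conv mixes information along the time axis per frequency channel.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise conv mixes information across frequency channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) spectral features from the encoder.
        residual = x
        y = self.depthwise(x.transpose(1, 2))
        y = self.act(self.pointwise(y)).transpose(1, 2)
        return self.norm(residual + y)  # residual keeps the Transformer-style skip path
```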

Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.02190
  • repo_url: https://github.com/jiaxin-ye/emo-dna
  • paper_authors: Jiaxin Ye, Yujie Wei, Xin-Cheng Wen, Chenglong Ma, Zhizhong Huang, Kunhong Liu, Hongming Shan
  • for: Cross-corpus speech emotion recognition (SER): generalizing the ability to infer speech emotion from a well-labeled corpus to an unlabeled one.
  • methods: A new framework named Emotion Decoupling aNd Alignment (EMO-DNA), an unsupervised domain adaptation (UDA) approach that learns emotion-relevant, corpus-invariant features. Its two key components are contrastive emotion decoupling and dual-level emotion alignment (see the sketch after the abstract).
  • results: Experiments show that EMO-DNA outperforms prior methods in several cross-corpus scenarios, with higher accuracy and better consistency.
    Abstract Cross-corpus speech emotion recognition (SER) seeks to generalize the ability of inferring speech emotion from a well-labeled corpus to an unlabeled one, which is a rather challenging task due to the significant discrepancy between two corpora. Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment, but unfortunately, the resulting features are mixed with corpus-specific features or not class-discriminative. To tackle these challenges, we propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, a novel UDA method to learn emotion-relevant corpus-invariant features. The novelties of EMO-DNA are two-fold: contrastive emotion decoupling and dual-level emotion alignment. On one hand, our contrastive emotion decoupling achieves decoupling learning via a contrastive decoupling loss to strengthen the separability of emotion-relevant features from corpus-specific ones. On the other hand, our dual-level emotion alignment introduces an adaptive threshold pseudo-labeling to select confident target samples for class-level alignment, and performs corpus-level alignment to jointly guide model for learning class-discriminative corpus-invariant features across corpora. Extensive experimental results demonstrate the superior performance of EMO-DNA over the state-of-the-art methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
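
Below is a minimal sketch of one ingredient of the dual-level alignment described above: adaptive-threshold pseudo-labeling that selects confident target-corpus samples for class-level alignment. The per-class thresholding rule, function name, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Illustrative adaptive-threshold pseudo-labeling; the thresholding scheme is assumed.
import torch

def select_confident_pseudo_labels(logits: torch.Tensor, base_threshold: float = 0.9):
    """Return pseudo-labels and a mask of target samples confident enough for alignment.

    logits: (num_samples, num_emotions) unnormalized scores from the current model.
    """
    probs = torch.softmax(logits, dim=-1)
    confidence, pseudo_labels = probs.max(dim=-1)
    # Adapt the threshold per class: classes the model is currently weaker on get a
    # proportionally lower bar, so they are not starved of confident target samples.
    per_class_mean = torch.zeros(logits.size(-1))
    for c in range(logits.size(-1)):
        mask_c = pseudo_labels == c
        per_class_mean[c] = confidence[mask_c].mean() if mask_c.any() else 0.0
    thresholds = base_threshold * per_class_mean / per_class_mean.max().clamp(min=1e-8)
    keep = confidence >= thresholds[pseudo_labels]
    return pseudo_labels, keep
```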

Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2308.04517
  • repo_url: None
  • paper_authors: Samiul Islam, Md. Maksudul Haque, Abu Jobayer Md. Sadat
  • for: This study aims to go beyond traditional methods and improve the accuracy of speech emotion recognition.
  • methods: An ensemble model that combines Graph Convolutional Networks (GCN) for textual data with the HuBERT transformer for analyzing audio signals (see the sketch after the abstract).
  • results: GCNs capture long-term contextual relationships and semantic meaning in text, while HuBERT captures the temporal dynamics and subtle variations in speech; combining the two improves emotion recognition accuracy.
    Abstract Traditional approaches in speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP, have limitations such as difficulty capturing long-term dependencies in sequential data, capturing the temporal dynamics, and struggling to capture complex patterns and relationships in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing long-term contextual dependencies and relationships within textual data by leveraging graph-based representations of text and thus detecting the contextual meaning and semantic relationships between words. On the other hand, HuBERT utilizes self-attention mechanisms to capture long-range dependencies, enabling the modeling of temporal dynamics present in speech and capturing subtle nuances and variations that contribute to emotion recognition. By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches. This allows for the simultaneous analysis of multimodal data, and the fusion of these modalities enables the extraction of complementary information, enhancing the discriminative power of the emotion recognition system. The results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
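
Below is a minimal sketch of how a pooled text-graph embedding (e.g. from a GCN over the transcript) and an utterance-level HuBERT embedding could be fused for emotion classification. The dimensions and the concatenation-based fusion head are assumptions for illustration; the paper's exact ensemble architecture may differ.

```python
# Illustrative late-fusion classifier over precomputed text-GCN and HuBERT embeddings;
# dimensions and head layout are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim: int = 256, audio_dim: int = 768, num_emotions: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_emotions),
        )

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim) pooled GCN output over the transcript graph.
        # audio_emb: (batch, audio_dim) mean-pooled HuBERT hidden states for the utterance.
        return self.head(torch.cat([text_emb, audio_emb], dim=-1))
```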

N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets

  • paper_url: http://arxiv.org/abs/2308.02092
  • repo_url: None
  • paper_authors: Wang Yau Li, Shreekantha Nadig, Karol Chang, Zafarullah Mahmood, Riqiang Wang, Simon Vandieken, Jonas Robertson, Fred Mailhot
  • for: Improving recognition of keywords and proper names in business conversations.
  • methods: A two-step keyword boosting mechanism that boosts normalized unigrams and n-grams rather than raw targets, eliminating missing-hit issues (see the sketch after the abstract).
  • results: Relative keyword recognition improves by 26% on a proprietary in-domain dataset and by 2% on LibriSpeech.
    Abstract Accurate transcription of proper names and technical terms is particularly important in speech-to-text applications for business conversations. These words, which are essential to understanding the conversation, are often rare and therefore likely to be under-represented in text and audio training data, creating a significant challenge in this domain. We present a two-step keyword boosting mechanism that successfully works on normalized unigrams and n-grams rather than just single tokens, which eliminates missing hits issues with boosting raw targets. In addition, we show how adjusting the boosting weight logic avoids over-boosting multi-token keywords. This improves our keyword recognition rate by 26% relative on our proprietary in-domain dataset and 2% on LibriSpeech. This method is particularly useful on targets that involve non-alphabetic characters or have non-standard pronunciations.
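
Below is a minimal sketch of building normalized unigram/n-gram boosting targets with a weight rule that damps per-token boosts for multi-token keywords. The normalization, weight split, and example keywords are illustrative assumptions rather than the production boosting logic.

```python
# Illustrative normalized n-gram boosting targets; weighting rule is an assumption.
import re

def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric characters so targets match decoder tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def build_boost_targets(keywords: list[str], base_weight: float = 4.0) -> dict[str, float]:
    """Map each normalized unigram/n-gram to a boosting weight.

    Multi-token keywords share the base weight across their tokens so the full phrase
    is not rewarded more than a single-token keyword of equal importance.
    """
    targets: dict[str, float] = {}
    for kw in keywords:
        norm = normalize(kw)
        tokens = norm.split()
        if not tokens:
            continue
        targets[norm] = base_weight                # boost the full normalized n-gram
        per_token = base_weight / len(tokens)      # damped per-token boost
        for tok in tokens:
            targets[tok] = max(targets.get(tok, 0.0), per_token)
    return targets

# Example (hypothetical keywords): full phrases get the base weight, tokens a fraction.
print(build_boost_targets(["Salesforce", "year-over-year growth"]))
```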