cs.SD - 2023-07-20

Transfer Learning and Bias Correction with Pre-trained Audio Embeddings

  • paper_url: http://arxiv.org/abs/2307.10834
  • repo_url: https://github.com/changhongw/audio-embedding-bias
  • paper_authors: Changhong Wang, Gaël Richard, Brian McFee
  • for: This study investigates the phenomenon of bias propagation in pre-trained audio representations, and how such biases affect instrument recognition in music information retrieval (MIR).
  • methods: Three pre-trained representations (VGGish, OpenL3, and YAMNet) are compared to examine the properties of pre-trained audio embeddings for the task of instrument recognition.
  • results: The three representations achieve comparable performance when constrained to a single dataset but differ in how well they generalize across datasets; dataset identity and genre distribution are identified as potential sources of bias, and post-processing countermeasures are proposed and evaluated to mitigate these effects (a sketch of one such countermeasure follows this entry).
    Abstract Deep neural network models have become the dominant approach to a large variety of tasks within music information retrieval (MIR). These models generally require large amounts of (annotated) training data to achieve high accuracy. Because not all applications in MIR have sufficient quantities of training data, it is becoming increasingly common to transfer models across domains. This approach allows representations derived for one task to be applied to another, and can result in high accuracy with less stringent training data requirements for the downstream task. However, the properties of pre-trained audio embeddings are not fully understood. Specifically, and unlike traditionally engineered features, the representations extracted from pre-trained deep networks may embed and propagate biases from the model's training regime. This work investigates the phenomenon of bias propagation in the context of pre-trained audio representations for the task of instrument recognition. We first demonstrate that three different pre-trained representations (VGGish, OpenL3, and YAMNet) exhibit comparable performance when constrained to a single dataset, but differ in their ability to generalize across datasets (OpenMIC and IRMAS). We then investigate dataset identity and genre distribution as potential sources of bias. Finally, we propose and evaluate post-processing countermeasures to mitigate the effects of bias, and improve generalization across datasets.
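The paper's specific countermeasures are not detailed above, but a common post-processing step against dataset-level bias in pre-trained embeddings is per-dataset standardization before training the downstream classifier. The sketch below is a minimal illustration of that idea under assumed data shapes, not the authors' exact method; the helper names and the use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def fit_per_dataset_scalers(embeddings, dataset_ids):
    """Fit one StandardScaler per source dataset (e.g. OpenMIC, IRMAS)."""
    return {ds: StandardScaler().fit(embeddings[dataset_ids == ds])
            for ds in np.unique(dataset_ids)}

def standardize(embeddings, dataset_ids, scalers):
    """Remove each dataset's own mean/variance so the classifier sees
    embeddings on a comparable scale regardless of their origin."""
    out = np.empty_like(embeddings, dtype=float)
    for ds, scaler in scalers.items():
        mask = dataset_ids == ds
        if mask.any():
            out[mask] = scaler.transform(embeddings[mask])
    return out

# Toy example with random stand-ins for VGGish/OpenL3/YAMNet embeddings.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))
y_train = rng.integers(0, 2, 200)
X_test = rng.normal(loc=0.5, size=(100, 512))
y_test = rng.integers(0, 2, 100)

ids_train = np.array(["openmic"] * 200)
ids_test = np.array(["irmas"] * 100)
scalers = fit_per_dataset_scalers(np.vstack([X_train, X_test]),
                                  np.concatenate([ids_train, ids_test]))

clf = LogisticRegression(max_iter=1000).fit(
    standardize(X_train, ids_train, scalers), y_train)
print("cross-dataset accuracy:",
      clf.score(standardize(X_test, ids_test, scalers), y_test))
```

In practice the arrays would hold embeddings extracted by the VGGish, OpenL3, or YAMNet feature extractors rather than random data.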

Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages

  • paper_url: http://arxiv.org/abs/2307.10814
  • repo_url: None
  • paper_authors: Ephrem Afele Retta, Richard Sutcliffe, Jabar Mahmood, Michael Abebe Berwo, Eiad Almekhlafi, Sajjad Ahmed Khan, Shehzad Ashraf Chaudhry, Mustafa Mhamed, Jun Feng
  • for: This study explores the feasibility of cross-lingual and multilingual speech emotion recognition (SER) when training resources for a target language are scarce.
  • methods: Three classifiers are used: AlexNet, VGGE (a proposed variant of the VGG architecture), and ResNet50. Labels in all datasets are mapped to just two classes (positive and negative) so that performance can be compared directly across languages and languages can be combined for training and testing (a sketch of this setup follows the entry).
  • results: With Amharic as the target language, using English or German as the source gives the best cross-lingual results, and training on two or three non-Amharic languages yields accuracy several percent higher than training on a single non-Amharic language. Overall, cross-lingual and multilingual training is an effective strategy when resources for a language are scarce.
    Abstract In a conventional Speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language does not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German and URDU. For Amharic, we use our own publicly-available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets. We followed previous research in mapping labels for all datasets to just two classes, positive and negative. Thus we can compare performance on different languages directly, and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged for the three models were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. Similarly, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each pair: Amharic<->German, Amharic<->English, and Amharic<->Urdu. Results with Amharic as target suggested that using English or German as source will give the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percent greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training a SER classifier when resources for a language are scarce.
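The exact emotion-to-class mapping is not given above, so the mapping in the sketch below is an illustrative assumption, and the data-handling helpers are likewise hypothetical. It only shows the binary relabeling and an Experiment-2 style cross-lingual split (train on one language, test on another).

```python
from dataclasses import dataclass

# Assumed emotion-to-class mapping; the paper only states that labels are
# mapped to positive and negative.
POSITIVE = {"happy", "calm", "neutral", "surprise"}
NEGATIVE = {"angry", "sad", "fearful", "disgust"}

def to_binary(label: str) -> int:
    """Map a fine-grained emotion label to positive (1) / negative (0)."""
    if label in POSITIVE:
        return 1
    if label in NEGATIVE:
        return 0
    raise ValueError(f"unknown emotion label: {label}")

@dataclass
class Utterance:
    path: str        # audio file path
    label: str       # original emotion label
    language: str    # "amharic", "english", "german", or "urdu"

def cross_lingual_split(data, source_lang, target_lang):
    """Experiment-2 style split: train on the source language,
    test on the target language."""
    train = [(u.path, to_binary(u.label)) for u in data if u.language == source_lang]
    test = [(u.path, to_binary(u.label)) for u in data if u.language == target_lang]
    return train, test

# Example: train on English utterances, test on Amharic ones.
data = [Utterance("ravdess_001.wav", "happy", "english"),
        Utterance("ased_001.wav", "sad", "amharic")]
train, test = cross_lingual_split(data, "english", "amharic")
```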

Perceptual Quality Assessment of Omnidirectional Audio-visual Signals

  • paper_url: http://arxiv.org/abs/2307.10813
  • repo_url: None
  • paper_authors: Xilei Zhu, Huiyu Duan, Yuqin Cao, Yuxin Zhu, Yucheng Zhu, Jing Liu, Li Chen, Xiongkuo Min, Guangtao Zhai
  • for: To assess the quality of omnidirectional videos (ODVs) and thereby improve the user's Quality of Experience (QoE).
  • methods: Three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA) are designed by combining existing state-of-the-art single-mode audio and video quality assessment models via multimodal fusion strategies (a late-fusion sketch follows this entry).
  • results: The effectiveness of the audio-visual fusion approach is validated on a newly built large-scale audio-visual quality assessment dataset, providing a new benchmark for omnidirectional QoE evaluation.
    Abstract Omnidirectional videos (ODVs) play an increasingly important role in the application fields of medical, education, advertising, tourism, etc. Assessing the quality of ODVs is significant for service-providers to improve the user's Quality of Experience (QoE). However, most existing quality assessment studies for ODVs only focus on the visual distortions of videos, while ignoring that the overall QoE also depends on the accompanying audio signals. In this paper, we first establish a large-scale audio-visual quality assessment dataset for omnidirectional videos, which includes 375 distorted omnidirectional audio-visual (A/V) sequences generated from 15 high-quality pristine omnidirectional A/V contents, and the corresponding perceptual audio-visual quality scores. Then, we design three baseline methods for full-reference omnidirectional audio-visual quality assessment (OAVQA), which combine existing state-of-the-art single-mode audio and video QA models via multimodal fusion strategies. We validate the effectiveness of the A/V multimodal fusion method for OAVQA on our dataset, which provides a new benchmark for omnidirectional QoE evaluation. Our dataset is available at https://github.com/iamazxl/OAVQA.
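The paper's three fusion strategies are not spelled out above; the sketch below shows one generic possibility, a learned late fusion that regresses the perceptual audio-visual score from single-mode audio and video quality scores. The Ridge regressor, score ranges, and synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_late_fusion(audio_scores, video_scores, mos):
    """Learn a linear mapping from single-mode quality scores to the
    perceptual audio-visual quality score (MOS)."""
    X = np.column_stack([audio_scores, video_scores])
    return Ridge(alpha=1.0).fit(X, mos)

def predict_av_quality(model, audio_score, video_score):
    return float(model.predict([[audio_score, video_score]])[0])

# Toy data standing in for outputs of single-mode audio/video QA models
# on the 375 distorted A/V sequences.
rng = np.random.default_rng(42)
a = rng.uniform(0, 1, 375)                           # audio quality scores
v = rng.uniform(0, 1, 375)                           # video quality scores
mos = 0.3 * a + 0.7 * v + rng.normal(0, 0.05, 375)   # synthetic MOS

fusion = fit_late_fusion(a, v, mos)
print(predict_av_quality(fusion, audio_score=0.8, video_score=0.6))
```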

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2307.10757
  • repo_url: https://github.com/happycolor/vesper
  • paper_authors: Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu
  • for: This work adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, their efficacy on specific tasks can be further improved, and their considerable size can make them difficult to deploy in practical applications.
  • methods: The paper proposes optimizing a large-scale PTM for a specific task to obtain a task-specific PTM that is both compact and effective. The resulting encoder, Vesper, is pretrained on a speech dataset on top of WavLM and accounts for emotional characteristics: an emotion-guided masking strategy identifies the regions to mask, and hierarchical and cross-layer self-supervision improves the capture of acoustic and semantic representations (a masking sketch follows this entry).
  • results: Experiments on IEMOCAP, MELD, and CREMA-D show that a 4-layer Vesper outperforms the 12-layer WavLM Base, and a 12-layer Vesper surpasses the 24-layer WavLM Large.
    Abstract This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. Above limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.
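The criterion Vesper uses to select masking regions is not described above; one plausible reading of "emotion-guided masking" is to mask frames that are likely to carry emotional cues, approximated here by short-time energy. The sketch below is a rough illustration under that assumption, not the paper's actual procedure.

```python
import numpy as np

def emotion_guided_mask(frame_features, mask_ratio=0.4, rng=None):
    """Select frames to mask with probability proportional to their energy,
    on the assumption that high-energy frames carry more emotional cues.
    frame_features: (T, D) array of frame-level features.
    Returns a boolean mask of shape (T,), True where a frame is masked.
    """
    rng = rng or np.random.default_rng()
    T = frame_features.shape[0]
    energy = (frame_features ** 2).mean(axis=1)
    probs = energy / energy.sum()
    n_mask = int(mask_ratio * T)
    idx = rng.choice(T, size=n_mask, replace=False, p=probs)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    return mask

# Example: mask ~40% of a 100-frame utterance with 768-dim features.
feats = np.random.default_rng(0).normal(size=(100, 768))
print(emotion_guided_mask(feats).sum(), "frames masked")
```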

PAS: Partial Additive Speech Data Augmentation Method for Noise Robust Speaker Verification

  • paper_url: http://arxiv.org/abs/2307.10628
  • repo_url: https://github.com/rst0070/Partial_Additive_Speech
  • paper_authors: Wonbin Kim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Chan-yeong Lim, Ha-Jin Yu
  • for: To improve the noise robustness of speaker verification (SV) systems in noisy environments.
  • methods: A new additive noise data augmentation method, partial additive speech (PAS), is proposed to train SV systems to be less affected by noisy environments (a sketch of partial noise injection follows this entry).
  • results: PAS outperforms conventional additive noise augmentation, with relative equal error rate (EER) improvements of 4.64% and 5.01% on SE-ResNet34 and ECAPA-TDNN, respectively; analyses of attention modules and visualizations of speaker embeddings further confirm its effectiveness.
    Abstract Background noise reduces speech intelligibility and quality, making speaker verification (SV) in noisy environments a challenging task. To improve the noise robustness of SV systems, additive noise data augmentation method has been commonly used. In this paper, we propose a new additive noise method, partial additive speech (PAS), which aims to train SV systems to be less affected by noisy environments. The experimental results demonstrate that PAS outperforms traditional additive noise in terms of equal error rates (EER), with relative improvements of 4.64% and 5.01% observed in SE-ResNet34 and ECAPA-TDNN. We also show the effectiveness of proposed method by analyzing attention modules and visualizing speaker embeddings.
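The precise PAS recipe (segment placement, coverage, SNR sampling) is not given above; the sketch below only illustrates the general idea of adding noise to part of an utterance rather than the whole signal. Parameter choices and function names are assumptions.

```python
import numpy as np

def partial_additive_noise(speech, noise, snr_db, coverage=0.5, rng=None):
    """Add noise at a target SNR to a contiguous portion of the utterance,
    leaving the rest clean.
    speech, noise: 1-D float arrays (noise at least as long as speech).
    """
    rng = rng or np.random.default_rng()
    seg_len = int(coverage * len(speech))
    start = rng.integers(0, len(speech) - seg_len + 1)

    seg = speech[start:start + seg_len]
    noi = noise[:seg_len]

    # Scale the noise so the requested SNR holds within the noisy segment.
    speech_power = np.mean(seg ** 2) + 1e-12
    noise_power = np.mean(noi ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    augmented = speech.copy()
    augmented[start:start + seg_len] = seg + scale * noi
    return augmented

# Example: corrupt half of a 1-second, 16 kHz utterance at 5 dB SNR.
rng = np.random.default_rng(0)
clean = rng.normal(size=16000)
noise = rng.normal(size=16000)
noisy = partial_additive_noise(clean, noise, snr_db=5.0, coverage=0.5, rng=rng)
```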

Transsion TSUP’s speech recognition system for ASRU 2023 MADASR Challenge

  • paper_url: http://arxiv.org/abs/2307.11778
  • repo_url: None
  • paper_authors: Xiaoxiao Li, Gaosheng Zhang, An Zhu, Weiyong Li, Shuming Fang, Xiaoyue Yang, Jianchao Zhu
  • for: This paper presents the speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge, focusing on adapting ASR models to low-resource Indian languages across all four tracks.
  • methods: For tracks 1 and 2, the acoustic model uses a squeezeformer encoder and a bidirectional transformer decoder trained with a joint CTC-Attention loss, with an external KenLM language model applied during TLG beam search decoding; for tracks 3 and 4, pretrained IndicWhisper models are finetuned on the challenge and public datasets, with whisper beam search modified to support an external KenLM language model (a sketch of the joint loss follows this entry).
  • results: The method achieves word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali and 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri across the four tracks, demonstrating its effectiveness.
    Abstract This paper presents a speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge. The system focuses on adapting ASR models for low-resource Indian languages and covers all four tracks of the challenge. For tracks 1 and 2, the acoustic model utilized a squeezeformer encoder and bidirectional transformer decoder with joint CTC-Attention training loss. Additionally, an external KenLM language model was used during TLG beam search decoding. For tracks 3 and 4, pretrained IndicWhisper models were employed and finetuned on both the challenge dataset and publicly available datasets. The whisper beam search decoding was also modified to support an external KenLM language model, which enabled better utilization of the additional text provided by the challenge. The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks. These results demonstrate the effectiveness of the proposed method.
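The joint CTC-Attention objective used for tracks 1 and 2 is a standard hybrid loss: a weighted sum of a CTC loss over encoder outputs and a cross-entropy loss over decoder outputs. The sketch below shows that combination in PyTorch; the 0.3 weight and tensor shapes are illustrative assumptions rather than the system's actual configuration.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_logits, att_logits, targets,
                             input_lengths, target_lengths,
                             ctc_weight=0.3, blank_id=0, pad_id=0):
    """Weighted sum of CTC loss (encoder side) and cross-entropy (decoder side).
      ctc_logits: (T, B, V) frame-level encoder outputs.
      att_logits: (B, L, V) decoder outputs aligned with the target tokens.
      targets:    (B, L) token ids, padded with pad_id.
    """
    ctc = F.ctc_loss(
        ctc_logits.log_softmax(dim=-1), targets,
        input_lengths, target_lengths,
        blank=blank_id, zero_infinity=True,
    )
    ce = F.cross_entropy(
        att_logits.reshape(-1, att_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Toy shapes: 2 utterances, 50 encoder frames, 10 target tokens, vocab of 100.
ctc_logits = torch.randn(50, 2, 100)
att_logits = torch.randn(2, 10, 100)
targets = torch.randint(1, 100, (2, 10))
targets[1, 8:] = 0  # pad the shorter utterance
loss = joint_ctc_attention_loss(
    ctc_logits, att_logits, targets,
    input_lengths=torch.tensor([50, 45]),
    target_lengths=torch.tensor([10, 8]),
)
print(loss.item())
```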