cs.SD - 2023-08-23

Analysis of XLS-R for Speech Quality Assessment

  • paper_url: http://arxiv.org/abs/2308.12077
  • repo_url: https://github.com/lcn-kul/xls-r-analysis-sqa
  • paper_authors: Bastiaan Tamm, Rik Vandenberghe, Hugo Van hamme
  • for: This paper performs an in-depth analysis of speech quality assessment with the aim of improving the automated estimation of speech quality.
  • methods: The authors use pre-trained wav2vec-based XLS-R embeddings and analyze the representations extracted from each layer and from each model size.
  • results: The study finds two optimal regions for feature extraction, one in the lower-level features and one in the high-level features, each capturing different characteristics. The authors also examine the sensitivity of these representations to different levels of corruption and test whether fusing the two feature depths improves MOS prediction performance.
    Abstract In online conferencing applications, estimating the perceived quality of an audio signal is crucial to ensure high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments in the form of the mean opinion score (MOS) metric. However, such an approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance for the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R and also for each size of the model (300M, 1B, 2B parameters). Surprisingly, we find two optimal regions for feature extraction: one in the lower-level features and one in the high-level features. Next, we investigate the reason for the two distinct optima. We hypothesize that the lower-level features capture characteristics of noise and room acoustics, whereas the high-level features focus on speech content and intelligibility. To investigate this, we analyze the sensitivity of the MOS predictions with respect to different levels of corruption in each category. Afterwards, we try fusing the two optimal feature depths to determine if they contain complementary information for MOS prediction. Finally, we compare the performance of the proposed models and assess the generalizability of the models on unseen datasets.
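A minimal sketch of the layer-wise probing the abstract describes, using the Hugging Face transformers checkpoint facebook/wav2vec2-xls-r-300m (the 300M model); the mean-pooling and linear regression head are illustrative assumptions, not the authors' exact MOS predictor:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the 300M XLS-R checkpoint and request all hidden states.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-xls-r-300m", output_hidden_states=True
).eval()

def layerwise_embeddings(waveform_16k: torch.Tensor) -> list[torch.Tensor]:
    """Return one mean-pooled embedding per transformer layer."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors of shape (1, T, D).
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# A small regression head probes one layer's embedding for MOS prediction;
# comparing validation error across layers reveals the two optimal depths.
mos_head = torch.nn.Linear(model.config.hidden_size, 1)
```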

Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2308.11980
  • repo_url: https://github.com/yuanbo2020/hgrl
  • paper_authors: Yuanbo Hou, Siyang Song, Cheng Luo, Andrew Mitchell, Qiaoqiao Ren, Weicheng Xie, Jian Kang, Wenwu Wang, Dick Botteldooren
  • for: This paper explores the relationship between objective audio events and subjective annoyance ratings in a soundscape.
  • methods: The paper proposes a novel hierarchical graph representation learning (HGRL) approach that links objective audio events with subjective annoyance ratings of the soundscape perceived by humans.
  • results: The proposed HGRL approach successfully integrates objective audio events with subjective annoyance ratings for audio event classification (AEC) and annoyance rating prediction (ARP) tasks, and coordinates the relations between coarse-grained and fine-grained audio event information while aligning both with the subjective annoyance ratings.
    Abstract Sound events in daily life carry rich information about the objective world. The composition of these sounds affects the mood of people in a soundscape. Most previous approaches only focus on classifying and detecting audio events and scenes, but may ignore their perceptual quality that may impact humans' listening mood for the environment, e.g. annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show the proposed HGRL successfully integrates AE with AR for AEC and ARP tasks, while coordinating the relations between cAE and fAE and further aligning the two different grains of AE information with the AR.
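A rough sketch of the hierarchy the abstract describes: fine-grained event (fAE) nodes, coarse-grained event (cAE) nodes, and an annoyance-rating (AR) node exchange information through graph message passing. The dimensions, adjacency, and single GRU-based update below are illustrative assumptions, not the released HGRL architecture:

```python
import torch
import torch.nn as nn

class HierarchicalGraphLayer(nn.Module):
    """One message-passing step over fAE, cAE, and AR node embeddings."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) embeddings; adj: (N, N) row-normalized adjacency
        # linking fAE -> cAE -> AR according to the hierarchy.
        messages = adj @ self.msg(nodes)
        return self.update(messages, nodes)

# Toy graph: 8 fine-grained events, 3 coarse-grained events, 1 AR node.
n_fae, n_cae = 8, 3
n_nodes = n_fae + n_cae + 1
nodes = torch.randn(n_nodes, 128)
adj = torch.zeros(n_nodes, n_nodes)
adj[n_fae:n_fae + n_cae, :n_fae] = 1.0 / n_fae   # cAE nodes aggregate fAE nodes
adj[-1, n_fae:n_fae + n_cae] = 1.0 / n_cae       # AR node aggregates cAE nodes
updated = HierarchicalGraphLayer()(nodes, adj)

# Separate heads read off the two tasks from the updated embeddings.
aec_head = nn.Linear(128, n_fae)   # audio event classification (AEC)
arp_head = nn.Linear(128, 1)       # annoyance rating prediction (ARP)
```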

CED: Consistent ensemble distillation for audio tagging

  • paper_url: http://arxiv.org/abs/2308.11957
  • repo_url: https://github.com/richermans/ced
  • paper_authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang
  • for: Improve performance and reduce model size on audio tagging tasks
  • methods: Augmentation and knowledge distillation (KD), combined in a simple training framework called consistent teaching
  • results: A 10M-parameter model achieving 49.0 mean average precision (mAP) on the AudioSet (AS) benchmark; pretrained models and code are available via the GitHub link above
    Abstract Augmentation and knowledge distillation (KD) are well-established techniques employed in the realm of audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature, meaning that only the stored logits are used for the optimization of a student model only requiring 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
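A hedged sketch of the consistent-teaching idea: ensemble logits are precomputed and stored together with the augmentation seed that produced each input, so the student is later optimized against exactly the same augmented waveform the teachers scored. The augment helper is hypothetical, and the multi-label BCE-style distillation loss is an assumption for AudioSet-style tagging:

```python
import torch
import torch.nn.functional as F

def precompute(teachers, dataset, store):
    """Offline pass: save the augmentation seed and averaged teacher logits."""
    for idx, wav in enumerate(dataset):
        seed = torch.randint(0, 2**31, (1,)).item()
        aug = augment(wav, seed)                       # hypothetical augmenter
        with torch.no_grad():
            logits = torch.stack([t(aug) for t in teachers]).mean(0)
        store[idx] = {"seed": seed, "logits": logits}  # e.g. a memory-mapped file

def distill_step(student, dataset, store, idx):
    """Label-free student update against the stored ensemble logits."""
    entry = store[idx]
    aug = augment(dataset[idx], entry["seed"])         # replay the SAME augmentation
    loss = F.binary_cross_entropy_with_logits(
        student(aug), torch.sigmoid(entry["logits"])   # soft multi-label targets
    )
    loss.backward()
    return loss
```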

Example-Based Framework for Perceptually Guided Audio Texture Generation

  • paper_url: http://arxiv.org/abs/2308.11859
  • repo_url: None
  • paper_authors: Purnima Kamath, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara
  • for: Control the semantic attributes of generated audio textures
  • methods: Use automatically synthesized examples to infer guidance vectors for user-defined semantic attributes in the latent space of a generative model
  • results: The method finds perceptually relevant and deterministic guidance vectors for controllable generation of both discrete and continuous textures, and can be applied to other tasks such as selective semantic attribute transfer
    Abstract Generative models for synthesizing audio textures explicitly encode controllability by conditioning the model with labelled data. While datasets for audio textures can be easily recorded in-the-wild, semantically labeling them is expensive, time-consuming, and prone to errors due to human annotator subjectivity. Thus, to control generation, there is a need to automatically infer user-defined perceptual factors of variation in the latent space of a generative model while modelling unlabeled textures. In this paper, we propose an example-based framework to determine vectors to guide texture generation based on user-defined semantic attributes. By synthesizing a few synthetic examples to indicate the presence or absence of a semantic attribute, we can infer the guidance vectors in the latent space of a generative model to control that attribute during generation. Our results show that our method is capable of finding perceptually relevant and deterministic guidance vectors for controllable generation for both discrete as well as continuous textures. Furthermore, we demonstrate the application of this method to other tasks such as selective semantic attribute transfer.
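A minimal sketch of the example-based guidance idea, assuming latents of a few synthetic examples that do and do not exhibit an attribute are available; the unit-norm difference-of-means direction and the generator interface are illustrative, not the paper's implementation:

```python
import torch

def infer_guidance_vector(z_with: torch.Tensor, z_without: torch.Tensor) -> torch.Tensor:
    """Difference of mean latents between examples with and without the attribute."""
    direction = z_with.mean(dim=0) - z_without.mean(dim=0)
    return direction / direction.norm()

def guided_generate(generator, z: torch.Tensor, direction: torch.Tensor, scale: float):
    """Shift a latent along the attribute direction before decoding."""
    return generator(z + scale * direction)

# Usage: latents of a few synthetic examples that do / do not exhibit the
# attribute (e.g. the rate of a dripping-water texture). Placeholders here.
z_with = torch.randn(5, 128)
z_without = torch.randn(5, 128)
v = infer_guidance_vector(z_with, z_without)
```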

  • paper_url: http://arxiv.org/abs/2308.11773
  • repo_url: None
  • paper_authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium
  • For: The paper aims to identify specific speech topics that may indicate depression severity, using natural language processing on smartphone-collected speech recordings.
  • Methods: The study uses the Whisper tool and the BERTopic model to analyze 3919 smartphone-collected speech recordings from 265 participants, and compares behavioral (wearable-derived) and linguistic characteristics across the identified topics to elucidate their associations with depression.
  • Results: Six speech topics (No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework) are associated with high depression severity, and topic shifts correlate with changes in depression severity over time. Applying the BERTopic model to a smaller dataset (356 recordings from 57 participants) yields some consistent results.
    Abstract Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. So, natural language processing has been employed on social media to predict depression, but limitations remain: lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.
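A hedged sketch of the data-driven workflow the abstract outlines: transcribe recordings with openai-whisper, cluster the transcripts with BERTopic, and flag topics whose median PHQ-8 is at least 10. File handling is simplified, and BERTopic requires a large document collection in practice:

```python
import pandas as pd
import whisper
from bertopic import BERTopic

def find_risk_topics(paths: list[str], phq8_scores: list[int], phq8_cutoff: int = 10):
    """Transcribe recordings, cluster transcripts into topics, and flag
    topics whose median PHQ-8 meets the cutoff (a sketch of the workflow)."""
    # 1. Transcribe each recording to text.
    asr = whisper.load_model("base")
    texts = [asr.transcribe(p)["text"] for p in paths]

    # 2. Cluster the transcripts into topics (needs many documents in practice).
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(texts)

    # 3. Flag risk topics: median PHQ-8 >= cutoff.
    df = pd.DataFrame({"topic": topics, "phq8": phq8_scores})
    medians = df.groupby("topic")["phq8"].median()
    return medians[medians >= phq8_cutoff].index.tolist(), topic_model
```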