cs.SD - 2023-08-17

Severity Classification of Parkinson’s Disease from Speech using Single Frequency Filtering-based Features

  • paper_url: http://arxiv.org/abs/2308.09042
  • repo_url: None
  • paper_authors: Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
  • for: This study aims to propose a new objective method for assessing the severity of Parkinson's disease (PD), in order to improve diagnosis and treatment.
  • methods: The study uses the single frequency filtering (SFF) method to derive two sets of novel features: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from SFF (MFCC-SFF). SFF offers higher spectro-temporal resolution than the short-time Fourier transform, and the features are evaluated on three speaking tasks (vowels, sentences, text reading).
  • results: Experiments show that the proposed features outperform conventional MFCCs in all three speaking tasks, with SFFCC giving relative improvements of 5.8%, 7.0%, and 2.4% (and MFCC-SFF 2.3%, 1.8%, and 1.1%) over MFCCs on the vowel, sentence, and read-text tasks, respectively (a hedged feature-extraction sketch follows this entry).
    Abstract Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving the diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF (MFCC-SFF) for the severity classification of PD. Prior studies have demonstrated that SFF offers greater spectro-temporal resolution compared to the short-time Fourier transform. The study uses the PC-GITA database, which includes speech of PD patients and healthy controls produced in three speaking tasks (vowels, sentences, text reading). Experiments using the SVM classifier revealed that the proposed features outperformed the conventional MFCCs in all three speaking tasks. The proposed SFFCC and MFCC-SFF features gave a relative improvement of 5.8% and 2.3% for the vowel task, 7.0% and 1.8% for the sentence task, and 2.4% and 1.1% for the read text task, in comparison to MFCC features.
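Since the abstract only names the SFF-based features, the following is a minimal sketch of how SFF-style cepstral coefficients might be computed: each analysis frequency is isolated by a complex frequency shift followed by a single-pole filter, and the log envelopes are turned into cepstral coefficients with a DCT across frequency. The pole radius, frequency grid, hop size, and number of coefficients are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.fftpack import dct

def sff_envelopes(x, fs, freqs, r=0.995):
    """Single-frequency-filtered amplitude envelopes (simplified sketch).

    Each analysis frequency f_k is shifted near fs/2 by complex modulation,
    then the signal is passed through the single-pole filter
    H(z) = 1 / (1 + r z^-1); the magnitude of the complex output is the
    envelope at f_k.
    """
    n = np.arange(len(x))
    envs = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        w = np.pi - 2.0 * np.pi * f / fs            # shifted angular frequency
        shifted = x * np.exp(1j * w * n)            # complex frequency shift
        y = lfilter([1.0], [1.0, r], shifted)       # single-pole filtering
        envs[k] = np.abs(y)
    return envs

def sffcc(x, fs, n_freqs=40, n_ceps=13, hop=160):
    """SFFCC-like features: log SFF envelopes + DCT across frequency."""
    freqs = np.linspace(100.0, fs / 2.0 - 100.0, n_freqs)
    envs = sff_envelopes(np.asarray(x, dtype=float), fs, freqs)
    log_env = np.log(envs[:, ::hop] + 1e-10)        # sample at the frame rate
    return dct(log_env, type=2, axis=0, norm="ortho")[:n_ceps].T  # (T, n_ceps)

# usage: coeffs = sffcc(signal, 16000)  -> per-frame SFFCC-like features
```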

Home monitoring for frailty detection through sound and speaker diarization analysis

  • paper_url: http://arxiv.org/abs/2308.08985
  • repo_url: None
  • paper_authors: Yannis Tevissen, Dan Istrate, Vincent Zalc, Jérôme Boudy, Gérard Chollet, Frédéric Petitpont, Sami Boutamine
  • for: This paper studies how to build a reliable and privacy-preserving home monitoring system through recognition of human and everyday-life sounds and detection of speech presence and the number of speakers.
  • methods: The paper applies recent advances in sound processing and speaker diarization to improve the existing embedded system.
  • results: The study finds that DNN-based approaches improve performance by about 100% (a clustering-based speaker-counting sketch follows this entry).
    Abstract As the French, European and worldwide populations are aging, there is strong interest in new systems that guarantee reliable and privacy-preserving home monitoring for frailty prevention. This work is part of a global environmental audio analysis system which aims to help identify Activities of Daily Life (ADL) through recognition of human and everyday life sounds, speech presence detection, and detection of the number of speakers. The focus here is on detecting the number of speakers. In this article, we present how recent advances in sound processing and speaker diarization can improve the existing embedded systems. We study the performance of two new methods and discuss the benefits of DNN-based approaches, which improve performance by about 100%.
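The abstract does not detail the two new methods, so the following is only a hedged sketch of a common building block for number-of-speakers detection: agglomerative clustering of per-segment speaker embeddings, with the number of clusters taken as the speaker count. The cosine-distance threshold and the use of scipy's hierarchical clustering are illustrative choices, not the paper's pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def count_speakers(embeddings, distance_threshold=0.45):
    """Estimate the number of speakers by agglomerative clustering of
    per-segment speaker embeddings (cosine distance, average linkage).

    `embeddings` is an (n_segments, dim) array, e.g. speaker embeddings
    extracted from segments found by a voice-activity detector.
    """
    emb = np.asarray(embeddings, dtype=float)
    if len(emb) < 2:
        return len(emb)
    Z = linkage(emb, method="average", metric="cosine")   # hierarchical tree
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    return int(labels.max())                               # labels start at 1

# toy usage: two well-separated "speakers" in embedding space
rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 0.05, size=(10, 64)) + 1.0
spk_b = rng.normal(0.0, 0.05, size=(8, 64)) - 1.0
print(count_speakers(np.vstack([spk_a, spk_b])))           # -> 2
```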

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

  • paper_url: http://arxiv.org/abs/2308.08926
  • repo_url: None
  • paper_authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
  • for: Improve speech perceptual quality and intelligibility.
  • methods: Propose MP-SENet, a model that enhances speech by explicitly estimating the magnitude and phase spectra in parallel.
  • results: Achieves high-quality speech enhancement across multiple tasks, including denoising, dereverberation, and bandwidth extension, and, unlike existing phase-aware methods, avoids the compensation effect between magnitude and phase (a sketch of an anti-wrapping phase loss follows this entry).
    Abstract Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we propose MP-SENet, a novel Speech Enhancement Network which explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers along both time and frequency dimensions. The encoder aims to encode time-frequency representations derived from the input distorted magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a phase parallel estimation architecture, respectively. To train the MP-SENet model effectively, we define multi-level loss functions, including mean square error and perceptual metric loss of magnitude spectra, anti-wrapping loss of phase spectra, as well as mean square error and consistency loss of short-time complex spectra. Experimental results demonstrate that our proposed MP-SENet excels in high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it successfully avoids the bidirectional compensation effect between the magnitude and phase, leading to a better harmonic restoration. Notably, for the speech denoising task, the MP-SENet yields a state-of-the-art performance with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset.
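One concrete piece of the training objective is the anti-wrapping loss on phase spectra. The sketch below assumes the anti-wrapping function f(x) = |x − 2π·round(x/2π)| and applies it to the phase error and its finite differences along frequency and time, following the same authors' related neural phase prediction work; the exact terms used in MP-SENet may differ.

```python
import numpy as np

def antiwrap(x):
    """Anti-wrapping function f(x) = |x - 2*pi*round(x / (2*pi))|.

    Maps any phase error onto [0, pi], so the loss is invariant to
    2*pi phase wrapping.
    """
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def phase_losses(phase_pred, phase_true):
    """Anti-wrapped phase losses on a (frames, freq_bins) phase spectrogram.

    Penalizes the raw phase error plus its finite differences along frequency
    (group-delay-like) and along time (instantaneous-frequency-like).
    """
    err = phase_pred - phase_true
    ip_loss = antiwrap(err).mean()                      # instantaneous phase
    gd_loss = antiwrap(np.diff(err, axis=1)).mean()     # group delay
    iaf_loss = antiwrap(np.diff(err, axis=0)).mean()    # angular frequency
    return ip_loss + gd_loss + iaf_loss

# toy check: shifting the prediction by exactly 2*pi leaves the loss unchanged
rng = np.random.default_rng(1)
p_true = rng.uniform(-np.pi, np.pi, size=(50, 257))
p_pred = p_true + 0.1
assert np.isclose(phase_losses(p_pred, p_true),
                  phase_losses(p_pred + 2 * np.pi, p_true))
```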

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

  • paper_url: http://arxiv.org/abs/2308.08850
  • repo_url: https://github.com/yangai520/lfs-nspp
  • paper_authors: Yang Ai, Ye-Xin Lu, Zhen-Hua Ling
  • for: Improve the accuracy of speech phase prediction so that long-frame-shift phase spectra can be predicted precisely from amplitude-related features.
  • methods: Propose a neural long-frame-shift speech phase prediction method (LFS-NSPP) with three stages: interpolation, prediction, and decimation. Long-frame-shift log amplitude spectra are first converted to short-frame-shift log amplitude spectra by frequency-by-frequency interpolation; an NSPP model then predicts short-frame-shift phase spectra, compensating for interpolation errors; finally, long-frame-shift phase spectra are obtained from the short-frame-shift ones by frame-by-frame decimation (a sketch of the interpolation and decimation stages follows this entry).
  • results: Experiments show that LFS-NSPP predicts long-frame-shift phase spectra with higher quality than the original NSPP model and other signal-processing-based phase estimation algorithms.
    Abstract Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance the spectral continuity, and then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Ultimately, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method can yield superior quality in predicting long-frame-shift phase spectra than the original NSPP model and other signal-processing-based phase estimation algorithms.
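The interpolation and decimation stages are simple enough to sketch directly; the NSPP prediction stage in the middle is a neural model and only appears here as a hypothetical `nspp_model` call. Linear interpolation along the time axis of each frequency bin is an assumption — the paper's exact frequency-by-frequency interpolation scheme may differ.

```python
import numpy as np

def interpolate_frames(log_amp_long, ratio):
    """Frequency-by-frequency interpolation along the time axis.

    Upsamples a (T_long, F) long-frame-shift log amplitude spectrogram to a
    frame shift `ratio` times smaller, interpolating each frequency bin
    independently (linear interpolation as a simple stand-in).
    """
    t_long = np.arange(log_amp_long.shape[0]) * ratio
    t_short = np.arange(t_long[-1] + 1)
    return np.stack([np.interp(t_short, t_long, log_amp_long[:, f])
                     for f in range(log_amp_long.shape[1])], axis=1)

def decimate_frames(phase_short, ratio):
    """Frame-by-frame decimation: keep every `ratio`-th short-shift frame."""
    return phase_short[::ratio]

# pipeline sketch (the NSPP model itself is a separate neural network):
#   log_amp_short = interpolate_frames(log_amp_long, ratio)
#   phase_short   = nspp_model(log_amp_short)     # hypothetical predictor
#   phase_long    = decimate_frames(phase_short, ratio)
```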

META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2308.08847
  • repo_url: None
  • paper_authors: Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark D. Plumbley, Jun Yang
  • for: Improve the performance of learning-based sound event localization and detection (SELD) methods across different acoustic environments.
  • methods: Apply meta-learning so that the SELD model can adapt quickly to new environments.
  • results: Experiments show that Meta-SELD adapts to new environments more effectively and efficiently than conventional fine-tuning (a first-order MAML sketch follows this entry).
    Abstract For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, class types of sound events, and direction-of-arrival (DOA) of sound sources is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning methods to achieve fast adaptation to new environments. More specifically, based on Model Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters to adapt to new environments with only a small number of samples and parameter updating iterations. We can then quickly adapt the meta-trained SELD model to unseen environments. Our experiments compare fine-tuning methods from pre-trained SELD models with our Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.
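Meta-SELD is built on MAML, so a compact way to see the idea is a first-order MAML loop on a toy model: the inner loop adapts a copy of the meta-parameters on a task's (environment's) support set, and the outer loop updates the meta-initialization using the query-set gradient at the adapted parameters. The toy linear-regression tasks, learning rates, and first-order approximation below are illustrative stand-ins for the actual SELD network and data.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """MSE loss and its gradient for a linear model y ~ X @ w."""
    err = X @ w - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def fomaml(tasks, dim, inner_lr=0.05, outer_lr=0.01, inner_steps=3, epochs=200):
    """First-order MAML: meta-learn an initialization that adapts to a new
    task (here, a new acoustic environment) in a few gradient steps."""
    meta_w = np.zeros(dim)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        X_s, y_s, X_q, y_q = tasks[rng.integers(len(tasks))]   # sample a task
        w = meta_w.copy()
        for _ in range(inner_steps):                 # inner loop: adapt on
            _, g = loss_and_grad(w, X_s, y_s)        # the task's support set
            w -= inner_lr * g
        _, g_q = loss_and_grad(w, X_q, y_q)          # outer loop: query-set
        meta_w -= outer_lr * g_q                     # gradient at adapted w
    return meta_w

# toy tasks: each "environment" is a linear mapping with its own weights
rng = np.random.default_rng(1)
def make_task():
    w_true = rng.normal(size=4)
    X = rng.normal(size=(32, 4))
    y = X @ w_true
    return X[:16], y[:16], X[16:], y[16:]

meta_init = fomaml([make_task() for _ in range(20)], dim=4)
```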

Graph Neural Network Backend for Speaker Recognition

  • paper_url: http://arxiv.org/abs/2308.08767
  • repo_url: None
  • paper_authors: Liang He, Ruida Li, Mengqi Niu
  • for: Improve speaker recognition accuracy.
  • methods: Use a graph neural network (GNN) backend to mine latent relationships among embeddings in the low-dimensional space.
  • results: Achieves significant performance improvements on the NIST SRE14 i-vector challenge task and the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets (a message-passing sketch follows this entry).
    Abstract Currently, most speaker recognition backends, such as cosine, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by calculating similarity or distance between enrollment and test embeddings which are already extracted from neural networks. However, for each embedding, the local structure of itself and its neighbor embeddings in the low-dimensional space is different, which may be helpful for the recognition but is often ignored. In order to take advantage of it, we propose a graph neural network (GNN) backend to mine latent relationships among embeddings for classification. We assume all the embeddings as nodes on a graph, and their edges are computed based on some similarity function, such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore variants of GNN to find a better message passing and aggregation way to accomplish the recognition task. Experimental results on the NIST SRE14 i-vector challenge, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN backends significantly outperform current mainstream methods.
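The abstract describes the backend as a graph whose nodes are embeddings and whose edges come from a similarity function, refined by message passing. Below is a hedged numpy sketch of that construction with one untrained GCN-style propagation step; the k-NN graph, self-loops, and symmetric normalization are illustrative choices rather than the paper's exact architecture.

```python
import numpy as np

def cosine_adjacency(emb, k=5):
    """Build a k-nearest-neighbour graph over embeddings using cosine
    similarity; returns a symmetric adjacency matrix with self-loops."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    A = np.zeros_like(sim)
    for i in range(len(sim)):
        nbrs = np.argsort(sim[i])[::-1][1:k + 1]     # k most similar nodes
        A[i, nbrs] = sim[i, nbrs]
    A = np.maximum(A, A.T)                           # symmetrize
    A += np.eye(len(A))                              # self-loops
    return A

def gcn_layer(A, H, W):
    """One GCN-style propagation step: H' = relu(D^-1/2 A D^-1/2 H W)."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A @ d_inv_sqrt @ H @ W, 0.0)

# sketch: refine speaker embeddings with one message-passing layer, then
# score enrollment/test pairs with cosine similarity on the refined nodes
rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 64))                      # enrollment + test nodes
A = cosine_adjacency(emb, k=4)
refined = gcn_layer(A, emb, rng.normal(size=(64, 64)) * 0.1)  # untrained W
```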

The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

  • paper_url: http://arxiv.org/abs/2308.08766
  • repo_url: None
  • paper_authors: Ze Li, Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li
  • for: This paper is the system description of the DKU-MSXF system for Track 1, Track 2, and Track 3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23).
  • methods: For Track 1, a ResNet-based network is trained and a cross-age QMF training set is constructed, which substantially improves system performance. For Track 2, the pre-trained Track 1 model is further trained on a mix that incorporates the VoxBlink-clean dataset. For Track 3, a novel pseudo-labeling method based on triple thresholds and sub-center purification is adopted for semi-supervised domain adaptation (a QMF-style calibration sketch follows this entry).
  • results: Compared with Track 1, the models incorporating VoxBlink-clean data improve performance by more than 10% relative. The final submission achieves an mDCF of 0.1243 in Track 1, an mDCF of 0.1165 in Track 2, and an EER of 4.952% in Track 3.
    Abstract This paper is the system description of the DKU-MSXF System for Track 1, Track 2 and Track 3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). For Track 1, we utilize a network structure based on ResNet for training. By constructing a cross-age QMF training set, we achieve a substantial improvement in system performance. For Track 2, we inherit the pre-trained model from Track 1 and conduct mixed training by incorporating the VoxBlink-clean dataset. In comparison to Track 1, the models incorporating VoxBlink-clean data exhibit a relative performance improvement of more than 10%. For Track 3, the semi-supervised domain adaptation task, a novel pseudo-labeling method based on triple thresholds and sub-center purification is adopted to perform domain adaptation. The final submission achieves an mDCF of 0.1243 in Track 1, an mDCF of 0.1165 in Track 2 and an EER of 4.952% in Track 3.
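The Track 1 system relies on a cross-age QMF (quality measure function) training set. In speaker verification, a QMF is typically a calibration model that maps the raw verification score plus per-trial quality measures to a calibrated score; the sketch below uses logistic regression with made-up quality measures (segment durations, an age-gap estimate) purely for illustration — the system's actual measures and calibration recipe are not given in the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_qmf(raw_scores, quality, labels):
    """Fit a QMF-style calibration model.

    raw_scores: cosine scores of trials, shape (n_trials,).
    quality:    per-trial quality measures, shape (n_trials, n_q),
                e.g. log durations of both sides and an age-gap estimate
                (illustrative choices, not the system's exact measures).
    labels:     1 for target trials, 0 for non-target trials.
    """
    X = np.column_stack([raw_scores, quality])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def apply_qmf(model, raw_scores, quality):
    """Calibrated score = log-odds of the target class under the QMF model."""
    return model.decision_function(np.column_stack([raw_scores, quality]))

# usage sketch on synthetic trials
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
raw = rng.normal(loc=labels.astype(float), scale=0.7)        # noisy scores
qual = np.column_stack([rng.uniform(1, 5, 500),               # log duration A
                        rng.uniform(1, 5, 500),               # log duration B
                        rng.uniform(0, 30, 500)])             # age gap (years)
qmf = train_qmf(raw, qual, labels)
calibrated = apply_qmf(qmf, raw, qual)
```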

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.08713
  • repo_url: https://github.com/95anantsingh/decoding-emotions
  • paper_authors: Anant Singh, Akshat Gupta
  • for: Evaluate transformer-based speech representation models for speech emotion recognition (SER) across multiple languages and examine their internal representations.
  • methods: Benchmark eight speech representation models on six languages, including probing experiments that examine the inner workings of these models for SER.
  • results: Using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets and achieves state-of-the-art results for German and Persian. Probing shows that the middle layers of speech models capture the most important emotional information (a layer-wise probing sketch follows this entry).
    Abstract Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, there has been limited research conducted on evaluating these models for speech emotion recognition (SER) across multiple languages and examining their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets when compared to systems where features from all layers of speech models are used. We also achieve state-of-the-art results for German and Persian languages. Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
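The probing result (middle layers carry the most emotion information) suggests a simple layer-selection recipe: extract frozen features from every layer of the pretrained speech model, fit a lightweight classifier per layer, and keep the best layer. The sketch below assumes the layer-wise features (e.g., mean-pooled hidden states of a pretrained speech encoder) are already extracted; the classifier and cross-validation setup are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(layer_feats, labels):
    """Layer-wise probing: fit a linear classifier on each layer's features.

    layer_feats: list of (n_utterances, dim) arrays, one per model layer
    (assumed to be extracted elsewhere). Returns per-layer accuracy and the
    index of the best layer.
    """
    accs = []
    for feats in layer_feats:
        clf = LogisticRegression(max_iter=1000)
        accs.append(cross_val_score(clf, feats, labels, cv=5).mean())
    accs = np.array(accs)
    return accs, int(accs.argmax())

# synthetic check: one "middle" layer is deliberately made most informative
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)                 # 4 emotion classes
layers = []
for li in range(12):
    signal = 1.0 if li == 6 else 0.2                  # layer 6 is most useful
    feats = rng.normal(size=(200, 32))
    feats[:, 0] += signal * labels                    # inject class signal
    layers.append(feats)
accs, best = probe_layers(layers, labels)
print(best, accs[best])
```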