cs.SD - 2023-09-03

NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.01212
  • repo_url: None
  • paper_authors: Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou
  • for: This paper aims to improve speech enhancement (SE), i.e., removing background noise from noisy speech signals.
  • methods: The approach combines diffusion models (DM) with a generator-plus-conditioner (GPC) structure, together with multi-stage frameworks.
  • results: Experiments show that the proposed noise-aware diffusion-based speech enhancement model (NADiffuSE) improves denoising performance while maintaining good audio quality across different noise conditions.
    Abstract The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-conditioner (GPC) structures, and multi-stage frameworks. We focus on the first two approaches, which are constructed under the GPC architecture and use the task-adapted diffusion process to better deal with real noise. However, the performance of these SE models is limited by the following issues: (a) non-Gaussian noise estimation in the task-adapted diffusion process; (b) conditional domain bias caused by the weak conditioner design in the GPC structure; (c) a large amount of residual noise caused by unreasonable interpolation operations during inference. To solve the above problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to boost SE performance, where the noise representation is extracted from the noisy speech signal and introduced as global conditional information for estimating the non-Gaussian components. Furthermore, an anchor-based inference algorithm is employed to achieve a compromise between speech distortion and noise residual. To mitigate the performance degradation caused by the conditional domain bias in the GPC framework, we investigate three model variants, all of which can be viewed as multi-stage SE based on preprocessing networks for Mel spectrograms. Experimental results show that NADiffuSE outperforms other DM-based SE models under the GPC infrastructure. Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/.
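To make the noise-aware conditioning idea concrete, below is a minimal PyTorch sketch (the paper releases no code; repo_url is None) of one way a global noise embedding, pooled from the noisy input, could modulate a diffusion denoiser via FiLM-style scaling so the network can account for non-Gaussian noise components. All module names, dimensions, and the FiLM-style fusion are illustrative assumptions, not the authors' architecture.

```python
# Sketch of noise-aware global conditioning for a diffusion SE denoiser.
# Names, shapes, and the FiLM-style modulation are assumptions for
# illustration only, not the authors' implementation.
import torch
import torch.nn as nn

class NoiseEncoder(nn.Module):
    """Pools the noisy-speech spectrogram into one global noise embedding."""
    def __init__(self, n_mels=80, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, noisy_mel):                # (B, n_mels, T)
        return self.net(noisy_mel).mean(dim=-1)  # (B, embed_dim)

class ConditionedDenoiser(nn.Module):
    """Toy denoiser whose hidden features are FiLM-modulated by the noise
    embedding, letting the network adapt to the non-Gaussian noise."""
    def __init__(self, n_mels=80, hidden=256, embed_dim=128):
        super().__init__()
        self.inp = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.t_proj = nn.Linear(hidden, hidden)        # timestep embedding
        self.film = nn.Linear(embed_dim, 2 * hidden)   # scale and shift
        self.out = nn.Conv1d(hidden, n_mels, kernel_size=3, padding=1)

    def forward(self, x_t, t_embed, noise_embed):
        h = self.inp(x_t) + self.t_proj(t_embed).unsqueeze(-1)
        scale, shift = self.film(noise_embed).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        return self.out(torch.relu(h))                 # predicted diffusion noise

# Toy forward pass: checks shapes only, no trained weights.
enc, den = NoiseEncoder(), ConditionedDenoiser()
noisy = torch.randn(4, 80, 200)        # noisy Mel spectrograms
x_t = torch.randn(4, 80, 200)          # diffused state at step t
t_emb = torch.randn(4, 256)            # timestep embedding (assumed precomputed)
eps_hat = den(x_t, t_emb, enc(noisy))  # (4, 80, 200)
```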

MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling

  • paper_url: http://arxiv.org/abs/2309.01142
  • repo_url: None
  • paper_authors: Zhichao Wang, Xinsheng Wang, Qicong Xie, Tao Li, Lei Xie, Qiao Tian, Yuping Wang
  • for: This paper targets the voice conversion (VC) task, aiming to preserve the speaking style of the source speech while achieving high-quality conversion.
  • methods: The paper proposes a multi-scale style modeling method (MSM-VC) that models the speaking style of the source speech at the frame, local, and global levels, using prosodic features, pre-trained ASR bottleneck features, and self-supervised features, respectively. An explicit constraint module is further introduced to balance source style modeling against preservation of the target speaker's timbre.
  • results: Experiments show that MSM-VC effectively models the source speech style while maintaining good speech quality and speaker similarity.
    Abstract In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted from source speech to model the speaking style of source speech, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the multi-scale nature of style in human speech, a multi-scale style modeling method for the VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC models the speaking style of source speech at different levels. To effectively convey the speaking style and meanwhile prevent timbre leakage from source speech to converted speech, each level's style is modeled by a specific representation. Specifically, prosodic features, a pre-trained ASR model's bottleneck features, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame-, local-, and global-level styles, respectively. Besides, to balance the performance of source style modeling and target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced to MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during training to improve the disentanglement ability and alleviate the mismatch between training and inference. Experiments performed on a highly expressive speech corpus demonstrate that MSM-VC is superior to state-of-the-art VC methods for modeling source speech style while maintaining good speech quality and speaker similarity.
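As a rough illustration of the multi-scale style conditioning described above, the following PyTorch sketch fuses frame-level prosodic features, local-level ASR bottleneck features, and a global-level style embedding into a single conditioning sequence for a hypothetical VC decoder. The dimensions, the additive fusion, and the GRU are assumptions for illustration only; the paper publishes no code (repo_url is None), so this is not the authors' implementation.

```python
# Sketch of multi-scale style fusion in the spirit of MSM-VC.
# Feature extractors are assumed to run upstream; all dimensions and the
# additive-fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleStyleFusion(nn.Module):
    """Fuses frame-, local-, and global-level style features with content
    features into one conditioning sequence for a VC decoder."""
    def __init__(self, prosody_dim=4, bottleneck_dim=256,
                 global_dim=192, content_dim=256, hidden=256):
        super().__init__()
        self.frame_proj = nn.Linear(prosody_dim, hidden)     # frame level: prosody
        self.local_proj = nn.Linear(bottleneck_dim, hidden)  # local level: ASR bottleneck
        self.global_proj = nn.Linear(global_dim, hidden)     # global level: SSL embedding
        self.content_proj = nn.Linear(content_dim, hidden)
        self.fuse = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, content, prosody, bottleneck, global_style):
        # content, prosody, bottleneck: (B, T, dim), assumed time-aligned;
        # global_style: (B, dim), broadcast over all T frames.
        h = (self.content_proj(content)
             + self.frame_proj(prosody)
             + self.local_proj(bottleneck)
             + self.global_proj(global_style).unsqueeze(1))
        out, _ = self.fuse(h)
        return out                      # (B, T, hidden) decoder conditioning

# Toy forward pass with random, time-aligned features.
fusion = MultiScaleStyleFusion()
cond = fusion(
    torch.randn(2, 100, 256),  # content features
    torch.randn(2, 100, 4),    # frame-level prosodic features (e.g. F0, energy)
    torch.randn(2, 100, 256),  # local-level ASR bottleneck features
    torch.randn(2, 192),       # global-level style embedding
)
print(cond.shape)              # torch.Size([2, 100, 256])
```

In the paper, an explicit constraint module (a pre-trained speech emotion recognition model plus a speaker classifier) further supervises such conditioning to trade off style transfer against timbre leakage; that training-time machinery is omitted from this sketch.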