cs.SD - 2023-10-22

An overview of text-to-speech systems and media applications

  • paper_url: http://arxiv.org/abs/2310.14301
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Mohammad Reza Hasanabadi
  • for: This paper focuses on the research and development of Text-To-Speech (TTS) systems.
  • methods: The paper describes the key components of TTS systems, including text analysis, acoustic modelling and vocoding, and reviews the deep learning approaches applied to each.
  • results: The paper details the design of several state-of-the-art TTS systems, including Tacotron 2, Transformer TTS, WaveNet and FastSpeech 1, which perform strongly on the subjective Mean Opinion Score (MOS) metric. The discussion section offers suggestions for developing a TTS system suited to the intended application.
    Abstract Producing synthetic voice, similar to human-like sound, is an emerging novelty of modern interactive media systems. Text-To-Speech (TTS) systems try to generate synthetic and authentic voices via text input. Besides, well-known and familiar dubbing, announcing and narrating voices, as valuable possessions of any media organization, can be kept forever by utilizing TTS and Voice Conversion (VC) algorithms. The emergence of deep learning approaches has made such TTS systems more accurate and accessible. To understand TTS systems better, this paper investigates the key components of such systems including text analysis, acoustic modelling and vocoding. The paper then provides details of important state-of-the-art TTS systems based on deep learning. Finally, a comparison is made between recently released systems in terms of backbone architecture, type of input and conversion, vocoder used and subjective assessment (MOS). Accordingly, Tacotron 2, Transformer TTS, WaveNet and FastSpeech 1 are among the most successful TTS systems ever released. In the discussion section, some suggestions are made to develop a TTS system with regard to the intended application.
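    As a rough illustration of the three components the paper surveys (text analysis, acoustic model, vocoder), the sketch below wires them together as placeholder functions. All names, shapes and defaults are hypothetical stand-ins, not taken from any specific system in the survey.

    ```python
    # Conceptual three-stage TTS pipeline: text analysis -> acoustic model -> vocoder.
    # Every function here is a stub chosen only to show the data flow and shapes.
    import numpy as np

    def text_analysis(text: str) -> list[str]:
        """Front end: normalise the text and map it to phoneme-like tokens (stubbed)."""
        return list(text.lower().strip())

    def acoustic_model(tokens: list[str], n_mels: int = 80) -> np.ndarray:
        """Acoustic model: predict a mel spectrogram from the token sequence.
        A real system (e.g. Tacotron 2 or FastSpeech) uses a neural network;
        this stub just returns a spectrogram of plausible shape."""
        frames_per_token = 5
        return np.zeros((n_mels, frames_per_token * len(tokens)), dtype=np.float32)

    def vocoder(mel: np.ndarray, sr: int = 22050, hop: int = 256) -> np.ndarray:
        """Vocoder: convert the mel spectrogram to a waveform.
        A real system uses WaveNet, HiFi-GAN, etc.; this stub outputs silence
        of the corresponding duration."""
        return np.zeros(mel.shape[1] * hop, dtype=np.float32)

    if __name__ == "__main__":
        mel = acoustic_model(text_analysis("Hello world"))
        audio = vocoder(mel)
        print(mel.shape, audio.shape)  # e.g. (80, 55) and (14080,)
    ```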

MFCC-GAN Codec: A New AI-based Audio Coding

  • paper_url: http://arxiv.org/abs/2310.14300
  • repo_url: None
  • paper_authors: Mohammad Reza Hasanabadi
  • for: This paper concerns AI-based audio coding using MFCC features in an adversarial setting.
  • methods: A conventional encoder is combined with an adversarially trained decoder to better reconstruct the original waveform. Because GANs provide implicit density estimation, such models are less prone to overfitting.
  • results: MFCCGAN_36k and MFCCGAN_13k reach high SNR and NISQA-MOS scores, outperforming five well-known codecs (AAC, AC3, Opus, Vorbis and Speex). MFCCGAN_13k matches the SNR of AC3_128k and AAC_112k while operating at a far lower bitrate.
    Abstract In this paper, we propose AI-based audio coding using MFCC features in an adversarial setting. We combine a conventional encoder with an adversarial learning decoder to better reconstruct the original waveform. Since GANs provide implicit density estimation, such models are less prone to overfitting. We compared our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis, and Speex, performing on bitrates from 2 kbps to 128 kbps. MFCCGAN_36k achieved the state-of-the-art result in terms of SNR despite a lower bitrate in comparison to AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k. On the other hand, MFCCGAN_13k also achieved a high SNR of 27, equal to that of AC3_128k and AAC_112k, while having a significantly lower bitrate (13 kbps). MFCCGAN_36k achieved higher NISQA-MOS results compared to AAC_48k while having a 20% lower bitrate. Furthermore, MFCCGAN_13k obtained NISQA-MOS = 3.9, which is much higher than AAC_24k, AAC_32k, AC3_32k, and AAC_48k. For future work, we suggest adopting loss functions optimizing intelligibility and perceptual metrics in the MFCCGAN structure to improve quality and intelligibility simultaneously.
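    To make the codec structure concrete, the sketch below shows the general idea of an MFCC-based neural codec: the encoder transmits a compact MFCC representation, and a learned (GAN-trained) decoder reconstructs the waveform. This is a generic illustration under assumed defaults (13 MFCCs, a synthetic sine input, a hypothetical `generator` model), not the authors' MFCCGAN implementation.

    ```python
    # Encoder side of an MFCC-based codec plus a placeholder GAN decoder call.
    import numpy as np
    import librosa

    def encode_mfcc(y: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
        """Conventional encoder: reduce the waveform to an MFCC matrix."""
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    def decode_with_generator(mfcc: np.ndarray, generator) -> np.ndarray:
        """Adversarially trained decoder: a GAN generator maps MFCC frames back
        to audio samples. `generator` stands in for a trained model."""
        return generator(mfcc)

    if __name__ == "__main__":
        sr = 16000
        y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # 1 s test tone
        feats = encode_mfcc(y, sr)
        print(feats.shape)  # (13, n_frames): far fewer values than the raw samples
    ```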

Diffusion-Based Adversarial Purification for Speaker Verification

  • paper_url: http://arxiv.org/abs/2310.14270
  • repo_url: None
  • paper_authors: Yibo Bai, Xiao-Lei Zhang
  • for: Improving the security and reliability of automatic speaker verification (ASV) systems against adversarial attacks.
  • methods: A denoising diffusion model is proposed to purify adversarial examples: controlled noise is first injected, then a reverse denoising process reconstructs the clean audio.
  • results: Experiments show that the proposed method effectively strengthens ASV security while minimizing the distortion introduced into the purified audio.
    Abstract Automatic speaker verification (ASV) based on deep learning is vulnerable to adversarial attacks, a new type of attack that injects imperceptible perturbations into audio signals so as to make ASV produce wrong decisions. This poses a significant threat to the security and reliability of ASV systems. To address this issue, we propose a Diffusion-Based Adversarial Purification (DAP) method that enhances the robustness of ASV systems against such adversarial attacks. Our method leverages a conditional denoising diffusion probabilistic model to effectively purify the adversarial examples and mitigate the impact of perturbations. DAP first introduces controlled noise into adversarial examples, and then performs a reverse denoising process to reconstruct clean audio. Experimental results demonstrate the efficacy of the proposed DAP in enhancing the security of ASV while minimizing the distortion of the purified audio signals.
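    The purification recipe described in the abstract (noise in, denoise out) follows the usual diffusion pattern. The sketch below is a minimal DDPM-style version under assumed choices: a linear noise schedule, a hypothetical trained `denoiser`, and a stopping step `t_star`; it is not the authors' conditional model.

    ```python
    # Conceptual diffusion-based adversarial purification: push the (possibly
    # adversarial) audio partway into the diffusion noise process, then run the
    # learned reverse process to recover clean audio.
    import numpy as np

    def purify(x_adv: np.ndarray, denoiser, t_star: int = 50, T: int = 1000) -> np.ndarray:
        betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
        alphas_bar = np.cumprod(1.0 - betas)

        # Forward step: inject controlled Gaussian noise up to step t_star.
        noise = np.random.randn(*x_adv.shape)
        x_t = np.sqrt(alphas_bar[t_star]) * x_adv + np.sqrt(1 - alphas_bar[t_star]) * noise

        # Reverse steps: the trained denoiser gradually removes the noise,
        # ideally stripping the adversarial perturbation along with it.
        for t in range(t_star, 0, -1):
            eps_hat = denoiser(x_t, t)            # predicted noise at step t
            alpha_t = 1.0 - betas[t]
            x_t = (x_t - betas[t] / np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
            if t > 1:
                x_t = x_t + np.sqrt(betas[t]) * np.random.randn(*x_t.shape)
        return x_t
    ```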

First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation

  • paper_url: http://arxiv.org/abs/2310.14173
  • repo_url: None
  • paper_authors: Hejing Zhang, Qiaoxi Zhu, Jian Guan, Haohe Liu, Feiyang Xiao, Jiantong Tian, Xinhao Mei, Xubo Liu, Wenwu Wang
  • for: This paper addresses the problem that anomalous sound data for the target machines is unavailable during training, in particular in the first-shot task.
  • methods: A new framework is proposed in which metadata-assisted audio generation is used to estimate unknown anomalies: the available machine information (metadata and sound data) is used to fine-tune a text-to-audio generation model so that it generates anomalous sounds with acoustic characteristics specific to each machine type.
  • results: The proposed FS-TWFR-GMM method achieves competitive performance among the top systems in DCASE 2023 Challenge Task 2 while requiring only 1% of the model parameters for detection, demonstrating its feasibility for the first-shot task.
    Abstract First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging to adapt existing ASD methods to the first-shot task. In this paper, we propose a new framework for first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies, by utilising the available machine information (i.e., metadata and sound data) to fine-tune a text-to-audio generation model for generating anomalous sounds that contain unique acoustic characteristics accounting for each different machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve the first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% of the model parameters for detection, as validated in our experiments.
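    For intuition about the GMM-based detection backbone, the sketch below scores clips by how unlikely their frequency-domain features are under a mixture model fitted on normal machine sounds. The time-weighted representation of the paper is simplified here to a plain mean magnitude spectrum, and all parameter choices are assumptions; generated anomalies could in principle be used to calibrate the decision threshold.

    ```python
    # Simplified GMM-based anomaly scoring in the spirit of TWFR-GMM.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mean_spectrum(clip: np.ndarray, n_fft: int = 512) -> np.ndarray:
        """Crude stand-in feature: average magnitude spectrum over fixed frames."""
        frames = clip[: len(clip) // n_fft * n_fft].reshape(-1, n_fft)
        return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

    def fit_normal_model(normal_clips: list[np.ndarray], k: int = 4) -> GaussianMixture:
        """Fit a Gaussian mixture on features of normal sounds only."""
        X = np.stack([mean_spectrum(c) for c in normal_clips])
        return GaussianMixture(n_components=k, covariance_type="diag").fit(X)

    def anomaly_score(gmm: GaussianMixture, clip: np.ndarray) -> float:
        """Higher score = less likely under the normal-sound model."""
        return float(-gmm.score_samples(mean_spectrum(clip)[None, :])[0])
    ```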