cs.SD - 2023-11-25

Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

  • paper_url: http://arxiv.org/abs/2311.14957
  • repo_url: None
  • paper_authors: Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu
  • for: This study improves the discriminator of Generative Adversarial Network (GAN) based vocoders, a family of models already valued for fast inference and high synthesis quality.
  • methods: The proposed method uses the Constant-Q Transform (CQT) instead of the Short-Time Fourier Transform (STFT) that most time-frequency discriminators rely on. The STFT has a fixed time-frequency resolution, whereas the CQT's resolution varies with frequency, which helps model pitch accuracy and harmonic tracking. The proposed Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves.
  • results: Experiments on both speech and singing voices confirm the effectiveness of the proposed method. The CQT-based and STFT-based discriminators are also complementary under joint training: with both the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN rises from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
    Abstract Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to advance GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it ill-suited to signals like singing voices that require flexible attention for different frequency bands. Motivated by that, our study utilizes the Constant-Q Transform (CQT), which has dynamic resolution across frequencies, contributing to a better modeling ability in pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed method. Moreover, we also verified that the CQT-based and the STFT-based discriminators could be complementary under joint training. Specifically, enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
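    Summary GAN-based vocoders perform well in both inference speed and synthesis quality, and this study focuses on improving the discriminator to raise their performance further. Most existing time-frequency discriminators are based on the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed, so they cannot adapt to signals such as singing voices that need flexible attention across frequency bands. To address this, the study uses the Constant-Q Transform (CQT), whose resolution varies with frequency, which improves the modeling of pitch accuracy and harmonic tracking. Specifically, the authors propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments on both speech and singing voices confirm the effectiveness of the method. The authors also show that the CQT-based and STFT-based discriminators are complementary under joint training: with both, the MOS of HiFi-GAN rises from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.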
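
The frequency-dependent resolution of the CQT and the octave-wise sub-band split described above can be illustrated with a short sketch. It uses the standard CQT definitions (geometrically spaced center frequencies and a constant quality factor Q) and is not the authors' implementation. The sample rate, minimum frequency, number of octaves, and the bins-per-octave values used as "scales" are illustrative assumptions, and the helper names `cqt_bins` and `split_by_octave` are hypothetical.

```python
import math


def cqt_bins(f_min, n_octaves, bins_per_octave, sample_rate):
    """Return a list of (center_frequency_hz, window_length_samples) per CQT bin."""
    # Q is the ratio of center frequency to bandwidth; it is the same for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(n_octaves * bins_per_octave):
        # Center frequencies are geometrically spaced: each octave doubles f_k.
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        # The analysis window shrinks as frequency grows, so low bins get fine
        # frequency resolution and high bins get fine time resolution.
        n_k = math.ceil(q * sample_rate / f_k)
        bins.append((f_k, n_k))
    return bins


def split_by_octave(bins, bins_per_octave):
    """Group consecutive bins into one sub-band per octave."""
    return [bins[i:i + bins_per_octave] for i in range(0, len(bins), bins_per_octave)]


# Illustrative settings only; none of these values come from the text above.
SAMPLE_RATE = 24000          # Hz (assumed)
F_MIN = 32.70                # Hz, roughly the note C1 (assumed)
N_OCTAVES = 8                # top bin stays below the Nyquist frequency (assumed)
SCALES = (24, 36, 48)        # bins per octave, one value per "scale" (assumed)

if __name__ == "__main__":
    for bins_per_octave in SCALES:
        bins = cqt_bins(F_MIN, N_OCTAVES, bins_per_octave, SAMPLE_RATE)
        sub_bands = split_by_octave(bins, bins_per_octave)
        lowest, highest = bins[0], bins[-1]
        print(
            f"{bins_per_octave} bins/octave: {len(sub_bands)} octave sub-bands, "
            f"window {lowest[1]} samples at {lowest[0]:.1f} Hz, "
            f"{highest[1]} samples at {highest[0]:.1f} Hz"
        )
```

In an STFT every bin shares a single window length. Here the window roughly halves with each octave, which is the flexibility the summary attributes to the CQT. In a discriminator along these lines, each octave group would be handed to its own sub-band branch, and each bins-per-octave setting would correspond to one scale.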