cs.SD - 2023-11-14

ChoralSynth: Synthetic Dataset of Choral Singing

  • paper_url: http://arxiv.org/abs/2311.08350
  • repo_url: None
  • paper_authors: Jyoti Narang, Viviana De La Vega, Xavier Lizarraga, Oscar Mayor, Hector Parra, Jordi Janer, Xavier Serra
  • for: This work aims to provide a high-quality choral singing dataset to support Music Information Retrieval (MIR) research.
  • methods: The authors use state-of-the-art synthesizers to create and curate quality renditions; the scores are sourced from the Choral Public Domain Library (CPDL). A score-preparation sketch follows the abstract below.
  • results: The work releases the complete dataset with its associated metadata, together with the methodology, opening new avenues for singing voice research.
    Abstract Choral singing, a widely practiced form of ensemble singing, lacks comprehensive datasets in the realm of Music Information Retrieval (MIR) research, due to challenges arising from the requirement to curate multitrack recordings. To address this, we devised a novel methodology, leveraging state-of-the-art synthesizers to create and curate quality renditions. The scores were sourced from Choral Public Domain Library (CPDL). This work is done in collaboration with a diverse team of musicians, software engineers and researchers. The resulting dataset, complete with its associated metadata, and methodology is released as part of this work, opening up new avenues for exploration and advancement in the field of singing voice research.
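
Below is a minimal, hypothetical sketch of the score-preparation step implied by the abstract: parsing a CPDL MusicXML score into per-voice note and lyric events that a singing synthesizer could render as individual choir stems. The file name and the render_voice() placeholder are assumptions; the paper's actual synthesizer interface is not specified here.

```python
# Minimal sketch: split a CPDL MusicXML score into per-voice note/lyric event
# lists that a singing synthesizer could render as individual choir stems.
# Assumptions: music21 is installed; "score.musicxml" is a score downloaded
# from CPDL; render_voice() stands in for whatever synthesizer the authors
# actually used (not specified in this digest).

from music21 import converter, note

def extract_voice_events(part):
    """Collect (onset_beats, duration_beats, midi_pitch, lyric) tuples for one part."""
    events = []
    for n in part.flatten().notes:
        if isinstance(n, note.Note):
            events.append((float(n.offset), float(n.quarterLength),
                           n.pitch.midi, n.lyric or ""))
    return events

def render_voice(events, tempo_bpm=100):
    """Placeholder for a singing-synthesizer call (hypothetical)."""
    raise NotImplementedError("plug in a singing synthesizer here")

score = converter.parse("score.musicxml")   # hypothetical CPDL download
for part in score.parts:
    events = extract_voice_events(part)
    print(part.partName, len(events), "notes")
    # audio = render_voice(events)  # one stem per voice -> multitrack dataset
```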

Generative De-Quantization for Neural Speech Codec via Latent Diffusion

  • paper_url: http://arxiv.org/abs/2311.08330
  • repo_url: None
  • paper_authors: Haici Yang, Inseon Jang, Minje Kim
  • for: To propose a speech coding network that separates representation learning from reconstruction, improving speech quality while simplifying the network structure.
  • methods: An end-to-end codec learns compact discrete features, and a latent diffusion model de-quantizes them into a high-dimensional continuous space (see the sketch after the abstract).
  • results: In subjective listening tests the model outperforms the state of the art at two low bitrates, 1.5 and 3 kbps, while being more efficient than existing models.
    Abstract In low-bitrate speech coding, end-to-end speech coding networks aim to learn compact yet expressive features and a powerful decoder in a single network. A challenging problem as such results in unwelcome complexity increase and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learning low-dimensional discrete tokens and employ a latent diffusion model to de-quantize coded features into a high-dimensional continuous space, relieving the decoder's burden of de-quantizing and upsampling. To mitigate the issue of over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion space with different dimensions. Subjective listening tests show that our model outperforms the state-of-the-art at two low bitrates, 1.5 and 3 kbps. Codes and samples of this work are available on our webpage.
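
As a rough illustration of the de-quantization idea (not the authors' exact architecture), the sketch below conditions a toy denoiser on the codec's discrete tokens and runs a DDPM-style reverse process started from an intermediate timestep, which is one plausible reading of "midway-infilling". All module names, shapes, and schedules are assumptions.

```python
# Toy latent-diffusion de-quantizer: a denoiser conditioned on codec tokens
# maps a noisy latent toward a clean, high-dimensional continuous latent.
# "Midway infilling" is sketched as starting the reverse process at t_mid,
# initialized from the embedded tokens plus matched noise, not pure noise.
import torch
import torch.nn as nn

class TokenConditionedDenoiser(nn.Module):
    def __init__(self, vocab=1024, token_dim=128, latent_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, token_dim)        # quantized codec tokens
        self.net = nn.Sequential(                          # stand-in for a real UNet/transformer
            nn.Linear(latent_dim + token_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, z_t, tokens, t):
        cond = self.embed(tokens)                          # (B, T, token_dim)
        t_feat = t.float().view(-1, 1, 1).expand(z_t.size(0), z_t.size(1), 1)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))  # predicted noise

@torch.no_grad()
def dequantize(denoiser, tokens, betas, t_mid, latent_dim=512):
    """DDPM-style reverse process started midway from the embedded tokens."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    B, T = tokens.shape
    z = torch.randn(B, T, latent_dim)
    # midway infilling: blend a crude upsample of the token embedding into the start state
    start = denoiser.embed(tokens).repeat(1, 1, latent_dim // 128)
    z = alpha_bar[t_mid].sqrt() * start + (1 - alpha_bar[t_mid]).sqrt() * z
    for t in range(t_mid, -1, -1):
        eps = denoiser(z, tokens, torch.full((B,), t))
        z = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return z  # continuous latent for a neural decoder / vocoder

# usage (toy shapes)
denoiser = TokenConditionedDenoiser()
tokens = torch.randint(0, 1024, (2, 50))
betas = torch.linspace(1e-4, 0.02, 1000)
latent = dequantize(denoiser, tokens, betas, t_mid=400)
print(latent.shape)  # torch.Size([2, 50, 512])
```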

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

  • paper_url: http://arxiv.org/abs/2311.07965
  • repo_url: None
  • paper_authors: Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao
  • for: To improve the performance of neural text-to-speech methods under low-resource conditions.
  • methods: A semi-supervised model that learns from both paired and unpaired data, built around a dynamic quantized representation module (sketched below).
  • results: With less than 120 minutes of paired data, the method outperforms existing approaches on both subjective and objective metrics.
    Abstract Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource condition. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a sequential autoencoder. When given paired data, the module incorporates a trainable codebook that learns quantized representations under the supervision of the paired data. However, due to the limited paired data in low-resource scenario, these paired data are difficult to cover all phonemes. Then unpaired data is fed to expand the dynamic codebook by adding quantized representation vectors that are sufficiently distant from the existing ones during training. Experiments show that with less than 120 minutes of paired data, the proposed method outperforms existing methods in both subjective and objective metrics.
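
The dynamic codebook described above can be illustrated with a short, hypothetical sketch: paired-data features are quantized against existing codes, while unpaired-data features whose distance to every existing code exceeds a threshold are appended as new codes. The class name, dimensions, and threshold are assumptions, not the authors' implementation.

```python
# Toy dynamic codebook: nearest-neighbour quantization for paired data, and
# expansion with unpaired features that are sufficiently far from all codes.
import torch

class DynamicCodebook:
    def __init__(self, dim=256, init_size=64, expand_threshold=10.0):
        self.codes = torch.randn(init_size, dim)   # trainable in a real model
        self.expand_threshold = expand_threshold

    def quantize(self, features):
        """Nearest-neighbour quantization: features (N, dim) -> (indices, quantized)."""
        d = torch.cdist(features, self.codes)       # (N, K) pairwise L2 distances
        idx = d.argmin(dim=1)
        return idx, self.codes[idx]

    def maybe_expand(self, features):
        """For unpaired data: append feature vectors far from every existing code."""
        d = torch.cdist(features, self.codes)
        far = d.min(dim=1).values > self.expand_threshold
        if far.any():
            self.codes = torch.cat([self.codes, features[far].detach()], dim=0)
        return int(far.sum())

# usage (toy): paired features are quantized, unpaired ones may grow the codebook
cb = DynamicCodebook()
idx, q = cb.quantize(torch.randn(32, 256))          # supervised path
added = cb.maybe_expand(torch.randn(32, 256) * 3.0) # unpaired, likely distant
print(cb.codes.shape, "codes after adding", added)
```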