cs.SD - 2023-10-08

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

  • paper_url: http://arxiv.org/abs/2310.05118
  • repo_url: None
  • paper_authors: Yiquan Zhou, Meng Chen, Yi Lei, Jihua Zhu, Weifeng Zhao
  • for: This work presents the T02 team's system for the Singing Voice Conversion Challenge 2023 (SVCC2023), aiming at high-quality singing voice conversion.
  • methods: The system comprises three modules: a feature extractor, a voice converter, and a post-processor (see the sketch below). The feature extractor provides F0 contours and leverages a HuBERT model to extract speaker-independent linguistic content from the input singing voice. The voice converter recombines the target speaker's timbre, the F0, and the linguistic content to generate the target speaker's waveform. To further improve audio quality, a fine-tuned DSPGAN vocoder re-synthesizes the waveform.
  • results: In the official challenge results, the system performs especially well on the cross-domain task, ranking 1st in naturalness and 2nd in similarity. Ablation studies further confirm the effectiveness of the system design.
    Abstract This paper presents the T02 team's system for the Singing Voice Conversion Challenge 2023 (SVCC2023). Our system entails a VITS-based SVC model, incorporating three modules: a feature extractor, a voice converter, and a post-processor. Specifically, the feature extractor provides F0 contours and extracts speaker-independent linguistic content from the input singing voice by leveraging a HuBERT model. The voice converter is employed to recompose the speaker timbre, F0, and linguistic content to generate the waveform of the target speaker. Besides, to further improve the audio quality, a fine-tuned DSPGAN vocoder is introduced to re-synthesise the waveform. Given the limited target speaker data, we utilize a two-stage training strategy to adapt the base model to the target speaker. During model adaptation, several tricks, such as data augmentation and joint training with auxiliary singer data, are involved. Official challenge results show that our system achieves superior performance, especially in the cross-domain task, ranking 1st and 2nd in naturalness and similarity, respectively. Further ablation justifies the effectiveness of our system design.
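
A minimal structural sketch of the three-module pipeline, in Python. Every module body here is an illustrative stand-in (random projections in place of HuBERT, the VITS-based converter, and DSPGAN), not the T02 team's implementation; only the wiring follows the paper's description.

```python
# Structural sketch: feature extraction -> voice conversion -> vocoding.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Stand-in for HuBERT content extraction plus an F0 tracker."""
    def __init__(self, content_dim=256):
        super().__init__()
        self.content_proj = nn.Linear(80, content_dim)  # placeholder for HuBERT

    def forward(self, mel):                     # mel: (B, T, 80)
        content = self.content_proj(mel)        # speaker-independent content
        f0 = mel.mean(dim=-1, keepdim=True)     # placeholder F0 contour
        return content, f0

class VoiceConverter(nn.Module):
    """Recombines content, F0, and a target-speaker embedding (VITS-like)."""
    def __init__(self, content_dim=256, spk_dim=192, hop=256):
        super().__init__()
        self.net = nn.Linear(content_dim + 1 + spk_dim, hop)

    def forward(self, content, f0, spk):        # spk: (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        frames = self.net(torch.cat([content, f0, spk], dim=-1))
        return frames.flatten(1)                # coarse waveform (B, T*hop)

class Vocoder(nn.Module):
    """Post-processor standing in for the fine-tuned DSPGAN vocoder."""
    def forward(self, wav):
        return torch.tanh(wav)                  # placeholder re-synthesis

extractor, converter, vocoder = FeatureExtractor(), VoiceConverter(), Vocoder()
mel = torch.randn(1, 100, 80)                   # dummy input singing voice
spk = torch.randn(1, 192)                       # target speaker embedding
content, f0 = extractor(mel)
wav = vocoder(converter(content, f0, spk))
print(wav.shape)                                # torch.Size([1, 25600])
```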

Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting

  • paper_url: http://arxiv.org/abs/2310.05078
  • repo_url: https://github.com/nii-yamagishilab/partial_rank_similarity
  • paper_authors: Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
  • for: This work proposes a novel objective function for predicting the quality mean opinion score (MOS) of unseen speech synthesis systems.
  • methods: Rather than fitting absolute MOS values with an L1 loss, the proposed objective measures the similarity of the relative positions of predicted MOS values within a mini-batch, i.e., their partial rank similarity (PRS); a sketch of the idea follows this paper's abstract.
  • results: Experiments show that PRS outperforms the L1 loss in zero-shot and semi-supervised settings, correlating more strongly with ground truth; the paper also argues that mean squared error and the linear correlation coefficient may be unreliable metrics for evaluating MOS prediction models.
    Abstract This paper introduces a novel objective function for quality mean opinion score (MOS) prediction of unseen speech synthesis systems. The proposed function measures the similarity of relative positions of predicted MOS values, in a mini-batch, rather than the actual MOS values. That is, the partial rank similarity (PRS) is measured rather than the individual MOS values as with the L1 loss. Our experiments on out-of-domain speech synthesis systems demonstrate that the PRS outperforms L1 loss in zero-shot and semi-supervised settings, exhibiting stronger correlation with ground truth. These findings highlight the importance of considering rank order, as done by PRS, when training MOS prediction models. We also argue that mean squared error and linear correlation coefficient metrics may be unreliable for evaluating MOS prediction models. In conclusion, PRS-trained models provide a robust framework for evaluating speech quality and offer insights for developing high-quality speech synthesis systems. Code and models are available at github.com/nii-yamagishilab/partial_rank_similarity/
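
As a concrete illustration of the rank-ordering idea, here is one plausible pairwise formulation of a within-batch rank loss in Python. The exact PRS objective is defined in the paper and its repository; this hinge-on-pairs variant only sketches the principle of penalizing ordering disagreements instead of absolute errors.

```python
import torch

def pairwise_rank_loss(pred: torch.Tensor, target: torch.Tensor,
                       margin: float = 0.5) -> torch.Tensor:
    """Penalize predicted MOS pairs whose ordering disagrees with the
    ground-truth ordering, ignoring absolute MOS values.

    pred, target: shape (B,), predicted / true MOS for a mini-batch.
    """
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)      # (B, B) pairwise gaps
    sign_true = torch.sign(target.unsqueeze(0) - target.unsqueeze(1))
    # Hinge on each pair: a correctly ordered pair beyond the margin costs 0.
    loss = torch.relu(margin - sign_true * diff_pred)
    mask = sign_true != 0                                  # skip tied pairs
    return loss[mask].mean()

pred = torch.tensor([3.1, 4.2, 2.5], requires_grad=True)
true = torch.tensor([3.0, 4.5, 2.0])
print(pairwise_rank_loss(pred, true))  # orderings agree -> zero loss here
```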

SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation

  • paper_url: http://arxiv.org/abs/2310.05051
  • repo_url: https://github.com/bakerbunker/salt
  • paper_authors: Yuanjun Lv, Jixun Yao, Peikun Chen, Hongbin Zhou, Heng Lu, Lei Xie
  • for: To conceal a speaker's identity without degrading speech quality and intelligibility.
  • methods: SALT, a Speaker Anonymization system based on Latent space Transformation: latent features are extracted by a self-supervised feature extractor, multiple speakers and their weights are randomly sampled, and the latent vectors are interpolated to anonymize the speaker (see the sketch below). An extrapolation method is also explored to further extend the diversity of pseudo speakers.
  • results: On the Voice Privacy Challenge dataset, the system achieves a state-of-the-art distinctiveness metric while preserving speech quality and intelligibility.
    Abstract Speaker anonymization aims to conceal a speaker's identity without degrading speech quality and intelligibility. Most speaker anonymization systems disentangle the speaker representation from the original speech and achieve anonymization by averaging or modifying the speaker representation. However, the anonymized speech is subject to reduction in pseudo speaker distinctiveness, speech quality and intelligibility for out-of-distribution speakers. To solve this issue, we propose SALT, a Speaker Anonymization system based on Latent space Transformation. Specifically, we extract latent features by a self-supervised feature extractor and randomly sample multiple speakers and their weights, and then interpolate the latent vectors to achieve speaker anonymization. Meanwhile, we explore the extrapolation method to further extend the diversity of pseudo speakers. Experiments on the Voice Privacy Challenge dataset show our system achieves a state-of-the-art distinctiveness metric while preserving speech quality and intelligibility. Our code and demo are available at https://github.com/BakerBunker/SALT .
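
The interpolation step lends itself to a short sketch. The Python below mixes the latent vectors of several randomly sampled speakers with random weights, with an optional extrapolation step; the shapes, the candidate-pool representation, and the extrapolation factor are illustrative assumptions, not the authors' API.

```python
import torch

def anonymize_latent(source: torch.Tensor, speaker_pool: torch.Tensor,
                     k: int = 4, extrapolate: bool = False) -> torch.Tensor:
    """source: (D,) latent of the input speaker.
    speaker_pool: (N, D) latents of candidate pseudo speakers."""
    idx = torch.randperm(speaker_pool.size(0))[:k]
    sampled = speaker_pool[idx]                  # (k, D) random speakers
    weights = torch.rand(k)
    weights = weights / weights.sum()            # convex combination
    mixed = (weights.unsqueeze(1) * sampled).sum(dim=0)
    if extrapolate:
        alpha = 1.5  # >1 pushes beyond the interpolated point (assumed value)
        mixed = source + alpha * (mixed - source)
    return mixed

pool = torch.randn(32, 192)   # 32 candidate speakers, 192-dim latents
src = torch.randn(192)
pseudo = anonymize_latent(src, pool, extrapolate=True)
print(pseudo.shape)           # torch.Size([192])
```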

PromptSpeaker: Speaker Generation Based on Text Descriptions

  • paper_url: http://arxiv.org/abs/2310.05001
  • repo_url: None
  • paper_authors: Yongmao Zhang, Guanghou Liu, Yi Lei, Yunlin Chen, Hao Yin, Lei Xie, Zhifei Li
  • for: This work explores text description-based speaker generation, i.e., using text prompts to control the speaker generation process.
  • methods: The proposed PromptSpeaker system consists of a prompt encoder, a zero-shot VITS, and a Glow model (see the sketch below). The prompt encoder predicts a prior distribution from the text description and samples from it to obtain a semantic representation; the Glow model converts the semantic representation into a speaker representation; and the zero-shot VITS finally synthesizes the speaker's voice from that representation.
  • results: Objective metrics verify that PromptSpeaker can generate new speakers unseen in the training set, and the synthetic voices show reasonable subjective matching quality with the speaker prompts.
    Abstract Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify by objective metrics that PromptSpeaker can generate new speakers unseen in the training set, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt.
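
The prompt-to-speaker flow can be summarized in a short sketch. The Python below wires placeholder modules in the order the abstract describes (prompt encoder -> sampled semantic representation -> Glow -> speaker representation); the module internals are stand-ins, and in the real system the Glow mapping is an invertible flow whose output conditions a zero-shot VITS.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Predicts a Gaussian prior over semantic representations from text."""
    def __init__(self, text_dim=512, sem_dim=256):
        super().__init__()
        self.mean = nn.Linear(text_dim, sem_dim)
        self.logvar = nn.Linear(text_dim, sem_dim)

    def forward(self, text_emb):
        mu, logvar = self.mean(text_emb), self.logvar(text_emb)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)   # sample a semantic rep

class GlowStandIn(nn.Module):
    """Stand-in for the invertible Glow mapping semantic -> speaker space."""
    def __init__(self, sem_dim=256, spk_dim=256):
        super().__init__()
        self.flow = nn.Linear(sem_dim, spk_dim)     # a real Glow is invertible

    def forward(self, semantic):
        return self.flow(semantic)

text_emb = torch.randn(1, 512)  # embedding of a prompt, e.g. "a deep male voice"
semantic = PromptEncoder()(text_emb)
speaker_rep = GlowStandIn()(semantic)
# speaker_rep would condition a zero-shot VITS to synthesize the voice.
print(speaker_rep.shape)        # torch.Size([1, 256])
```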