eess.AS - 2023-07-19

An analysis on the effects of speaker embedding choice in non auto-regressive TTS

  • paper_url: http://arxiv.org/abs/2307.09898
  • repo_url: None
  • paper_authors: Adriana Stan, Johannah O’Mahony
  • for: The paper studies how a non-autoregressive multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets, with the goal of improving quality for target speaker identities.
  • methods: The paper introduces a non-autoregressive factorised multi-speaker speech synthesis architecture and analyses whether different embedding sets and learning strategies (joint learning of the representations, initialisation from pretrained models) yield any quality improvements (see the conditioning sketch after the abstract below).
  • results: The study finds that, regardless of the embedding set and learning strategy used, the network handles the various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage into the core structure of the synthesis system is inevitable under the standard training procedures adopted thus far.
    Abstract In this paper we introduce a first attempt on understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse if jointly learning the representations, and initialising them from pretrained models determine any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network's core speech abstraction (i.e. zero conditioned) in terms of speaker identity and representation learning. We show that, regardless of the used set of embeddings and learning strategy, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training procedures adopted thus far.
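    A minimal PyTorch sketch of the conditioning setup the abstract describes. This is not the paper's code: the module name, dimensions (`d_model`, `spk_dim`) and the additive conditioning point are illustrative assumptions, chosen only to show how an external speaker embedding enters a non-autoregressive encoder and how the "zero conditioned" core abstraction can be probed by feeding an all-zero embedding.

```python
# Hedged sketch: speaker-embedding conditioning in a non-autoregressive TTS
# encoder, plus a zero-conditioned pass. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, d_model: int = 256, spk_dim: int = 192, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Project the (pretrained or jointly learned) speaker embedding
        # into the model dimension.
        self.spk_proj = nn.Linear(spk_dim, d_model)

    def forward(self, phoneme_emb: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # phoneme_emb: (batch, seq_len, d_model); spk_emb: (batch, spk_dim)
        hidden = self.encoder(phoneme_emb)
        # Broadcast-add the projected speaker embedding to every frame.
        return hidden + self.spk_proj(spk_emb).unsqueeze(1)


encoder = SpeakerConditionedEncoder()
phonemes = torch.randn(2, 50, 256)                          # dummy phoneme encodings
spk = torch.randn(2, 192)                                   # e.g. an x-vector or table entry
conditioned = encoder(phonemes, spk)                        # target-speaker pass
zero_conditioned = encoder(phonemes, torch.zeros(2, 192))   # probe the core abstraction
```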

Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder

  • paper_url: http://arxiv.org/abs/2307.09871
  • repo_url: None
  • paper_authors: Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li
  • for: The paper aims to learn robust acoustic word embeddings (AWEs) from a large-scale unlabelled speech corpus.
  • methods: The proposed Correspondence Transformer Encoder (CTE) uses a teacher-student learning framework and pre-trains the model with a word-level loss that pulls different acoustic realisations of the same word close together in the underlying embedding space (see the sketch after the abstract below).
  • results: The embeddings extracted from the CTE model are robust to speech variations, such as speakers and domains, and achieve new state-of-the-art performance in a low-resource cross-lingual setting (Xitsonga).
Abstract Acoustic word embeddings (AWEs) aim to map a variable-length speech segment into a fixed-dimensional representation. High-quality AWEs should be invariant to variations, such as duration, pitch and speaker. In this paper, we introduce a novel self-supervised method to learn robust AWEs from a large-scale unlabelled speech corpus. Our model, named Correspondence Transformer Encoder (CTE), employs a teacher-student learning framework. We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space. Specifically, we feed the teacher and student encoder with different acoustic instances of the same word and pre-train the model with a word-level loss. Our experiments show that the embeddings extracted from the proposed CTE model are robust to speech variations, e.g. speakers and domains. Additionally, when evaluated on Xitsonga, a low-resource cross-lingual setting, the CTE model achieves new state-of-the-art performance.
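    A minimal sketch of the teacher-student idea the abstract describes: two acoustic instances of the same word are fed to a student and a teacher encoder, and a word-level loss pulls their pooled embeddings together. The EMA-updated teacher, mean pooling, cosine-based loss and encoder dimensions are illustrative assumptions, not the released CTE implementation.

```python
# Hedged sketch of a teacher-student word-level loss over paired acoustic
# instances of the same word. Architecture and loss details are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordEncoder(nn.Module):
    """Maps a variable-length acoustic segment to a fixed-dimensional AWE."""

    def __init__(self, feat_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> (batch, d_model) via mean pooling
        return self.encoder(self.proj(feats)).mean(dim=1)


student = WordEncoder()
teacher = copy.deepcopy(student)        # teacher tracks the student via EMA
for p in teacher.parameters():
    p.requires_grad_(False)

view_a = torch.randn(8, 60, 80)         # one realisation of each word (dummy features)
view_b = torch.randn(8, 45, 80)         # a different realisation of the same words

student_emb = F.normalize(student(view_a), dim=-1)
with torch.no_grad():
    teacher_emb = F.normalize(teacher(view_b), dim=-1)

# Word-level correspondence loss: matching pairs should be close (cosine distance).
loss = (1 - (student_emb * teacher_emb).sum(dim=-1)).mean()
loss.backward()

# EMA update of the teacher from the student (momentum value is an assumption).
with torch.no_grad():
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(0.999).add_(s, alpha=0.001)
```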