cs.SD - 2023-07-15

Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features

  • paper_url: http://arxiv.org/abs/2307.07683
  • repo_url: https://github.com/audio-df-ucb/clonedvoicedetection
  • paper_authors: Sarah Barrington, Romit Barua, Gautham Koorma, Hany Farid
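  • for: This study aims to differentiate real from cloned voices, particularly in the context of synthetic-voice cloning technologies.
  • methods: The study uses three approaches to distinguish real from cloned voices: one based on low-dimensional perceptual features, one based on generic spectral features, and one based on end-to-end learned features.
  • results: The study shows that all three approaches can accurately distinguish real from cloned voices, especially when multiple audio examples are used. The learned features consistently yield an equal error rate between $0\%$ and $4\%$, and are reasonably robust to adversarial laundering.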
    Abstract Synthetic-voice cloning technologies have seen significant advances in recent years, giving rise to a range of potential harms. From small- and large-scale financial fraud to disinformation campaigns, the need for reliable methods to differentiate real and synthesized voices is imperative. We describe three techniques for differentiating a real from a cloned voice designed to impersonate a specific person. These three approaches differ in their feature extraction stage with low-dimensional perceptual features offering high interpretability but lower accuracy, to generic spectral features, and end-to-end learned features offering less interpretability but higher accuracy. We show the efficacy of these approaches when trained on a single speaker's voice and when trained on multiple voices. The learned features consistently yield an equal error rate between $0\%$ and $4\%$, and are reasonably robust to adversarial laundering.
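    Summary Synthetic-voice cloning technology has advanced significantly in recent years, giving rise to a range of potential harms. With threats ranging from small- and large-scale financial fraud to disinformation campaigns, reliable methods for distinguishing real from cloned voices are needed. We describe three methods for doing so, which differ in their feature-extraction stage. Low-dimensional perceptual features offer high interpretability but lower accuracy, while generic spectral features and end-to-end learned features offer higher accuracy but less interpretability. We show how these methods perform when trained on a single speaker's voice and on multiple voices, and demonstrate their reliability across these settings. The learned features consistently yield an equal error rate between 0% and 4% and remain reasonably robust to adversarial laundering.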
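
The results above are reported as an equal error rate (EER), the operating point at which the rate of real voices flagged as cloned equals the rate of cloned voices missed. The following is a minimal sketch of how an EER can be approximated from detector scores. It is not taken from the paper or its repository: the function name `equal_error_rate`, the convention that a higher score means "more likely cloned", and the threshold sweep are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    Assumed conventions (not from the paper):
      scores: higher means "more likely cloned"
      labels: 1 for a cloned voice, 0 for a real voice
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "flag nothing as cloned" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        predicted_cloned = scores >= t
        # Fraction of real voices wrongly flagged as cloned.
        false_positive_rate = np.mean(predicted_cloned[labels == 0])
        # Fraction of cloned voices that slipped through as real.
        false_negative_rate = np.mean(~predicted_cloned[labels == 1])
        gap = abs(false_positive_rate - false_negative_rate)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (false_positive_rate + false_negative_rate) / 2.0
    return float(eer)

# Toy example: two real and two cloned clips that are perfectly separable.
print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # 0.0
```

On a finite test set the two error rates rarely cross exactly, so this sketch reports the average of the two at the threshold where they are closest. Interpolating along the ROC curve is a common alternative.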

Towards spoken dialect identification of Irish

  • paper_url: http://arxiv.org/abs/2307.07436
  • repo_url: None
  • paper_authors: Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide
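  • for: This study aims to develop a spoken dialect identification system for Irish, with a view to improving the accuracy of Irish speech recognition.
  • methods: The study tests two acoustic classification models, XLS-R and ECAPA-TDNN, together with a text-based classifier built on a pretrained Irish-language BERT model. ECAPA-TDNN performed best overall, with an accuracy of 73%, and fusing its outputs with the text-based model raised accuracy to 76%.
  • results: The Ulster dialect was identified most accurately, with an accuracy of 94%. However, the model struggled to distinguish between the Connacht and Munster dialects, suggesting that a more nuanced approach may be needed to tell these dialects apart reliably.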
    Abstract The Irish language is rich in its diversity of dialects and accents. This compounds the difficulty of creating a speech recognition system for the low-resource language, as such a system must contend with a high degree of variability with limited corpora. A recent study investigating dialect bias in Irish ASR found that balanced training corpora gave rise to unequal dialect performance, with performance for the Ulster dialect being consistently worse than for the Connacht or Munster dialects. Motivated by this, the present experiments investigate spoken dialect identification of Irish, with a view to incorporating such a system into the speech recognition pipeline. Two acoustic classification models are tested, XLS-R and ECAPA-TDNN, in conjunction with a text-based classifier using a pretrained Irish-language BERT model. The ECAPA-TDNN, particularly a model pretrained for language identification on the VoxLingua107 dataset, performed best overall, with an accuracy of 73%. This was further improved to 76% by fusing the model's outputs with the text-based model. The Ulster dialect was most accurately identified, with an accuracy of 94%, however the model struggled to disambiguate between the Connacht and Munster dialects, suggesting a more nuanced approach may be necessary to robustly distinguish between the dialects of Irish.
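    Summary The Irish language is rich in dialects and accents, which complicates building a speech recognition system for this low-resource language, since the system must handle a high degree of variability with limited corpora. A recent study of dialect bias in Irish ASR found that balanced training corpora led to unequal dialect performance, with the Ulster dialect performing consistently worse than the Connacht and Munster dialects. Motivated by this, the present experiments investigate spoken dialect identification of Irish, with a view to integrating it into the speech recognition pipeline. Two acoustic classification models, XLS-R and ECAPA-TDNN, were tested alongside a text classifier based on an Irish-language BERT model. The ECAPA-TDNN model, in particular one pretrained for language identification on the VoxLingua107 dataset, performed best, with an accuracy of 73%. Fusing its outputs with those of the text classifier raised accuracy to 76%. The Ulster dialect was identified most accurately, at 94%, but the model struggled to disambiguate the Connacht and Munster dialects, suggesting that a more nuanced approach may be needed to distinguish the dialects of Irish precisely.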
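
The abstract states that fusing the acoustic model's outputs with the text-based model raised accuracy from 73% to 76%, but it does not say how the fusion was done. One simple possibility is late fusion by weighted averaging of per-class probabilities, sketched below. The function names, the `acoustic_weight` parameter, and the example probabilities are hypothetical and chosen only for illustration; they are not the authors' method or data.

```python
import numpy as np

# Class order is an assumption made for this sketch.
DIALECTS = ["Ulster", "Connacht", "Munster"]

def fuse_posteriors(acoustic_probs, text_probs, acoustic_weight=0.5):
    """Late fusion: weighted average of two per-class probability vectors."""
    acoustic = np.asarray(acoustic_probs, dtype=float)
    text = np.asarray(text_probs, dtype=float)
    if acoustic.shape != text.shape:
        raise ValueError("both models must score the same set of classes")
    fused = acoustic_weight * acoustic + (1.0 - acoustic_weight) * text
    # Renormalise so the fused scores still sum to one.
    return fused / fused.sum()

def predict_dialect(acoustic_probs, text_probs, acoustic_weight=0.5):
    """Return the dialect label with the highest fused probability."""
    fused = fuse_posteriors(acoustic_probs, text_probs, acoustic_weight)
    return DIALECTS[int(np.argmax(fused))]

# Made-up scores: the acoustic model leans Connacht, the text model Munster.
acoustic = [0.20, 0.45, 0.35]
text = [0.10, 0.30, 0.60]
print(predict_dialect(acoustic, text))                       # Munster
print(predict_dialect(acoustic, text, acoustic_weight=1.0))  # Connacht
```

In this toy case the text model's stronger preference tips the fused decision, which illustrates how a second modality could help with a confusable pair such as Connacht and Munster. In practice the weight would be tuned on held-out data.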