eess.AS - 2023-11-17

GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System

  • paper_url: http://arxiv.org/abs/2311.10689
  • repo_url: None
  • paper_authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He
  • for: This paper investigates privacy issues in speech recognition systems; specifically, an attacker can extract speaker information simply by querying a speech recognition system.
  • methods: The paper proposes GhostVec, a simple and efficient attack method that extracts speaker information from a transformer-based, encoder-decoder ASR system without any external speaker verification system or natural human voice as a reference.
  • results: Experimental results show that GhostVec extracts speaker information from the transformer-based ASR system, with the synthesized audio reaching 10.83% EER and 0.47 minDCF against target speakers, demonstrating the effectiveness of the proposed method.
    Abstract Speaker adaptation systems face privacy concerns because such systems are trained on private datasets and often overfit. This paper demonstrates that an attacker can extract speaker information by querying speaker-adapted speech recognition (ASR) systems. We focus on the speaker information of a transformer-based ASR and propose GhostVec, a simple and efficient attack method to extract speaker information from an encoder-decoder-based ASR system without any external speaker verification system or natural human voice as a reference. To make our results quantitative, we pre-process GhostVec using singular value decomposition (SVD) and synthesize it into a waveform. Experiment results show that the synthesized audio of GhostVec reaches 10.83% EER and 0.47 minDCF with target speakers, which suggests the effectiveness of the proposed method. We hope the preliminary discovery in this study will catalyze future speech recognition research on privacy-preserving topics.
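Since the attack's success is reported as EER and minDCF, a minimal sketch of how these two standard speaker-verification metrics are computed from raw trial scores may help. This is a generic reimplementation and an assumption on our part, not the paper's released code:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where false-reject rate equals false-accept rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]              # sweep the threshold upward
    frr = np.cumsum(labels) / labels.sum()           # targets rejected below each threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets still accepted
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds (NIST-style)."""
    best = np.inf
    for t in np.concatenate([target_scores, nontarget_scores]):
        p_miss = np.mean(target_scores < t)
        p_fa = np.mean(nontarget_scores >= t)
        best = min(best, c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa)
    return best / min(c_miss * p_target, c_fa * (1 - p_target))

# toy usage with synthetic verification scores
rng = np.random.default_rng(0)
tgt, non = rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)
print(f"EER={eer(tgt, non):.4f}  minDCF={min_dcf(tgt, non):.4f}")
```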

Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

  • paper_url: http://arxiv.org/abs/2311.10664
  • repo_url: None
  • paper_authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He
  • for: Hiding speaker identity; current methods, especially those using self-supervised learning (SSL) models, require massive computational resources.
  • methods: The paper proposes an effective, parameter-efficient speaker anonymization method based on recent end-to-end model reprogramming technology: speaker representations are first extracted from a large SSL model and then reprogrammed into a pseudo domain to hide the speaker's identity.
  • results: Extensive experiments on the VoicePrivacy Challenge (VPC) 2022 datasets demonstrate the effectiveness of the proposed parameter-efficient anonymization method, which also consumes fewer computational resources during anonymization.
    Abstract Current speaker anonymization methods, especially those based on self-supervised learning (SSL) models, require massive computational resources when hiding speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent end-to-end model reprogramming technology. To improve anonymization performance, we first extract speaker representations from large SSL models as the speaker identity. To hide the speaker's identity, we reprogram the speaker representation by adapting the speaker to a pseudo domain. Extensive experiments are carried out on the VoicePrivacy Challenge (VPC) 2022 datasets to demonstrate the effectiveness of our proposed parameter-efficient learning anonymization method. Additionally, while achieving performance comparable to the VPC 2022 strong baseline 1.b, our approach consumes fewer computational resources during anonymization.
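The paper does not include code, but the general idea of end-to-end model reprogramming can be sketched as follows: freeze the large SSL backbone and train only a small additive input pattern that steers the speaker embedding toward a pseudo-speaker domain. The PyTorch sketch below is illustrative; the `ssl_encoder` interface, the waveform-level perturbation, and the cosine loss toward a `pseudo_centroid` are all assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReprogrammedAnonymizer(nn.Module):
    """Wrap a frozen SSL speaker encoder with a small trainable input
    perturbation ("reprogramming pattern") so that its output embedding
    is steered toward a pseudo-speaker domain."""

    def __init__(self, ssl_encoder, wave_len=16000):
        super().__init__()
        self.encoder = ssl_encoder               # any module mapping (batch, wave_len) -> (batch, dim)
        for p in self.encoder.parameters():      # the backbone stays frozen
            p.requires_grad = False
        # the only trainable parameters: one additive waveform pattern
        self.delta = nn.Parameter(torch.zeros(wave_len))

    def forward(self, wav):                      # wav: (batch, wave_len)
        return self.encoder(wav + self.delta)    # embedding of the perturbed input

def train_step(model, wav, pseudo_centroid, optimizer):
    """Pull the reprogrammed embedding toward a fixed pseudo-speaker centroid."""
    emb = model(wav)
    loss = 1.0 - F.cosine_similarity(emb, pseudo_centroid.expand_as(emb)).mean()
    optimizer.zero_grad()
    loss.backward()                              # gradients flow only into model.delta
    optimizer.step()
    return loss.item()
```

The parameter efficiency comes from the fact that only `delta` is optimized, so anonymization training touches a tiny fraction of the SSL model's parameter count.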

LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement

  • paper_url: http://arxiv.org/abs/2311.10656
  • repo_url: None
  • paper_authors: Zili Qi, Xinhui Hu, Wangjin Zhou, Sheng Li, Hao Wu, Jian Lu, Xinkang Xu
  • for: This paper proposes a novel fusion model for MOS (Mean Opinion Score) prediction that combines supervised and unsupervised approaches to improve the accuracy of predicted subjective evaluations of speech synthesis systems, especially on out-of-domain test sets.
  • methods: The fusion model combines supervised and unsupervised techniques: a pre-trained self-supervised learning predictor with a listener enhancement branch, fine-tuning of a unit language model, and ensemble learning with ASR confidence.
  • results: Experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS outperforms the baseline, and the fusion system gains a further 13% absolute improvement over LE-SSL-MOS on the noisy and enhanced speech track. The system ranked 1st in the French speech synthesis track and 2nd in the challenge's noisy and enhanced speech track.
    Abstract Recently, researchers have shown increasing interest in automatically predicting the subjective evaluation of speech synthesis systems. This prediction is a challenging task, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we developed an SSL-based predictor called LE-SSL-MOS. LE-SSL-MOS utilizes pre-trained self-supervised learning models and further improves prediction accuracy by utilizing the opinion scores of each utterance in the listener enhancement branch. The unsupervised side comprises two steps: we fine-tuned the unit language model (ULM) on highly intelligible in-domain data to improve the correlation of an unsupervised metric, SpeechLMScore, and we utilized ASR confidence as a new metric with the help of ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsupervised methods for MOS prediction. With these approaches, our experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS performs better than the baseline. Our fusion system achieved an absolute improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our system ranked 1st and 2nd, respectively, in the French speech synthesis track and the challenge's noisy and enhanced speech track.
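One plausible way to realize the described supervised/unsupervised fusion is a learned linear ensemble over per-utterance scores (the supervised LE-SSL-MOS prediction, SpeechLMScore, and ASR confidence). The sketch below uses a ridge regressor on synthetic data purely for illustration; the feature set and the fusion model are assumptions, as the paper does not publish its fusion weights:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

def fuse_mos_predictors(train_feats, train_mos, test_feats):
    """Each feature row: [supervised MOS prediction, SpeechLMScore, ASR confidence]."""
    return Ridge(alpha=1.0).fit(train_feats, train_mos).predict(test_feats)

# toy usage with synthetic per-utterance scores
rng = np.random.default_rng(0)
true_mos = rng.uniform(1.0, 5.0, 200)
# three noisy "predictors" of the true MOS, with different reliabilities
feats = np.stack([true_mos + rng.normal(0, s, 200) for s in (0.3, 0.8, 1.2)], axis=1)
pred = fuse_mos_predictors(feats[:150], true_mos[:150], feats[150:])
print("Spearman rho on held-out set:", round(spearmanr(pred, true_mos[150:])[0], 3))
```

Because MOS challenges rank systems by correlation rather than absolute error, a simple learned weighting like this can already capture most of the benefit of adding the unsupervised metrics.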