cs.SD - 2023-09-02

Timbre-reserved Adversarial Attack in Speaker Identification

  • paper_url: http://arxiv.org/abs/2309.00929
  • repo_url: None
  • paper_authors: Qing Wang, Jixun Yao, Li Zhang, Pengcheng Guo, Lei Xie
  • for: This study aims to make attacks on a speaker identification (SID) system not only exploit the vulnerability of the SID model but also preserve the timbre of the target speaker.
  • methods: Adversarial audios are generated by adding an adversarial constraint during different training stages of a voice conversion (VC) model. Specifically, a speaker classifier joins the VC model training so that the target speaker label optimizes the adversarial perturbation added to the VC model representations, steering the generated audio toward the target speaker's timbre (a minimal training-step sketch follows the abstract below).
  • results: Adding the adversarial constraint to VC model training yields timbre-reserved adversarial audios that preserve the target speaker's timbre while leading the SID system to the attacker's desired decision.
    Abstract As a type of biometric identification, a speaker identification (SID) system is confronted with various kinds of attacks. Spoofing attacks typically imitate the timbre of the target speaker, while adversarial attacks confuse the SID system by adding a well-designed adversarial perturbation to an arbitrary speech signal. Although a spoofing attack copies a timbre similar to the victim's, it does not exploit the vulnerability of the SID model and may not make the SID system give the attacker's desired decision. As for the adversarial attack, although the SID system can be led to a designated decision, the attack cannot meet the text or speaker-timbre requirements of specific attack scenarios. In this study, to make the attack in SID not only leverage the vulnerability of the SID model but also reserve the timbre of the target speaker, we propose a timbre-reserved adversarial attack on speaker identification. We generate timbre-reserved adversarial audios by adding an adversarial constraint during the different training stages of the voice conversion (VC) model. Specifically, the adversarial constraint uses the target speaker label to optimize the adversarial perturbation added to the VC model representations and is implemented by a speaker classifier that joins the VC model training. The adversarial constraint helps control the VC model to generate speaker-specific audio. Eventually, the audio generated at VC inference is the desired adversarial fake audio, which is timbre-reserved and can fool the SID system.
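    A minimal PyTorch-style sketch of the adversarial constraint described above. The module interfaces (a VC model returning intermediate representations plus converted mel-spectrograms, a jointly trained speaker classifier) and the weight lambda_adv are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def training_step(vc_model, spk_classifier, batch, target_spk_id, lambda_adv=0.1):
    """One VC training step with the adversarial constraint added.

    Assumed (hypothetical) interfaces:
      vc_model(source_mel, target_spk_emb) -> (representations, converted_mel)
      spk_classifier(representations)      -> speaker logits
    """
    repr_, converted = vc_model(batch["source_mel"], batch["target_spk_emb"])

    # Ordinary VC objective: reconstruct speech in the target timbre.
    vc_loss = F.l1_loss(converted, batch["target_mel"])

    # Adversarial constraint: a speaker classifier joins VC training and uses
    # the target speaker label to optimize the perturbation carried by the
    # VC representations, so the generated audio fools the SID system.
    logits = spk_classifier(repr_)
    labels = torch.full((logits.size(0),), target_spk_id,
                        dtype=torch.long, device=logits.device)
    adv_loss = F.cross_entropy(logits, labels)

    return vc_loss + lambda_adv * adv_loss
```

    Per the abstract, the VC model alone then produces the timbre-reserved adversarial audio at inference; no extra perturbation step is applied afterwards.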

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin

  • paper_url: http://arxiv.org/abs/2309.00883
  • repo_url: None
  • paper_authors: Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
  • for: This study aims to improve the naturalness and emotion expressiveness of cross-lingual TTS.
  • methods: Proposes DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method for text-to-speech that transfers emotion from a source speaker to intra- and cross-lingual target speakers, improving naturalness and emotion expressiveness after the language transfer.
  • results: Experiments show that DiCLET-TTS outperforms various competitive models and that OP-EDM learns speaker-irrelevant yet emotion-discriminative embeddings (a projection sketch follows the abstract below).
    Abstract While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Moreover, current cross-lingual methods ignore emotion modeling, even though emotion is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness caused by speaker disentanglement in the emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and confirm the effectiveness of OP-EDM in learning speaker-irrelevant but emotion-discriminative embeddings.
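    The terminal-distribution parameterization can be stated compactly. A hedged formulation, assuming a Grad-TTS-style forward process in which the noising SDE drifts toward an emotion-conditioned linguistic mean instead of zero (the paper's exact noise schedule and conditioning may differ); here \(\mu_\theta(y, e)\) denotes the prior text encoder's output for text \(y\) under emotion embedding \(e\):

```latex
% Hedged sketch: forward diffusion with an emotion-conditioned linguistic prior.
\mathrm{d}x_t = \tfrac{1}{2}\,\beta_t \bigl(\mu_\theta(y, e) - x_t\bigr)\,\mathrm{d}t
              + \sqrt{\beta_t}\,\mathrm{d}W_t,
\qquad
x_T \sim \mathcal{N}\bigl(\mu_\theta(y, e),\, I\bigr)
```

    The reverse process (the condition-enhanced DPM decoder) then denoises from this speaker-irrelevant but emotion-related prior toward the target speaker's spectrogram.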
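    Taking OP-EDM at face value as an orthogonal projection, a minimal sketch of its core operation: removing the component of the emotion embedding that lies along the speaker embedding. The tensor shapes and the assumption that both embeddings live in the same space are illustrative; the module's actual learned layers are not shown:

```python
import torch

def orthogonal_project(emotion_emb: torch.Tensor,
                       speaker_emb: torch.Tensor) -> torch.Tensor:
    """Project the emotion embedding onto the orthogonal complement of
    the speaker embedding:

        e_perp = e - (<e, s> / <s, s>) * s

    so e_perp carries no component along the speaker direction, i.e. it is
    speaker-irrelevant while retaining emotion-discriminative information.
    Both inputs are (batch, dim) tensors in a shared embedding space.
    """
    dot = (emotion_emb * speaker_emb).sum(dim=-1, keepdim=True)
    norm_sq = (speaker_emb * speaker_emb).sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return emotion_emb - (dot / norm_sq) * speaker_emb
```

    Per the abstract, the emotion embedding conditions the prior text encoder; the projection keeps that condition free of speaker identity.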