cs.SD - 2023-12-06

Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

  • paper_url: http://arxiv.org/abs/2312.03620
  • repo_url: https://github.com/wenet-e2e/wespeaker
  • paper_authors: Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
  • for: The paper aims to improve the performance of ResNet models in speaker verification tasks.
  • methods: The authors research and optimize the stride configuration of ResNet models to improve the performance in speaker verification tasks.
  • results: The authors propose a new principle called Golden Gemini, which can significantly improve the performance of ResNet models on various datasets, while reducing the number of parameters and computational cost.
    Abstract Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
    摘要 We represent the stride space on a trellis diagram and conduct a systematic study on the impact of temporal and frequency resolutions on performance. Our results identify two optimal points, which we refer to as the "Golden Gemini" principle. By following this principle, we design a state-of-the-art ResNet baseline model that achieves significant performance improvements on VoxCeleb, SITW, and CNCeleb datasets with an average EER/minDCF reduction of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101). This is achieved while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to this model as the "Gemini ResNet".Our further investigation shows that the proposed Golden Gemini operating points are effective across various training conditions and architectures. Additionally, we introduce a new benchmark, the "Gemini DF-ResNet", which uses a cutting-edge model.
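The core stride-configuration idea can be illustrated with a small shape calculation. This is a hedged sketch: the asymmetric schedule below is hypothetical, not the paper's actual Golden Gemini operating points; it only demonstrates how applying smaller strides on the time axis preserves temporal resolution while frequency is downsampled as usual.

```python
# Toy sketch (not the paper's exact configuration): compares how a stride
# schedule propagates through a ResNet-style stack of downsampling stages.
# Image-style ResNets stride both axes equally, e.g. (2, 2) per stage; the
# Golden Gemini principle is to treat the asymmetric time and frequency axes
# of speech differently when choosing (time_stride, freq_stride) pairs.

def output_shape(time, freq, strides):
    """Apply a list of (time_stride, freq_stride) pairs to a (time, freq) map."""
    for t_s, f_s in strides:
        # ceil division models 'same' padding followed by a strided convolution
        time = -(-time // t_s)
        freq = -(-freq // f_s)
    return time, freq

# 200 frames x 80 mel bins, four downsampling stages
symmetric  = [(2, 2)] * 4                      # default image-style schedule
asymmetric = [(1, 2), (2, 2), (1, 2), (2, 2)]  # hypothetical time-sparing schedule

print(output_shape(200, 80, symmetric))   # (13, 5)
print(output_shape(200, 80, asymmetric))  # (50, 5)
```

At the same final frequency resolution, the time-sparing schedule keeps roughly four times as many output frames, which is the kind of trade-off the trellis-diagram search in the paper explores systematically.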

Detecting Voice Cloning Attacks via Timbre Watermarking

  • paper_url: http://arxiv.org/abs/2312.03410
  • repo_url: None
  • paper_authors: Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, Nenghai Yu
  • for: This work aims to reliably detect and defend against voice cloning attacks, protecting the timbre of publicly released audio from being impersonated.
  • methods: The method builds on a novel concept, "Timbre Watermarking", which embeds watermark information into the target individual's speech to ultimately defeat voice cloning attacks. To ensure the watermark survives the voice cloning model's learning process, the authors design an end-to-end voice-cloning-resistant detection framework.
  • results: Experiments show that the proposed timbre watermarking robustly defends against different voice cloning attacks and achieves practicality in real-world services (e.g., PaddleSpeech, Voice-Cloning-App, so-vits-svc). Ablation studies further verify the effectiveness of the design. Some audio samples are available at https://timbrewatermarking.github.io/samples.
    Abstract Nowadays, it is common to release audio content to the public. However, with the rise of voice cloning technology, attackers have the potential to easily impersonate a specific person by utilizing his publicly released audio without any permission. Therefore, it becomes significant to detect any potential misuse of the released audio content and protect its timbre from being impersonated. To this end, we introduce a novel concept, "Timbre Watermarking", which embeds watermark information into the target individual's speech, eventually defeating the voice cloning attacks. To ensure the watermark is robust to the voice cloning model's learning process, we design an end-to-end voice cloning-resistant detection framework. The core idea of our solution is to embed and extract the watermark in the frequency domain in a temporally invariant manner. To acquire generalization across different voice cloning attacks, we modulate their shared process and integrate it into our framework as a distortion layer. Experiments demonstrate that the proposed timbre watermarking can defend against different voice cloning attacks, exhibit strong resistance against various adaptive attacks (e.g., reconstruction-based removal attacks, watermark overwriting attacks), and achieve practicality in real-world services such as PaddleSpeech, Voice-Cloning-App, and so-vits-svc. In addition, ablation studies are also conducted to verify the effectiveness of our design. Some audio samples are available at https://timbrewatermarking.github.io/samples.
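The "temporally invariant" embedding idea from the abstract can be sketched on a toy spectrogram. This is a loose illustration under stated assumptions: `embed`, `extract`, the bin layout, and the gain are all hypothetical helpers, not the paper's actual scheme; the point is only that repeating the same frequency-domain pattern in every frame makes the mark survive frame cropping.

```python
# Hedged toy sketch: repeat one watermark pattern in the frequency axis of
# *every* spectrogram frame, so dropping or reordering frames (as a voice
# cloning pipeline's preprocessing may do) does not erase the mark.
# GAIN, bin indices, and the detection rule are illustrative only.

GAIN = 1.5  # magnitude boost applied to watermarked frequency bins

def embed(spectrogram, bits, start_bin=0):
    """Boost bin (start_bin + i) in every frame where bits[i] == 1."""
    marked = []
    for frame in spectrogram:
        frame = list(frame)  # copy; leave the input untouched
        for i, bit in enumerate(bits):
            if bit:
                frame[start_bin + i] *= GAIN
        marked.append(frame)
    return marked

def extract(spectrogram, n_bits, start_bin=0):
    """Recover bits by averaging each bin across whatever frames survived.

    Assumes a mixed payload (both 0s and 1s), so the mean over the payload
    bins serves as a crude per-utterance reference level.
    """
    n_frames = len(spectrogram)
    means = [sum(f[start_bin + i] for f in spectrogram) / n_frames
             for i in range(n_bits)]
    baseline = sum(means) / n_bits
    return [1 if m > baseline else 0 for m in means]

spec = [[1.0] * 8 for _ in range(6)]  # flat 6-frame, 8-bin toy spectrogram
bits = [1, 0, 1, 0]
marked = embed(spec, bits)
print(extract(marked, 4))        # [1, 0, 1, 0]
print(extract(marked[2:5], 4))   # [1, 0, 1, 0] -- survives frame cropping
```

A real system would of course embed imperceptibly in actual STFT magnitudes and train the extractor end-to-end through a distortion layer modeling the cloning process, as the paper describes; this sketch only shows why per-frame repetition buys temporal invariance.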