Results: Experimental results show that the proposed method achieves a 19% improvement on the Voxceleb1 dataset and surpasses many existing state-of-the-art techniques.

Abstract
Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Performing data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Data augmentation, however, demands meticulous calibration to keep speaker-specific information intact, which is difficult to achieve without speaker labels. To address this issue, we introduce a novel framework that incorporates both clean and augmented segments into the contrastive training pipeline. The clean segments are repurposed to pair with the noisy segments, forming additional positive and negative pairs. Moreover, the contrastive loss is weighted to increase the difference between the clean and augmented embeddings of different speakers. Experimental results on Voxceleb1 suggest that the proposed framework achieves a remarkable 19% improvement over conventional methods and surpasses many existing state-of-the-art techniques.
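As a rough illustration of the clean/augmented pairing and loss weighting described above, here is a minimal PyTorch sketch. The function name, the particular weighting scheme (scaling the negative similarities), and all hyperparameters are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F


def weighted_contrastive_loss(z_clean, z_aug, temperature=0.1, neg_weight=2.0):
    """Weighted InfoNCE over clean/augmented embedding pairs.

    z_clean, z_aug: (N, D) embeddings of the clean and augmented segments
    of the same N utterances, so (z_clean[i], z_aug[i]) is a positive pair
    and all cross-utterance pairs act as negatives. `neg_weight` scales the
    negative similarities to push clean and augmented embeddings of
    different speakers further apart (an assumed weighting scheme).
    """
    z_clean = F.normalize(z_clean, dim=1)
    z_aug = F.normalize(z_aug, dim=1)

    # Cosine similarity between every clean and every augmented segment.
    sim = z_clean @ z_aug.t() / temperature  # (N, N)

    # Up-weight the off-diagonal (different-speaker) similarities so the
    # negatives contribute more strongly to the loss.
    n = sim.size(0)
    weights = torch.full((n, n), neg_weight, device=sim.device)
    weights.fill_diagonal_(1.0)

    # Diagonal entries are the positives; cross-entropy implements InfoNCE.
    labels = torch.arange(n, device=sim.device)
    return F.cross_entropy(sim * weights, labels)


# Example: a batch of 32 utterances through a hypothetical speaker encoder.
z_clean = torch.randn(32, 192)  # embeddings of clean segments
z_aug = torch.randn(32, 192)    # embeddings of noise/reverb-augmented segments
loss = weighted_contrastive_loss(z_clean, z_aug)
```

In this sketch, up-weighting only the off-diagonal terms leaves the positive pairs untouched while strengthening the repulsion between embeddings of different utterances, which is one plausible way to realize the weighted contrastive loss the abstract describes.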