paper_authors: Fangyuan Wang, Ming Hao, Yuhai Shi, Bo Xu
for: This paper aims to improve the conventional recipe for Automatic Speech Recognition (ASR) models by rethinking and updating the early stopping and checkpoint averaging methods from the perspective of the bias-variance tradeoff.
methods: The proposed method, called Approximated Bias-Variance Tradeoff (ApproBiVT), uses the training loss and validation loss as proxies of bias and variance to guide the early stopping and checkpoint averaging.
results: When evaluated on the AISHELL-1 and AISHELL-2 datasets, the proposed recipe provided a CER reduction of 2.5%-3.7% and 3.1%-4.6%, respectively, compared to the conventional recipe.
Abstract
The conventional recipe for Automatic Speech Recognition (ASR) models is to 1) train multiple checkpoints on a training set, relying on a validation set for early stopping to prevent overfitting, and 2) average the last several checkpoints, or those with the lowest validation losses, to obtain the final model. In this paper, we rethink and update early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff. Theoretically, bias and variance represent the fitness and variability of a model, and their tradeoff determines the overall generalization error; however, it is impractical to evaluate them precisely. As an alternative, we take the training loss and validation loss as proxies of bias and variance and guide early stopping and checkpoint averaging by their tradeoff, namely an Approximated Bias-Variance Tradeoff (ApproBiVT). When evaluated with advanced ASR models, our recipe yields 2.5%-3.7% and 3.1%-4.6% CER reductions on AISHELL-1 and AISHELL-2, respectively.
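The abstract does not give the exact ApproBiVT formulation, so the following is only a minimal, hypothetical sketch of the general idea: score each checkpoint with a simple proxy for the bias-variance tradeoff (here, the sum of its training and validation loss, an assumption) and average the parameters of the best-scoring checkpoints. The function name, record fields, and scoring rule are all illustrative, not the paper's implementation.

```python
import torch

def approx_bivt_average(checkpoint_records, num_to_average=10):
    """Sketch: rank checkpoints by a train+valid loss proxy and average the best ones."""
    # checkpoint_records: list of dicts such as
    #   {"path": "ckpt_epoch42.pt", "train_loss": 3.1, "valid_loss": 3.4}
    scored = sorted(
        checkpoint_records,
        key=lambda r: r["train_loss"] + r["valid_loss"],  # assumed tradeoff proxy
    )
    selected = scored[:num_to_average]

    averaged = None
    for record in selected:
        state = torch.load(record["path"], map_location="cpu")  # plain state_dict assumed
        if averaged is None:
            averaged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                averaged[k] += v.float()
    for k in averaged:
        averaged[k] /= len(selected)
    return averaged
```

The same proxy score could also drive early stopping, e.g. by halting training when it has not improved for a fixed number of epochs.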
A Systematic Exploration of Joint-training for Singing Voice Synthesis
results: Extensive experiments on multiple datasets show that the proposed joint-training strategy achieves more stable performance than the baselines while also improving the interpretability of the entire framework.
Abstract
There has been growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in TTS systems by joint-training or by replacing acoustic features with a latent representation, adopting the corresponding approaches for SVS is not straightforward, and how to improve the joint-training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform joint-training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms the baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.
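The abstract does not spell out the training objective, but the basic idea of jointly optimizing an acoustic model and a vocoder can be illustrated as follows. The model interfaces, the L1 losses, and the weighted-sum objective are assumptions for illustration only; the paper's actual joint-training strategy may differ.

```python
import torch

def joint_training_step(acoustic_model, vocoder, batch, optimizer,
                        feat_weight=1.0, wav_weight=1.0):
    """Sketch of one joint-training step: both modules are updated from a combined loss,
    and the vocoder is conditioned on *predicted* features rather than ground truth,
    so the feature-domain gap between the two models is closed during training."""
    optimizer.zero_grad()

    pred_feats = acoustic_model(batch["score"])                 # e.g. mel-spectrogram prediction
    feat_loss = torch.nn.functional.l1_loss(pred_feats, batch["target_feats"])

    pred_wav = vocoder(pred_feats)                              # vocoder sees predicted features
    wav_loss = torch.nn.functional.l1_loss(pred_wav, batch["target_wav"])

    loss = feat_weight * feat_loss + wav_weight * wav_loss      # assumed weighted-sum objective
    loss.backward()
    optimizer.step()
    return loss.item()
```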
Bootstrapping Contrastive Learning Enhanced Music Cold-Start Matching
results: Extensive experiments on an offline dataset and an online system demonstrate the effectiveness and efficiency of the method. In addition, the method has been deployed on NetEase Cloud Music, affecting millions of users.
Abstract
We study a particular matching task that we call Music Cold-Start Matching. In short, given a cold-start song request, we aim to retrieve songs with similar audiences and then quickly push the cold-start song to the audiences of the retrieved songs to warm it up. However, this task has hardly been studied. Therefore, in this paper, we formalize the problem of Music Cold-Start Matching in detail and propose a scheme for it. During offline training, we attempt to learn high-quality song representations based on song content features, but we find that the supervision signals typically follow a power-law distribution, causing skewed representation learning. To address this issue, we propose a novel contrastive learning paradigm named Bootstrapping Contrastive Learning (BCL) that enhances the quality of the learned representations by exerting contrastive regularization. During online serving, to locate the target audiences more accurately, we propose Clustering-based Audience Targeting (CAT), which clusters audience representations to acquire a few cluster centroids and then locates the target audiences by measuring the relevance between the audience representations and the cluster centroids. Extensive experiments on the offline dataset and the online system demonstrate the effectiveness and efficiency of our method. We have deployed it on NetEase Cloud Music, where it affects millions of users. Code will be released in the future.
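The CAT step is only described at a high level, so here is a minimal sketch of the idea under stated assumptions: k-means is assumed for clustering and cosine similarity is assumed as the relevance measure, and the function name and shapes are hypothetical. The actual paper may use different clustering and scoring choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_based_audience_targeting(seed_audience_embs, candidate_audience_embs,
                                        num_clusters=16, top_k=1000):
    """Sketch of CAT: cluster the seed audience representations into a few centroids,
    score each candidate audience by its best cosine similarity to any centroid,
    and keep the top-scoring candidates as the target audience."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(seed_audience_embs)
    centroids = kmeans.cluster_centers_                                   # (num_clusters, dim)

    # Cosine similarity between every candidate and every centroid (assumed relevance measure).
    cand = candidate_audience_embs / (np.linalg.norm(candidate_audience_embs, axis=1, keepdims=True) + 1e-8)
    cent = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8)
    relevance = (cand @ cent.T).max(axis=1)                               # best-matching centroid per candidate

    return np.argsort(-relevance)[:top_k]                                 # indices of targeted audiences
```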
Self-Distillation Network with Ensemble Prototypes: Learning Robust Speaker Representations without Supervision
results: Extensive experiments on the VoxCeleb datasets achieve a new SOTA (i.e., equal error rates of 1.94%, 1.99%, and 3.77%) without using any labeled data during training.
Abstract
Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. Previous studies have noted a substantial performance disparity between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the SDEP framework in speaker verification. SDEP achieves a new SOTA on Voxceleb1 speaker verification evaluation benchmark ( i.e., equal error rate 1.94\%, 1.99\%, and 3.77\% for trial Vox1-O, Vox1-E and Vox1-H , respectively), discarding any speaker labels in the training phase. Code will be publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
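The abstract does not detail the SDEP architecture, so the following is only a generic, hypothetical sketch of prototype-based self-distillation (in the spirit of DINO-style training), not the paper's exact method: a student and a momentum teacher embed two views of the same utterance, both are scored against a shared set of learnable prototypes, and the student is trained to match the teacher's assignment. All names, shapes, and the single-prototype-set simplification are assumptions.

```python
import torch
import torch.nn.functional as F

def self_distillation_prototype_loss(student_emb, teacher_emb, prototypes, temperature=0.1):
    """Sketch: cross-entropy between teacher and student prototype assignments.
    student_emb, teacher_emb: (batch, dim); prototypes: (num_prototypes, dim)."""
    student_logits = F.normalize(student_emb, dim=-1) @ F.normalize(prototypes, dim=-1).T
    teacher_logits = F.normalize(teacher_emb, dim=-1) @ F.normalize(prototypes, dim=-1).T

    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)   # teacher gives targets, no gradient
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```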