cs.SD - 2023-08-12

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.06547
  • repo_url: None
  • paper_authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan
  • for: Improving the performance of automatic speech recognition in semi-supervised learning when labeled data is scarce
  • methods: Proposes a novel alternative pseudo-labeling framework comprising a generalized CTC loss function, a confidence-based error detection method, and an automatic thresholding method
  • results: Compared with the standard CTC loss and prior confidence-based error detection, the proposed alternative pseudo-labeling framework handles pseudo-labels containing incorrect tokens more effectively and requires no manual threshold tuning
    Abstract When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. First, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting incorrect tokens in the predicted pseudo-labels. In this work, we adopt a confidence-based error detection method that identifies incorrect tokens by comparing their confidence scores with a given threshold, which requires the confidence scores to be discriminative. Hence, the second proposed technique is a contrastive CTC loss function that widens the confidence gap between correctly and incorrectly predicted tokens, thereby improving error detection. Additionally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, sparing the pain of manual tuning.
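
    The abstract describes the detection-and-thresholding procedure but gives no pseudocode, so the following is a minimal Python sketch of confidence-based error detection with automatic thresholding, assuming per-token confidence scores are available from the CTC model; the function names and the accuracy criterion used to pick the threshold are illustrative guesses, not the paper's specification.

```python
import numpy as np

def pick_threshold(confidences, is_correct):
    """Automatic thresholding: use *labeled* data as a proxy and keep the
    candidate threshold that best separates correctly from incorrectly
    predicted tokens, instead of tuning the threshold by hand."""
    best_t, best_acc = 0.5, 0.0
    for t in np.unique(confidences):
        flagged = confidences < t              # tokens flagged as incorrect
        acc = np.mean(flagged == ~is_correct)  # detection accuracy
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def flag_pseudo_label_errors(token_confidences, threshold):
    """On unlabeled data, mark pseudo-label positions whose confidence falls
    below the threshold; the generalized CTC loss would accept alternative
    tokens at these positions rather than forcing the predicted token."""
    return [c < threshold for c in token_confidences]

# Hypothetical usage on a small labeled dev set:
conf = np.array([0.95, 0.40, 0.88, 0.30, 0.99])
correct = np.array([True, False, True, False, True])
t = pick_threshold(conf, correct)
print(t, flag_pseudo_label_errors([0.92, 0.35, 0.70], t))
```

    The contrastive CTC loss would then push the two confidence populations further apart, making any such threshold more reliable.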

BigWavGAN: A Wave-To-Wave Generative Adversarial Network for Music Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.06483
  • repo_url: None
  • paper_authors: Yenan Zhang, Hiroshi Watanabe
  • for: Improving the performance of Deep Neural Networks (DNNs) in music Super-Resolution (SR)
  • methods: Combines a large-scale wave-to-wave DNN (Demucs) with State-Of-The-Art (SOTA) discriminators and adversarial training strategies; the discriminator consists of a Multi-Scale Discriminator (MSD) and a Multi-Resolution Discriminator (MRD)
  • results: BigWavGAN outperforms both the baseline model and the SOTA music SR model on the music SR task, and generalizes better to out-of-distribution data
    Abstract Generally, Deep Neural Networks (DNNs) are expected to perform better as their model size grows. However, large models fail to produce high-quality results commensurate with their scale in music Super-Resolution (SR). We attribute this to the fact that DNNs cannot learn information commensurate with their size from the standard mean squared error loss. To unleash the potential of large DNN models in music SR, we propose BigWavGAN, which combines Demucs, a large-scale wave-to-wave model, with State-Of-The-Art (SOTA) discriminators and adversarial training strategies. Our discriminator consists of a Multi-Scale Discriminator (MSD) and a Multi-Resolution Discriminator (MRD). During inference, since only the generator is utilized, no additional parameters or computational resources are required compared to the baseline model Demucs. Objective evaluation affirms the effectiveness of BigWavGAN in music SR. Subjective evaluations indicate that BigWavGAN generates music with significantly higher perceptual quality than the baseline model. Notably, BigWavGAN surpasses the SOTA music SR model in both simulated and real-world scenarios. Moreover, BigWavGAN demonstrates superior generalization to out-of-distribution data. An ablation study reveals the importance of our discriminators and training strategies. Samples are available on the demo page.
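
    As a concrete picture of the adversarial setup, here is a schematic PyTorch sketch of waveform-domain multi-scale discriminator branches trained with a hinge adversarial loss; the layer sizes, the number of scales, and the hinge formulation are assumptions for illustration, and the MRD (which judges spectrograms at multiple resolutions) is omitted.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """One waveform-domain discriminator branch, judging the signal at a
    given downsampling rate (layer sizes are illustrative)."""
    def __init__(self, rate):
        super().__init__()
        self.pool = nn.AvgPool1d(rate) if rate > 1 else nn.Identity()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 15, stride=2, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 1, 3, padding=1),            # per-frame scores
        )

    def forward(self, wav):                # wav: (batch, 1, samples)
        return self.net(self.pool(wav))

def adversarial_losses(discs, real, fake):
    """Hinge GAN losses summed over all branches; only the generator is
    kept at inference, so the branches add no deployment cost."""
    d_loss = g_loss = 0.0
    for d in discs:
        d_loss = d_loss + torch.relu(1 - d(real)).mean() \
                        + torch.relu(1 + d(fake.detach())).mean()
        g_loss = g_loss - d(fake).mean()
    return d_loss, g_loss

# Hypothetical usage with three MSD scales:
discs = [MultiScaleBranch(r) for r in (1, 2, 4)]
real, fake = torch.randn(2, 1, 8192), torch.randn(2, 1, 8192)
d_loss, g_loss = adversarial_losses(discs, real, fake)
```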

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

  • paper_url: http://arxiv.org/abs/2308.06327
  • repo_url: None
  • paper_authors: Mohammad Soleymanpour, Mahmoud Al Ismail, Fahimeh Bahmaninezhad, Kshitiz Kumar, Jian Wu
  • for: Supporting English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings
  • methods: A grapheme-unit pronunciation lexicon, a fully bilingual alignment model and bilingual streaming transformer model, a parallel encoder structure with a language identification (LID) loss, and a parallel encoder with an auxiliary loss for monolingual projections
  • results: Strong English code-mixing capability on large-scale bilingual Spanish (ES) and bilingual Italian (IT) training and test tasks, with the proposed auxiliary loss proving superior to the LID loss
    Abstract We introduce a bilingual solution to support English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments are: (a) a pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and, subsequently, a bilingual streaming transformer model, (c) a parallel encoder structure with a language identification (LID) loss, and (d) a parallel encoder with an auxiliary loss for monolingual projections. We conclude that, compared to the LID loss, our proposed auxiliary loss is superior at specializing the parallel encoders to their respective monolingual locales, which contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) on a code-mix IT task from 46.5% to 13.8%, while also achieving close parity (9.6%) with the monolingual IT model (9.5%) on IT tests.
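
    To make the parallel-encoder idea concrete, here is a schematic PyTorch sketch; the LSTM encoders stand in for the paper's streaming transformer blocks, and all dimensions and names are hypothetical.

```python
import torch
import torch.nn as nn

class ParallelBilingualEncoder(nn.Module):
    """Two locale-specific encoders whose outputs are fused for the shared
    ASR head; each encoder also feeds an auxiliary monolingual projection."""
    def __init__(self, feat_dim=80, hidden=256, en_vocab=500, it_vocab=500):
        super().__init__()
        self.enc_en = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.enc_it = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj_en = nn.Linear(hidden, en_vocab)   # auxiliary EN head
        self.proj_it = nn.Linear(hidden, it_vocab)   # auxiliary IT head

    def forward(self, feats):                        # (batch, frames, feat_dim)
        h_en, _ = self.enc_en(feats)
        h_it, _ = self.enc_it(feats)
        fused = torch.cat([h_en, h_it], dim=-1)      # input to the shared decoder
        return fused, self.proj_en(h_en), self.proj_it(h_it)

model = ParallelBilingualEncoder()
fused, en_logits, it_logits = model(torch.randn(4, 120, 80))
# The auxiliary monolingual loss would apply a CTC loss to en_logits only on
# English utterances and to it_logits only on Italian ones, specializing each
# encoder to its locale -- the behavior the paper finds superior to a LID loss.
```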