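methods: The authors use a two-step method: first, using a parallel corpus, they transform the whispered formants into their phonated equivalents with a denoising autoencoder; they then analyse the formant contours to predict phonated pitch contour variation.
results: The method is reported to be effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
Abstract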
Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
Label-Synchronous Neural Transducer for End-to-End ASR
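results: Experiments show that, compared to standard neural transducers, the proposed LS-Transducer gave a 10% relative WER reduction (WERR) on intra-domain Librispeech-100h data, and 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
Abstract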
Neural transducers provide a natural approach to streaming ASR. However, they augment output sequences with blank tokens which leads to challenges for domain adaptation using text data. This paper proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Hence blank tokens are no longer needed and the prediction network can be easily adapted using text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the streaming property. In addition, a streaming joint decoding method is designed to improve ASR accuracy. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 10% relative WER reduction (WERR) for intra-domain Librispeech-100h data, as well as 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
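The abstract does not spell out how AIF computes its weights, so the following is only a minimal sketch of the generic integrate-and-fire idea it builds on, not the paper's AIF mechanism. It assumes each encoder frame already comes with a weight in [0, 1]; the function name `integrate_and_fire`, the threshold of 1.0, and the toy inputs are all hypothetical. Frames are consumed left to right, and a label-level vector is emitted as soon as the accumulated weight reaches the threshold, which is what keeps this style of aggregation compatible with streaming.

```python
from typing import List


def integrate_and_fire(
    frames: List[List[float]],
    weights: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Collapse frame-level vectors into label-level vectors.

    Assumption: every weight lies in [0, threshold], so a single frame
    can complete at most one label.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")

    dim = len(frames[0]) if frames else 0
    accumulated = 0.0            # weight gathered for the current label
    integrated = [0.0] * dim     # weighted sum of frames for the current label
    labels: List[List[float]] = []

    for frame, weight in zip(frames, weights):
        if not 0.0 <= weight <= threshold:
            raise ValueError("each weight must lie in [0, threshold]")

        if accumulated + weight < threshold:
            # Not enough evidence yet: keep integrating.
            accumulated += weight
            integrated = [s + weight * x for s, x in zip(integrated, frame)]
            continue

        # Threshold reached: spend only the weight still needed, then fire.
        needed = threshold - accumulated
        labels.append([s + needed * x for s, x in zip(integrated, frame)])

        # The rest of this frame's weight starts the next label.
        leftover = weight - needed
        accumulated = leftover
        integrated = [leftover * x for x in frame]

    return labels


# Toy input: five 1-dimensional "encoder frames" with hand-picked weights.
frames = [[1.0], [3.0], [2.0], [4.0], [6.0]]
weights = [0.5, 0.5, 0.25, 0.75, 0.5]
print(integrate_and_fire(frames, weights))  # [[2.0], [3.5]]
```

In this toy run the first two frames fire one label and the next two fire a second; the last frame stays pending because its weight alone does not reach the threshold. Since the output has one vector per label rather than per frame, it can be combined with a prediction network output without blank tokens, which is the property the abstract highlights.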