paper_authors: Dominik Klement, Mireia Diez, Federico Landini, Lukáš Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara
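for: This paper proposes a new method for updating the VBx parameters with discriminative training, directly optimizing a predefined loss.
methods: VBx uses an HMM to model speaker turns, a generatively trained PLDA to model speaker distributions, and Bayesian inference to estimate the assignment of x-vectors to speakers. The paper adds discriminative training of the VBx parameters and a new loss.
results: Proof-of-concept results show that the method can find hyperparameters automatically, with performance comparable to that of an extensive grid search. The paper also shows that discriminative fine-tuning of the PLDA can further improve the model's performance.
Abstract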
Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework for updating the VBx parameters using discriminative training, which directly optimizes a predefined loss. We also propose a new loss that better correlates with the diarization error rate compared to binary cross-entropy, the default choice for diarization end-to-end systems. Proof-of-concept results across three datasets (AMI, CALLHOME, and DIHARD II) demonstrate the method's capability of automatically finding hyperparameters, achieving comparable performance to those found by extensive grid search, which typically requires additional hyperparameter behavior knowledge. Moreover, we show that discriminative fine-tuning of PLDA can further improve the model's performance. We release the source code with this publication.
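To make the contrast between a training loss and the evaluation metric concrete, the following is a minimal sketch of binary cross-entropy over frame-level speaker activities next to a simplified frame-level diarization error rate. It is not the loss proposed in the paper. The function names, the 0.5 threshold, the fixed speaker order (no optimal speaker mapping), and the absence of a scoring collar are all assumptions made for illustration.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean BCE over all (frame, speaker) cells.

    probs[t][s]  : predicted probability that speaker s is active in frame t
    labels[t][s] : 1 if speaker s is active in frame t in the reference, else 0
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


def frame_level_der(probs, labels, threshold=0.5):
    """Simplified DER: (miss + false alarm + confusion) / reference speech.

    Assumes hypothesis and reference speakers are already aligned
    (no permutation search) and applies no collar.
    """
    errors, ref_total = 0, 0
    for p_row, y_row in zip(probs, labels):
        hyp = [1 if p > threshold else 0 for p in p_row]
        n_ref = sum(y_row)
        n_hyp = sum(hyp)
        n_correct = sum(1 for h, y in zip(hyp, y_row) if h == 1 and y == 1)
        errors += max(n_ref, n_hyp) - n_correct
        ref_total += n_ref
    return errors / ref_total if ref_total else 0.0
```

In this sketch, two hypotheses that threshold to the same decisions receive the same error rate even when their cross-entropy values differ, which is one simple way the two quantities can diverge.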
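Summary
Bayesian HMM clustering of x-vector sequences (VBx) has become a standard baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) model for speaker distributions, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper proposes a new way of updating the VBx parameters with discriminative training that directly optimizes a predefined loss. It also proposes a new loss that correlates better with the diarization error rate than binary cross-entropy, the usual default in end-to-end diarization systems. Proof-of-concept results on three datasets (AMI, CALLHOME, and DIHARD II) show that the method can find hyperparameters automatically, matching the performance of an extensive grid search, which typically requires additional knowledge of hyperparameter behavior. Discriminative fine-tuning of the PLDA further improves the model's performance. The source code is released with the paper.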
Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
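results: Experiments show that the proposed model not only achieves more efficient inference but also performs comparably to or better than the original HuBERT model on a range of tasks.
Abstract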
Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce an SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations using the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).
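Summary
Existing self-supervised learning (SSL) models for speech usually operate at a fixed resolution of 20 milliseconds, which ignores the different information carried at different resolutions of the speech signal. This paper instead brings multi-resolution information into self-supervised speech representation learning. It introduces a hierarchical Transformer architecture, combined with HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experiments show that the proposed model achieves more efficient inference and performs comparably to or better than the original HuBERT model across tasks. In particular, it shows significant improvements in fine-tuning experiments on the LibriSpeech speech recognition benchmark and in evaluations on SUPERB and Multilingual SUPERB (ML-SUPERB).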
BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition
paper_authors: Peikun Chen, Fan Yu, Yuhao Lian, Hongfei Xue, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie
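for: To improve the accuracy of language boundary estimation in code-switching automatic speech recognition.
methods: A cross-layer language adapter and a boundary-aware training method.
results: Reduces the mixture error rate by 16.55% compared to the baseline.
Abstract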
Mixture-of-experts based models, which use language experts to extract language-specific representations effectively, have been well applied in code-switching automatic speech recognition. However, there is still substantial room for improvement, as similar pronunciation across languages may result in ineffective multi-language modeling and inaccurate language boundary estimation. To eliminate these drawbacks, we propose a cross-layer language adapter and a boundary-aware training method, namely Boundary-Aware Mixture-of-Experts (BA-MoE). Specifically, we introduce language-specific adapters to separate language-specific representations and a unified gating layer to fuse representations within each encoder layer. Second, we compute language adaptation loss of the mean output of each language-specific adapter to improve the adapter module's language-specific representation learning. Besides, we utilize a boundary-aware predictor to learn boundary representations for dealing with language boundary confusion. Our approach achieves significant performance improvement, reducing the mixture error rate by 16.55% compared to the baseline on the ASRU 2019 Mandarin-English code-switching challenge dataset.
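The idea of language-specific adapters fused by a gating layer can be sketched as follows. This is an illustrative toy and not the BA-MoE architecture itself: the bottleneck-with-residual adapter form, the softmax gate, the random initialization, and all class and function names are assumptions, and the language adaptation loss and boundary-aware predictor are omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project, residual."""

    def __init__(self, dim, bottleneck, rng):
        self.down = rand_matrix(bottleneck, dim, rng)
        self.up = rand_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return [xi + ui for xi, ui in zip(x, matvec(self.up, hidden))]


class GatedLanguageAdapters:
    """One adapter per language, fused by a softmax gate computed from the input frame."""

    def __init__(self, dim, bottleneck, languages, seed=0):
        rng = random.Random(seed)
        self.languages = list(languages)
        self.adapters = [Adapter(dim, bottleneck, rng) for _ in self.languages]
        self.gate = rand_matrix(len(self.languages), dim, rng)

    def __call__(self, x):
        weights = softmax(matvec(self.gate, x))   # one weight per language
        outputs = [adapter(x) for adapter in self.adapters]
        fused = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]
        return fused, weights
```

Calling `GatedLanguageAdapters(4, 2, ["zh", "en"])` on a 4-dimensional frame returns the fused representation together with the per-language gate weights, which sum to one.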
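Summary
Mixture-of-experts based models, which use language experts to extract language-specific representations, have been applied successfully to code-switching automatic speech recognition. However, similar pronunciations across languages can lead to ineffective multi-language modeling and inaccurate language boundary estimation. To address these drawbacks, the paper proposes a cross-layer language adapter and a boundary-aware training method, named Boundary-Aware Mixture-of-Experts (BA-MoE). Specifically, it introduces language-specific adapters to separate language-specific representations and a unified gating layer to fuse the representations within each encoder layer. It then computes a language adaptation loss on the mean output of each language-specific adapter to improve the adapters' language-specific representation learning. In addition, a boundary-aware predictor learns boundary representations to deal with language boundary confusion. The approach achieves a significant improvement, reducing the mixture error rate by 16.55% compared to the baseline on the ASRU 2019 Mandarin-English code-switching challenge dataset.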
Improving severity preservation of healthy-to-pathological voice conversion with global style tokens
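results: Listening tests show that the framework preserves the severity of the source sample while modelling the target speaker's voice. The paper also finds that (a) pathology affects x-vectors, but not all speaker information is lost, and (b) choosing source speakers based on severity labels alone is insufficient.
Abstract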
In healthy-to-pathological voice conversion (H2P-VC), healthy speech is converted into pathological speech while preserving the speaker's identity. The paper improves on a previous two-stage approach to H2P-VC in which (1) speech is first created with the appropriate severity, and (2) the speaker identity of the voice is then converted while preserving the severity of the voice. Specifically, we propose improvements to (2) by using phonetic posteriorgrams (PPG) and global style tokens (GST). Furthermore, we present a new dataset that contains parallel recordings of pathological and healthy speakers with the same identity, which allows more precise evaluation. Listening tests by expert listeners show that the framework preserves the severity of the source sample while modelling the target speaker's voice. We also show that (a) pathology impacts x-vectors but not all speaker information is lost, and (b) choosing source speakers based on severity labels alone is insufficient.
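Summary
Healthy-to-pathological voice conversion (H2P-VC) converts healthy speech into pathological speech while preserving the speaker's identity. This paper improves on an earlier two-stage approach, in which the first stage creates speech with the appropriate severity and the second stage converts the speaker identity while preserving that severity. Specifically, it improves the second stage by using phonetic posteriorgrams (PPG) and global style tokens (GST). It also presents a new dataset of parallel recordings of pathological and healthy speakers with the same identity, which allows more precise evaluation. Listening tests by expert listeners show that the framework preserves the severity of the source sample while modelling the target speaker's voice. The paper further shows that (a) pathology affects x-vectors, but not all speaker information is lost, and (b) choosing source speakers based on severity labels alone is insufficient.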
Shaping the Epochal Individuality and Generality: The Temporal Dynamics of Uncertainty and Prediction Error in Musical Improvisation
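methods: This study uses the HBSL model to analyze 456 Jazz improvisations, spanning 1905 to 2009, from 78 Jazz musicians.
results: The study finds distinctive, era-specific temporal patterns of surprise and uncertainty, especially in pitch and pitch-rhythm sequences, which may reflect the cultural and emotional character of each era. In contrast, rhythm sequences show a consistent degree of uncertainty across eras.
Abstract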
Musical improvisation, much like spontaneous speech, reveals intricate facets of the improviser's state of mind and emotional character. However, the specific musical components that reveal such individuality remain largely unexplored. Within the framework of the brain's statistical learning and predictive processing, this study examined the temporal dynamics of uncertainty and surprise (prediction error) in a piece of musical improvisation. This study employed the HBSL model to analyze a corpus of 456 Jazz improvisations, spanning 1905 to 2009, from 78 distinct Jazz musicians. The results indicated distinctive temporal patterns of surprise and uncertainty, especially in pitch and pitch-rhythm sequences, revealing era-specific features from the early 20th to the 21st centuries. Conversely, rhythm sequences exhibited a consistent degree of uncertainty across eras. Further, the acoustic properties remained unchanged across different periods. These findings highlight the importance of how the temporal dynamics of surprise and uncertainty in improvisational music change over periods, profoundly influencing the distinctive methodologies artists adopt for improvisation in each era. Further, it is suggested that the development of improvisational music can be attributed to the brain's adaptive statistical learning mechanisms, which constantly refine internal models to mirror the cultural and emotional nuances of their respective epochs. This study unravels the evolutionary trajectory of improvisational music and highlights the nuanced shifts artists employ to resonate with the cultural and emotional landscapes of their times.
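As a rough illustration of the two quantities named above, surprise can be taken as the information content of the event that actually occurred, and uncertainty as the Shannon entropy of the predictive distribution before that event. The sketch below uses a first-order Markov (bigram) model with add-alpha smoothing as a simple stand-in. It is not the HBSL model used in the study, and the function names, the smoothing constant, and the toy note alphabet are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sequences, alphabet, alpha=1.0):
    """Estimate P(next | previous) from sequences with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + alpha * len(alphabet)
        model[prev] = {s: (counts[prev][s] + alpha) / total for s in alphabet}
    return model


def surprise(model, prev, nxt):
    """Information content (bits) of the event that actually followed `prev`."""
    return -math.log2(model[prev][nxt])


def uncertainty(model, prev):
    """Shannon entropy (bits) of the predictive distribution after `prev`."""
    return -sum(p * math.log2(p) for p in model[prev].values())


def profile(model, seq):
    """Per-transition (uncertainty, surprise) pairs along one sequence."""
    return [
        (uncertainty(model, prev), surprise(model, prev, nxt))
        for prev, nxt in zip(seq, seq[1:])
    ]
```

Tracking these pairs along a sequence gives a temporal profile of uncertainty and surprise; comparing such profiles across corpora from different periods is the kind of analysis the abstract describes, here under a much simpler model.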
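Summary
Musical improvisation, much like spontaneous speech, reveals the improviser's state of mind and emotional character. However, the specific musical components that reveal such individuality remain largely unexplored. Working within the framework of the brain's statistical learning and predictive processing, this study examines the temporal dynamics of uncertainty and surprise (prediction error) in musical improvisation. It uses the HBSL model to analyze 456 Jazz improvisations by 78 Jazz musicians, spanning 1905 to 2009. The results show era-specific patterns of uncertainty and surprise, especially in pitch and pitch-rhythm sequences, whereas rhythm sequences show a consistent level of uncertainty. The acoustic properties also remain unchanged across periods. These findings indicate that the temporal dynamics of improvised music change across eras, with a profound influence on the improvisation methods artists adopt. They also suggest that the brain's adaptive statistical learning mechanisms contribute to the development of improvisational music by continually refining internal models to reflect the cultural and emotional character of each era. The study traces the evolution of improvisational music and highlights the subtle shifts artists make to resonate with the cultural and emotional landscape of their times.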