paper_authors: Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba
for: The paper proposes a unified multilingual front-end system that addresses pronunciation-related tasks in Text-to-Speech, which are typically handled by separate modules.
methods: Prediction of Grapheme-to-Phoneme relationships and language-specific rule systems, addressing homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization.
results: Experiments show that the unified multilingual front-end is competitive across languages and tasks, with some trade-offs compared to equivalent monolingual solutions.
Abstract
Phonetic information and linguistic knowledge are an essential component of a Text-to-Speech (TTS) front-end. Given a language, a lexicon can be collected offline, and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation for out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task, typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
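For context, the following is a minimal, purely illustrative sketch of the conventional modular pipeline that such a unified front-end replaces: lexicon lookup, a G2P fallback for OOV words, and a post-lexical rule pass. The lexicon entries, phone set, and the single toy rule are invented for illustration and are not the authors' system.

```python
# Toy modular TTS front-end: lexicon lookup -> G2P fallback -> post-lexical rules.
# All entries and rules below are hypothetical placeholders.

LEXICON = {"read": ["r", "iy", "d"], "the": ["dh", "ah"]}   # offline pronunciation lexicon
VOWELS = {"a", "e", "i", "o", "u", "iy", "ah"}

def g2p_fallback(word):
    """Stand-in for a trained G2P model predicting phonemes for out-of-vocabulary words."""
    return list(word)   # placeholder; real systems use a neural sequence-to-sequence model

def front_end(words):
    prons = [LEXICON[w] if w in LEXICON else g2p_fallback(w) for w in words]
    # Post-lexical phonology: toy rule turning "the" /dh ah/ into /dh iy/ before a vowel-initial word.
    for i in range(len(prons) - 1):
        if prons[i] == ["dh", "ah"] and prons[i + 1][0] in VOWELS:
            prons[i] = ["dh", "iy"]
    return prons

print(front_end(["read", "the", "ukulele"]))
```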
Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings
results: Across different languages and amounts of data, our approach reduces the phone error rate of G2P systems and learns pronunciations for out-of-vocabulary words.
Abstract
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings. Our approach bootstraps a G2P model with a small set of annotated examples. The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings into a phonetic representation. Given hypothesized phoneme labels, we learn pronunciation dictionaries for out-of-vocabulary words, and we use those to re-train the G2P system. Results indicate that our approach consistently reduces the phone error rate of G2P systems across languages and amounts of available data.
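A toy, self-contained sketch of the bootstrapping loop described above: start from a small seed lexicon, harvest pronunciations for OOV words from (here, hard-coded) phone sequences standing in for decoded recordings, and retrain the G2P. The G2P below is a trivial per-character lookup learned from the lexicon, purely for illustration; the real system uses neural G2P and multilingual phone recognition.

```python
# Bootstrapping sketch: seed lexicon -> decoded OOV pronunciations -> retrained G2P.
from collections import defaultdict

def train_g2p(lexicon):
    """Toy G2P: learn a most-frequent grapheme->phone map from naively aligned entries."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, phones in lexicon.items():
        if len(word) == len(phones):                # naive 1:1 grapheme-phone alignment
            for g, p in zip(word, phones):
                counts[g][p] += 1
    table = {g: max(ps, key=ps.get) for g, ps in counts.items()}
    return lambda word: [table.get(g, g) for g in word]

seed_lexicon = {"cat": ["k", "a", "t"], "dog": ["d", "o", "g"]}
# Stand-in for phone sequences hypothesized by the phone recognizer on OOV words:
decoded_oov = {"cot": ["k", "o", "t"], "tag": ["t", "a", "g"]}

g2p = train_g2p(seed_lexicon)                       # bootstrap from annotated examples
g2p = train_g2p({**seed_lexicon, **decoded_oov})    # re-train with learned pronunciations
print(g2p("got"))                                   # -> ['g', 'o', 't']
```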
All-In-One Metrical And Functional Structure Analysis With Neighborhood Attentions on Demixed Audio
results: The model achieves state-of-the-art performance on all four tasks on the Harmonix Set while keeping a relatively low parameter count; an ablation study shows that jointly learning beats, downbeats, and segments improves performance, with each task benefiting the others.
Abstract
Music is characterized by complex hierarchical structures. Developing a comprehensive model to capture these structures has been a significant challenge in the field of Music Information Retrieval (MIR). Prior research has mainly focused on addressing individual tasks for specific hierarchical levels, rather than providing a unified approach. In this paper, we introduce a versatile, all-in-one model that jointly performs beat and downbeat tracking as well as functional structure segmentation and labeling. The model leverages source-separated spectrograms as inputs and employs dilated neighborhood attentions to capture long-term temporal dependencies, along with non-dilated attentions for local instrumental dependencies. Consequently, the proposed model achieves state-of-the-art performance in all four tasks on the Harmonix Set while maintaining a lower number of parameters compared to recent state-of-the-art models. Furthermore, our ablation study demonstrates that the concurrent learning of beats, downbeats, and segments can lead to enhanced performance, with each task mutually benefiting from the others.
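A minimal NumPy sketch of 1D dilated neighborhood attention, the core mechanism referenced above: each time step attends only to a small window of neighbours spaced `dilation` frames apart, which extends the receptive field without full self-attention. Shapes, kernel size, and the edge handling (shrinking the window at boundaries) are illustrative simplifications, not the paper's exact layer.

```python
import numpy as np

def neighborhood_attention_1d(q, k, v, kernel=3, dilation=1):
    """q, k, v: (T, d). Each step attends to `kernel` neighbours spaced `dilation` apart."""
    T, d = q.shape
    half = kernel // 2
    out = np.zeros_like(v)
    for t in range(T):
        idx = [t + dilation * o for o in range(-half, half + 1)]
        idx = [i for i in idx if 0 <= i < T]          # clip the window at sequence edges
        scores = q[t] @ k[idx].T / np.sqrt(d)         # similarities to the local neighbourhood
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ v[idx]                     # weighted sum of neighbouring values
    return out

x = np.random.randn(16, 8)                            # e.g. 16 frames of 8-dim features
print(neighborhood_attention_1d(x, x, x, kernel=3, dilation=4).shape)   # (16, 8)
```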
Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism
paper_authors: Rimita Lahiri, Tiantian Feng, Rajat Hebbar, Catherine Lord, So Hyun Kim, Shrikanth Narayanan
for: automatic child-adult speaker classification in child-inclusive spoken interactions
methods: pre-training with child-inclusive interactions and self-supervision algorithms (Wav2vec 2.0 and WavLM) with a contrastive loss objective
results: 9-13% relative improvement over the state-of-the-art baseline in classification F1 scores on two clinical interaction datasets involving children with Autism, with analysis of pre-training under different conditions based on demographic factors.
Abstract
We address the problem of detecting who spoke when in child-inclusive spoken interactions, i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity, e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child speech on the child-adult classification performance. We pre-train our model with child-inclusive interactions, following two recent self-supervision algorithms, Wav2vec 2.0 and WavLM, with a contrastive loss objective. We report a 9-13% relative improvement over the state-of-the-art baseline in classification F1 scores on two clinical interaction datasets involving children with Autism. We also analyze the impact of pre-training under different conditions by evaluating our model on interactions involving different subgroups of children based on various demographic factors.
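The sketch below shows only the downstream step: using a WavLM checkpoint as a binary child/adult classifier with Hugging Face transformers. The checkpoint name, labels, and dummy audio are assumptions; the paper's actual contribution, continued Wav2vec 2.0 / WavLM pre-training on child-inclusive speech with a contrastive loss, happens before this stage and is not reproduced here.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

ckpt = "microsoft/wavlm-base-plus"                     # assumed public checkpoint
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = AutoModelForAudioClassification.from_pretrained(
    ckpt, num_labels=2,
    label2id={"child": 0, "adult": 1}, id2label={0: "child", 1: "adult"},
)

waveform = torch.randn(16000)                          # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, 2) child vs. adult scores
print(model.config.id2label[int(logits.argmax(-1))])
```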
Pre-training End-to-end ASR Models with Augmented Speech Samples Queried by Text
paper_authors: Eric Sun, Jinyu Li, Jian Xue, Yifan Gong
for: Improving language expansion for end-to-end speech recognition systems
methods: Generating augmented pre-training samples from unpaired speech feature segments and text data, at low cost and without requiring additional speech data
results: Pre-training an Italian Transformer transducer model on the augmented data mixed with the original transcribed data yields an 8.7% relative word error rate reduction and matches a model pre-trained on multilingual raw speech data; merging the augmented data with the multilingual data for pre-training yields a 12.2% relative word error rate reduction.
Abstract
In end-to-end automatic speech recognition systems, one of the difficulties for language expansion is the limited amount of paired speech and text training data. In this paper, we propose a novel method to generate augmented samples with unpaired speech feature segments and text data for model pre-training, which has the advantage of low cost without using additional speech data. When mixing 20,000 hours of augmented speech data generated by our method with 12,500 hours of original transcribed speech data for Italian Transformer transducer model pre-training, we achieve an 8.7% relative word error rate reduction. The pre-trained model achieves similar performance to the model pre-trained with 75,000 hours of multilingual transcribed raw speech data. When merging the augmented speech data with the multilingual data to pre-train a new model, we achieve an even larger relative word error rate reduction of 12.2% over the baseline, which further verifies the effectiveness of our method for speech data augmentation.
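A toy sketch of the augmentation idea: build pseudo-paired training examples by querying a bank of speech feature segments with text tokens and concatenating the retrieved segments. The segment bank, tokens, and feature dimensions are invented for illustration; the real system operates on much larger unpaired corpora and model-ready feature segments.

```python
import random
import numpy as np

# Unpaired resources: text-only sentences and a bank of feature segments per token
# (hypothetical stand-ins for segments harvested from existing speech features).
segment_bank = {
    "hello": [np.random.randn(12, 80)],                       # (frames, mel bins)
    "world": [np.random.randn(9, 80), np.random.randn(11, 80)],
}
text_corpus = [["hello", "world"], ["world", "hello"]]

def make_augmented_sample(tokens):
    segs = [random.choice(segment_bank[tok]) for tok in tokens]   # query segments by text
    features = np.concatenate(segs, axis=0)                       # pseudo utterance features
    return features, tokens                                       # paired sample for pre-training

feats, text = make_augmented_sample(text_corpus[0])
print(feats.shape, text)    # e.g. (21, 80) ['hello', 'world'] for these toy shapes
```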