cs.SD - 2023-09-24

Cross-modal Alignment with Optimal Transport for CTC-based ASR

  • paper_url: http://arxiv.org/abs/2309.13650
  • repo_url: None
  • paper_authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
  • for: To improve the accuracy of CTC-based ASR by better exploiting the linguistic knowledge in a language model (LM).
  • methods: Use an optimal transport (OT) algorithm to achieve cross-modal alignment between acoustic and text features, so that the acoustic features are forced to encode context-dependent linguistic information (a minimal alignment sketch follows this entry).
  • results: On the AISHELL-1 corpus, the system achieves character error rates (CER) of 3.96% and 4.27% on the dev and test sets, corresponding to relative improvements of 28.39% and 29.42% over the baseline system.
    Abstract Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required, which destroys its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained LM (PLM) to CTC-based ASR. Since the PLM is built from text while the acoustic model is trained with speech, a cross-modal alignment is required in order to transfer the context-dependent linguistic knowledge from the PLM to acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT, which is then utilized to transform a latent acoustic representation to match the context-dependent linguistic features encoded by the PLM. Based on the alignment, the latent acoustic feature is forced to encode context-dependent linguistic information. We integrate this latent acoustic feature to build a conformer encoder-based CTC ASR system. On the AISHELL-1 corpus, our system achieves 3.96% and 4.27% character error rate (CER) on the dev and test sets, respectively, corresponding to relative improvements of 28.39% and 29.42% over the baseline conformer CTC ASR system without cross-modal knowledge transfer.
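As noted in the methods bullet above, the following is a minimal sketch of the OT-based cross-modal alignment idea: an entropy-regularized Sinkhorn solver yields a transport coupling between latent acoustic frames and context-dependent PLM token features, and a barycentric projection of the text features gives the target that the acoustic representation is trained to match. The cosine cost, uniform marginals, and MSE alignment loss are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def sinkhorn_coupling(cost, eps=0.1, n_iter=50):
    """Entropy-regularized OT coupling for a cost matrix of shape (T, N)."""
    T, N = cost.shape
    K = np.exp(-cost / eps)                            # Gibbs kernel
    a = np.full(T, 1.0 / T)                            # uniform marginal over acoustic frames
    b = np.full(N, 1.0 / N)                            # uniform marginal over text tokens
    u, v = np.ones(T), np.ones(N)
    for _ in range(n_iter):                            # Sinkhorn iterations
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    return u[:, None] * K * v[None, :]                 # transport coupling P, shape (T, N)

def ot_alignment_loss(acoustic, text):
    """acoustic: (T, d) latent acoustic frames; text: (N, d) context-dependent PLM features."""
    an = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    tn = text / np.linalg.norm(text, axis=1, keepdims=True)
    cost = 1.0 - an @ tn.T                             # cosine-distance cost
    P = sinkhorn_coupling(cost)
    # Barycentric projection: transport the linguistic features onto the acoustic frames.
    projected = (P @ text) / P.sum(axis=1, keepdims=True)
    # Penalize the mismatch so the latent acoustic feature encodes the aligned linguistic info.
    return np.mean((acoustic - projected) ** 2)

# Toy example: 120 acoustic frames vs. 20 PLM token features, both 256-dimensional.
rng = np.random.default_rng(0)
print(ot_alignment_loss(rng.normal(size=(120, 256)), rng.normal(size=(20, 256))))
```

In the paper this alignment objective would be trained jointly with the CTC loss of the conformer encoder; the snippet only shows how the coupling matrix carries linguistic features onto the acoustic frames.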

Efficient Black-Box Speaker Verification Model Adaptation with Reprogramming and Backend Learning

  • paper_url: http://arxiv.org/abs/2309.13605
  • repo_url: None
  • paper_authors: Jingyu Li, Tan Lee
  • for: To address the domain mismatch problem in DNN-based speaker verification (SV) systems and improve SV performance by adapting the model through its inputs rather than its weights.
  • methods: Adapt the pre-trained SV model by manipulating learnable inputs, inspired by adversarial reprogramming; the model is kept frozen as a black box used only in the forward pass, a lightweight network estimates the gradients for the learnable input parameters, and a two-layer backend learning module produces the final adapted speaker embedding (a minimal adaptation sketch follows this entry).
  • results: In language mismatch scenarios, the method adapts the SV system with much less computation cost and achieves performance close to or better than fully fine-tuned models.
    Abstract The development of deep neural networks (DNN) has significantly enhanced the performance of speaker verification (SV) systems in recent years. However, a critical issue that persists when applying DNN-based SV systems in practical applications is domain mismatch. To mitigate the performance degradation caused by the mismatch, domain adaptation becomes necessary. This paper introduces an approach to adapt DNN-based SV models by manipulating the learnable model inputs, inspired by the concept of adversarial reprogramming. The pre-trained SV model remains fixed and functions solely in the forward process, resembling a black-box model. A lightweight network is utilized to estimate the gradients for the learnable parameters at the input, which bypasses the gradient backpropagation through the black-box model. The reprogrammed output is processed by a two-layer backend learning module as the final adapted speaker embedding. The number of parameters involved in the gradient calculation is small in our design. With few additional parameters, the proposed method achieves both memory and parameter efficiency. The experiments are conducted in language mismatch scenarios. Using much less computation cost, the proposed method obtains close or superior performance to the fully finetuned models in our experiments, which demonstrates its effectiveness.
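As referenced in the methods bullet, here is a minimal PyTorch sketch of the black-box adaptation recipe: a frozen SV model used only in the forward pass, a learnable additive reprogramming term on the input, a lightweight surrogate that supplies gradients for that term (a straight-through-style stand-in for the paper's gradient-estimation network, whose exact form may differ), and a two-layer backend producing the adapted embedding. The toy model, dimensions, and losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlackBoxSV(nn.Module):
    """Stand-in for the pre-trained SV model; in practice this is a fixed, opaque network."""
    def __init__(self, feat_dim=80, emb_dim=192):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):                       # x: (batch, frames, feat_dim)
        return self.net(x).mean(dim=1)          # utterance-level speaker embedding

feat_dim, emb_dim, n_spk = 80, 192, 10
black_box = BlackBoxSV(feat_dim, emb_dim)
for p in black_box.parameters():
    p.requires_grad_(False)                     # frozen: forward pass only

delta = nn.Parameter(torch.zeros(1, 1, feat_dim))                 # learnable input reprogramming
surrogate = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                          nn.Linear(64, emb_dim))                 # lightweight gradient path
backend = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                        nn.Linear(128, n_spk))                    # two-layer backend module
opt = torch.optim.Adam([delta, *surrogate.parameters(), *backend.parameters()], lr=1e-3)

x = torch.randn(4, 200, feat_dim)               # toy batch of acoustic features
labels = torch.randint(0, n_spk, (4,))          # toy speaker labels for the target domain

with torch.no_grad():                           # black-box forward: no gradients flow here
    bb_emb = black_box(x + delta)
sur_emb = surrogate(x + delta).mean(dim=1)      # surrogate embedding: gradients flow to delta

# Forward value comes from the black box; gradients for delta come from the surrogate.
adapted_in = bb_emb + (sur_emb - sur_emb.detach())
loss = F.mse_loss(sur_emb, bb_emb) + F.cross_entropy(backend(adapted_in), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

The straight-through combination `bb_emb + (sur_emb - sur_emb.detach())` keeps the forward value from the black box while routing gradients through the lightweight surrogate, which is one simple way to realize gradient estimation without backpropagating through the frozen model.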

The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

  • paper_url: http://arxiv.org/abs/2309.13573
  • repo_url: None
  • paper_authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu
  • for: To tackle the practical problem of "who spoke what at when" in typical meeting scenarios, i.e., speaker-attributed ASR (SA-ASR) (an illustrative output record follows this entry).
  • methods: The challenge sets up two sub-tracks: a fixed training condition sub-track, where training data is constrained to predetermined datasets but any open-source pre-trained model may be used, and an open training condition sub-track, which allows all available data and models without limitation.
  • results: A new 10-hour test set is released for challenge ranking; the paper reports results and analysis of the submitted systems as a benchmark of the current state of speaker-attributed ASR.
    Abstract With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0), held at ASRU 2023, specifically aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" in typical meeting scenarios. We established two sub-tracks: the fixed training condition sub-track, where the training data is constrained to predetermined datasets but participants can use any open-source pre-trained model, and the open training condition sub-track, which allows the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of the submitted systems, as a benchmark to show the current state of speaker-attributed ASR.
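To make "who spoke what at when" concrete, below is a hypothetical record type for speaker-attributed ASR output. The field names and layout are illustrative only and are not the M2MeT 2.0 submission format.

```python
from dataclasses import dataclass

@dataclass
class SpeakerAttributedSegment:
    speaker_id: str       # "who": hypothesized speaker label
    start_time: float     # "when": segment start, in seconds
    end_time: float       # segment end, in seconds
    text: str             # "what": recognized transcript for the segment

# A toy speaker-attributed hypothesis for one meeting recording.
hypothesis = [
    SpeakerAttributedSegment("spk1", 0.00, 2.35, "let us review last week's action items"),
    SpeakerAttributedSegment("spk2", 2.40, 4.10, "the prototype is ready for testing"),
]
for seg in hypothesis:
    print(f"[{seg.start_time:.2f}-{seg.end_time:.2f}] {seg.speaker_id}: {seg.text}")
```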

Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

  • paper_url: http://arxiv.org/abs/2309.13509
  • repo_url: None
  • paper_authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari
  • for: To study control of voice characteristics for various-purpose speech synthesis.
  • methods: Use free-form text-conditioned generation, as in text-to-image generation, to enable intuitive and complex control of voice characteristics.
  • results: Developed Coco-Nut, a new corpus of diverse Japanese utterances with corresponding text transcriptions and free-form voice characteristics descriptions (a contrastive speech-text benchmarking sketch follows this entry).
    Abstract In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form descriptions can advance such control research. However, neither an open corpus nor a scalable method is currently available. To this end, we develop Coco-Nut, a new corpus including diverse Japanese utterances, along with text transcriptions and free-form voice characteristics descriptions. Our methodology to construct this corpus consists of 1) automatic collection of voice-related audio data from the Internet, 2) quality assurance, and 3) manual annotation using crowdsourcing. Additionally, we benchmark our corpus on the prompt embedding model trained by contrastive speech-text learning.
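As mentioned in the results bullet, the corpus is benchmarked with a prompt embedding model trained by contrastive speech-text learning. Below is a minimal CLIP-style sketch of that training objective: paired utterances and free-form voice descriptions are embedded by two encoders and pulled together with a symmetric InfoNCE loss. The stand-in encoders, feature dimensions, and temperature are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingEncoder(nn.Module):
    """Stand-in encoder: projects a feature sequence and mean-pools it into one embedding."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):                                  # x: (batch, length, in_dim)
        return F.normalize(self.proj(x).mean(dim=1), dim=-1)

speech_enc = PoolingEncoder(in_dim=80)     # e.g. log-mel frames of the utterance
text_enc = PoolingEncoder(in_dim=768)      # e.g. subword embeddings of the free-form description
temperature = 0.07

def contrastive_loss(speech, text):
    """Symmetric InfoNCE: the i-th utterance and i-th description form the positive pair."""
    logits = speech_enc(speech) @ text_enc(text).T / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy batch: 8 utterances paired with their voice-characteristics descriptions.
loss = contrastive_loss(torch.randn(8, 300, 80), torch.randn(8, 40, 768))
loss.backward()
```

A retrieval-style evaluation between the two embedding spaces would be a natural use of such a model, though the abstract does not describe the paper's exact benchmarking protocol.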