cs.SD - 2023-09-16

Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

  • paper_url: http://arxiv.org/abs/2309.09088
  • repo_url: None
  • paper_authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland
  • for: This work aims to improve vocoder quality under data-limited conditions without modifying the model architecture or adding more data.
  • methods: An auxiliary contrastive-learning task on mel-spectrograms is trained jointly with the GAN objectives to enhance utterance-level quality; the task is further extended to waveforms to improve the model's multi-modality comprehension and address discriminator overfitting (a sketch follows this entry).
  • results: The auxiliary tasks substantially improve vocoder performance in data-limited settings; the analysis indicates that the proposed design alleviates discriminator overfitting and yields higher-fidelity audio.
    Abstract Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual quality of the vocoder without modifying its architecture or adding more data. We design an auxiliary task with mel-spectrogram contrastive learning to enhance the utterance-level quality of the vocoder model under data-limited conditions. We also extend the task to include waveforms to improve the multi-modality comprehension of the model and address the discriminator overfitting problem. We optimize the additional task simultaneously with GAN training objectives. Our result shows that the tasks improve model performance substantially in data-limited settings. Our analysis based on the result indicates that the proposed design successfully alleviates discriminator overfitting and produces audio of higher fidelity.
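
A minimal sketch (not the authors' implementation) of how a mel-spectrogram contrastive term could sit alongside the usual GAN objectives; the augmentation step, encoder/projection head, and the loss weight `lambda_cl` are assumptions.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent loss between two batches of paired utterance embeddings (B, D)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # symmetric cross-entropy: each embedding should match its own pair
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# inside the training step (hypothetical modules: augment, encoder, proj_head):
# mel_a, mel_b = augment(mel), augment(mel)              # two views of the same utterance
# z_a, z_b = proj_head(encoder(mel_a)), proj_head(encoder(mel_b))
# loss = gan_generator_loss + lambda_cl * ntxent_loss(z_a, z_b)
```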

SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

  • paper_url: http://arxiv.org/abs/2309.09085
  • repo_url: None
  • paper_authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan
  • for: This paper aims to improve the accuracy and generalization of Guitar Tablature Transcription (GTT) models, since existing datasets are limited in size and scope, causing current GTT models to overfit and fail to generalize across datasets.
  • methods: The authors synthesize SynthTab, a large-scale GTT dataset, using multiple commercial acoustic and electric guitar plugins. The dataset is built on the extensive tablature collection from DadaGP and faithfully renders the specified fingerings, styles, and techniques with diverse timbre (a pre-training sketch follows this entry).
  • results: Pre-training a state-of-the-art GTT model on SynthTab improves same-dataset transcription accuracy and significantly mitigates overfitting in cross-dataset evaluation.
    Abstract Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education and entertainment. Existing datasets are limited in size and scope, causing state-of-the-art GTT models trained on such datasets to suffer from overfitting and to fail in generalization across datasets. To address this issue, we developed a methodology for synthesizing SynthTab, a large-scale guitar tablature transcription dataset using multiple commercial acoustic and electric guitar plugins. This dataset is built on tablatures from DadaGP, which offers a vast collection and the degree of specificity we wish to transcribe. The proposed synthesis pipeline produces audio which faithfully adheres to the original fingerings, styles, and techniques specified in the tablature with diverse timbre. Experiments show that pre-training state-of-the-art GTT model on SynthTab improves transcription accuracy in same-dataset tests. More importantly, it significantly mitigates overfitting problems of GTT models in cross-dataset evaluation.
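
A minimal two-stage training skeleton illustrating the pre-train-on-SynthTab, fine-tune-on-real-recordings protocol; the dataloaders, model interface, learning rates, and epoch counts are assumptions, and the SynthTab rendering pipeline itself is not shown.

```python
import torch

def run_epochs(model, loader, optimizer, epochs):
    model.train()
    for _ in range(epochs):
        for audio, tab_targets in loader:                # batch of audio + tablature labels
            optimizer.zero_grad()
            loss = model(audio, tab_targets)             # model returns its training loss
            loss.backward()
            optimizer.step()

# Stage 1: pre-train on the large synthesized SynthTab corpus
optimizer = torch.optim.Adam(gtt_model.parameters(), lr=1e-4)
run_epochs(gtt_model, synthtab_loader, optimizer, epochs=20)

# Stage 2: fine-tune on a small corpus of real guitar recordings
optimizer = torch.optim.Adam(gtt_model.parameters(), lr=1e-5)  # lower LR for fine-tuning
run_epochs(gtt_model, real_loader, optimizer, epochs=5)
```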

Music Generation based on Generative Adversarial Networks with Transformer

  • paper_url: http://arxiv.org/abs/2309.09075
  • repo_url: None
  • paper_authors: Ziyi Jiang, Yi Zhong, Ruoxue Wu, Zhenghan Chen, Xiaoxuan Liang
  • for: This work aims to improve the quality of music generated by Transformer-based autoregressive models and to reduce the impact of exposure bias.
  • methods: An adversarial loss complements the NLL objective within a GAN framework, using a pre-trained Span-BERT model as the discriminator; the Gumbel-Softmax trick provides a differentiable approximation of discrete-sequence sampling (a sketch follows this entry), and sequences are partitioned into smaller chunks to satisfy memory constraints.
  • results: Human evaluations and a newly introduced discriminative metric show that the approach outperforms a baseline trained solely on likelihood maximization.
    Abstract Autoregressive models based on Transformers have become the prevailing approach for generating music compositions that exhibit comprehensive musical structure. These models are typically trained by minimizing the negative log-likelihood (NLL) of the observed sequence in an autoregressive manner. However, when generating long sequences, the quality of samples from these models tends to significantly deteriorate due to exposure bias. To address this issue, we leverage classifiers trained to differentiate between real and sampled sequences to identify these failures. This observation motivates our exploration of adversarial losses as a complement to the NLL objective. We employ a pre-trained Span-BERT model as the discriminator in the Generative Adversarial Network (GAN) framework, which enhances training stability in our experiments. To optimize discrete sequences within the GAN framework, we utilize the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. Additionally, we partition the sequences into smaller chunks to ensure that memory constraints are met. Through human evaluations and the introduction of a novel discriminative metric, we demonstrate that our approach outperforms a baseline model trained solely on likelihood maximization.
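
A minimal sketch of the Gumbel-Softmax relaxation that lets discriminator gradients flow back through discrete token sampling; the generator/discriminator interfaces and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_tokens_differentiably(logits, token_embedding, tau=1.0):
    """logits: (B, T, V) next-token logits; token_embedding: nn.Embedding with weight (V, D)."""
    # hard=True returns one-hot samples in the forward pass while using the soft
    # distribution for gradients (straight-through estimator)
    y = F.gumbel_softmax(logits, tau=tau, hard=True)      # (B, T, V)
    # map the (approximately) one-hot samples to embeddings the discriminator
    # (a Span-BERT-style model in the paper) can consume
    return y @ token_embedding.weight                     # (B, T, D)

# training step (hypothetical names):
# fake = sample_tokens_differentiably(gen_logits, generator.token_emb)
# g_loss = nll_loss + lambda_adv * adversarial_loss(discriminator(fake))
```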

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

  • paper_url: http://arxiv.org/abs/2309.09028
  • repo_url: None
  • paper_authors: Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu
  • for: To improve speech signal quality in adverse acoustic environments.
  • methods: Pre-trained generative models (vocoders or codecs) are used to resynthesize clean, anechoic speech from degraded inputs (a sketch follows this entry).
  • results: The codec-based variant achieves superior subjective scores on both simulated and realistic recordings, with higher audio quality and reduced background noise and reverberation.
    Abstract Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. Especially by leveraging codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality, reduced background noise, and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.
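
A minimal sketch of the resynthesis idea, assuming a hypothetical robust front-end `feature_predictor` that maps noisy/reverberant audio to clean acoustic features and a frozen pretrained `vocoder`; neither name refers to the authors' actual components.

```python
import torch

@torch.no_grad()
def resynthesize(noisy_wav, feature_predictor, vocoder):
    """Regenerate clean speech instead of masking or filtering the noisy signal.

    noisy_wav:         (B, T) degraded waveform
    feature_predictor: maps degraded audio to clean acoustic features
                       (e.g. mel-spectrogram frames or codec tokens)
    vocoder:           frozen pretrained generative model that synthesizes
                       a waveform from those features
    """
    clean_features = feature_predictor(noisy_wav)   # estimate the clean representation
    clean_wav = vocoder(clean_features)             # resynthesize anechoic speech
    return clean_wav
```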

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

  • paper_url: http://arxiv.org/abs/2309.08876
  • repo_url: None
  • paper_authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
  • for: To improve the accuracy and efficiency of automatic speech recognition (ASR) while allowing the model to be trained with text-only data.
  • methods: A decoder-only architecture with simple text augmentation; encoder features compressed by CTC prediction serve as prompts that supply audio information to the decoder (a sketch follows this entry).
  • results: With text-augmentation training, the proposed model reduces word error rates over ordinary CTC by 0.3% and 1.4% on LibriSpeech test-clean and test-other, and by 2.9% and 5.0% on Switchboard and CallHome; it is more computationally efficient than comparable encoder-decoder ASR models and outperforms them in the LibriSpeech 100h and Switchboard training scenarios.
    Abstract Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR with simple text augmentation. To provide audio information, encoder features compressed by CTC prediction are used as prompts for the decoder, which can be regarded as refining CTC prediction using the decoder-only model. Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training. An experimental comparison using LibriSpeech and Switchboard shows that our proposed models with text augmentation training reduced word error rates from ordinary CTC by 0.3% and 1.4% on the LibriSpeech test-clean and test-other sets, respectively, and 2.9% and 5.0% on Switchboard and CallHome. The proposed model had an advantage in computational efficiency compared with conventional encoder-decoder ASR models with a similar parameter setup, and outperformed them on the LibriSpeech 100h and Switchboard training scenarios.
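
A minimal sketch (single utterance, no batching, assumed tensor shapes) of how CTC greedy predictions can compress encoder frames into a prompt for the decoder: blank frames and repeated labels are dropped and the surviving frames are kept as the audio prompt.

```python
import torch

def ctc_compress(encoder_out, ctc_logits, blank_id=0):
    """encoder_out: (T, D) encoder features; ctc_logits: (T, V) CTC head outputs."""
    pred = ctc_logits.argmax(dim=-1)                  # (T,) greedy CTC path
    changed = torch.ones_like(pred, dtype=torch.bool)
    changed[1:] = pred[1:] != pred[:-1]               # True at the start of each label run
    keep = (pred != blank_id) & changed               # drop blanks and repeated frames
    return encoder_out[keep]                          # (U, D) with U << T

# the compressed features act as an audio prompt, followed by text tokens
# (audio_proj, decoder, and text_embeddings are hypothetical):
# prompt = audio_proj(ctc_compress(enc_out, ctc_head(enc_out)))
# logits = decoder(torch.cat([prompt, text_embeddings], dim=0))
```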

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

  • paper_url: http://arxiv.org/abs/2309.08839
  • repo_url: None
  • paper_authors: Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao
  • for: This paper targets cross-modal retrieval between audio clips and text, a less explored setting than image-to-text retrieval.
  • methods: A novel Contrastive Latent Space Reconstruction Learning (CLSR) approach that accounts for intra-modal separability in contrastive representation learning, adopts an adaptive temperature control strategy, and embeds latent representation reconstruction modules to improve modal interaction (a sketch follows this entry).
  • results: On two audio-text datasets, CLSR outperforms several state-of-the-art methods.
    Abstract Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, has posed a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting the intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades the performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, the latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR.
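
A minimal sketch of the kind of objective CLSR combines: a symmetric audio-text contrastive term with a learnable temperature plus cross-modal latent-reconstruction terms; the module names and weighting are assumptions, and the paper's semantically guided temperature control and intra-modal terms are only approximated here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTextObjective(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.log_temp = nn.Parameter(torch.tensor(0.07).log())  # learnable temperature
        self.rec_audio = nn.Linear(dim, dim)   # reconstruct audio latent from text latent
        self.rec_text = nn.Linear(dim, dim)    # reconstruct text latent from audio latent

    def forward(self, z_audio, z_text, lambda_rec=0.5):
        a = F.normalize(z_audio, dim=-1)
        t = F.normalize(z_text, dim=-1)
        logits = a @ t.t() / self.log_temp.exp()              # (B, B) similarities
        labels = torch.arange(a.size(0), device=a.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels))
        # cross-modal latent reconstruction encourages modal interaction
        rec = F.mse_loss(self.rec_audio(z_text), z_audio) + \
              F.mse_loss(self.rec_text(z_audio), z_text)
        return contrastive + lambda_rec * rec
```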

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

  • paper_url: http://arxiv.org/abs/2309.08837
  • repo_url: None
  • paper_authors: Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao
  • for: This paper integrates graph-to-sequence modelling into an end-to-end text-to-speech framework for syntax-aware synthesis.
  • methods: A dependency parsing module parses the input text into a syntactic graph; a graph encoder extracts syntactic hidden information, which is concatenated with phoneme embeddings and fed to the alignment and flow-based decoding modules to generate the raw audio waveform (a sketch follows this entry).
  • results: Experiments on English and Mandarin show better prosodic consistency between input text and generated audio, higher subjective prosody scores, and the ability to perform voice conversion; an AI-chip operator design further accelerates the model by 5x.
    Abstract This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is experimented on two languages, English and Mandarin, using single-speaker, few samples of target speakers, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency performance between input text and generated audio, and also get higher scores in the subjective prosodic evaluation, and show the ability of voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration.
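
A minimal sketch, not the paper's architecture: a single GCN-style layer over the dependency-graph adjacency matrix, whose word-level outputs are broadcast to phoneme positions and concatenated with phoneme embeddings; the word-to-phoneme alignment and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SimpleGraphEncoder(nn.Module):
    """One GCN-style layer over a dependency graph (symmetrically normalized)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, word_feats, adj):
        # word_feats: (N, in_dim); adj: (N, N) adjacency of the dependency tree
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        norm = a_hat.sum(dim=-1).rsqrt().unsqueeze(-1)            # D^{-1/2}
        a_norm = norm * a_hat * norm.t()                          # D^{-1/2} A D^{-1/2}
        return torch.relu(self.linear(a_norm @ word_feats))       # (N, out_dim)

# usage (hypothetical word-to-phoneme alignment):
# syn = SimpleGraphEncoder(word_dim, syn_dim)(word_feats, adj)    # (N_words, syn_dim)
# syn_per_phoneme = syn[word_index_of_phoneme]                    # (N_phonemes, syn_dim)
# decoder_in = torch.cat([phoneme_emb, syn_per_phoneme], dim=-1)
```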

Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints

  • paper_url: http://arxiv.org/abs/2309.08828
  • repo_url: None
  • paper_authors: Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee
  • for: This work takes a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators.
  • methods: Deterministic attribute-to-phoneme mapping matrices, built from a universal inventory of articulatory attributes (manner and place of articulation), project attribute logits into phoneme logits, imposing knowledge-based constraints on the prediction; the phone recognizer then infers from both attribute and phoneme information (a sketch follows this entry).
  • results: In multilingual experiments over 6 languages on LibriSpeech and CommonVoice, the proposed solution outperforms conventional multilingual approaches by a relative 6.85% on average, performs much better than monolingual models, and eliminates phoneme predictions that are inconsistent with the attributes.
    Abstract We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally" across all spoken languages, referred to as speech attributes, namely manner and place of articulation. Specifically, several deterministic attribute-to-phoneme mapping matrices are constructed based on the predefined set of universal attribute inventory, which projects the knowledge-rich articulatory attribute logits, into output phoneme logits. The mapping puts knowledge-based constraints to limit inconsistency with acoustic-phonetic evidence in the integrated prediction. Combined with phoneme recognition, our phone recognizer is able to infer from both attribute and phoneme information. The proposed joint multilingual model is evaluated through phoneme recognition. In multilingual experiments over 6 languages on benchmark datasets LibriSpeech and CommonVoice, we find that our proposed solution outperforms conventional multilingual approaches with a relative improvement of 6.85% on average, and it also demonstrates a much better performance compared to monolingual model. Further analysis conclusively demonstrates that the proposed solution eliminates phoneme predictions that are inconsistent with attributes.
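
A minimal sketch of projecting articulatory-attribute logits into the phoneme space with fixed, knowledge-based mapping matrices and fusing them with the acoustic phoneme logits; the toy inventories, binary matrices, and the fusion (a simple sum in log space) are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# fixed binary maps: entry [p, a] = 1 if phoneme p carries attribute a
# (tiny toy inventories here; the paper uses universal manner/place inventories)
manner_map = torch.tensor([[1., 0.],    # e.g. /p/ -> stop
                           [0., 1.]])   # e.g. /s/ -> fricative
place_map  = torch.tensor([[1., 0.],    # e.g. /p/ -> bilabial
                           [0., 1.]])   # e.g. /s/ -> alveolar

def constrained_phoneme_logits(phoneme_logits, manner_logits, place_logits):
    """All inputs are (T, .) frame-level logits from separate classifier heads."""
    # project attribute evidence into phoneme space via the fixed maps
    manner_term = F.log_softmax(manner_logits, dim=-1) @ manner_map.t()
    place_term  = F.log_softmax(place_logits,  dim=-1) @ place_map.t()
    # phoneme hypotheses inconsistent with the attribute evidence are penalized
    return F.log_softmax(phoneme_logits, dim=-1) + manner_term + place_term
```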