cs.SD - 2023-07-04

Disentanglement in a GAN for Unconditional Speech Synthesis

  • paper_url: http://arxiv.org/abs/2307.01673
  • repo_url: https://github.com/rf5/simple-asgan
  • paper_authors: Matthew Baas, Herman Kamper
  • for: This paper addresses unconditional speech synthesis, specifically learning a disentangled latent space from which realistic speech can be generated.
  • methods: The paper proposes a generative adversarial network (GAN) called AudioStyleGAN (ASGAN), tailored to learn a disentangled latent space for speech synthesis. ASGAN builds on the StyleGAN family of image synthesis models and is trained with a modified form of adaptive discriminator augmentation that probabilistically skips discriminator updates (a rough sketch of this skipping follows the abstract below).
  • results: The paper achieves state-of-the-art results in unconditional speech synthesis on the small-vocabulary Google Speech Commands digits dataset, and it is substantially faster than existing top-performing diffusion models. The paper also demonstrates that the ASGAN model’s latent space is disentangled, and that simple linear operations in the space can be used to perform several tasks unseen during training, such as voice conversion, speech enhancement, speaker verification, and keyword classification.
    Abstract Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
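The probabilistic skipping of discriminator updates mentioned above can be illustrated with a short sketch. This is not the authors' implementation: the non-saturating logistic loss and all names (G, D, skip_prob) are assumptions made only to show the general mechanism.

```python
# Hypothetical sketch of probabilistically skipping discriminator updates in a GAN
# training step, in the spirit of ASGAN's modified adaptive discriminator augmentation.
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, z_dim, real_batch, skip_prob=0.2):
    """One training step; the discriminator update is skipped with probability skip_prob."""
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)
    fake = G(z)

    # Discriminator update, skipped probabilistically so D does not overpower G.
    # In an adaptive scheme, skip_prob itself could be adjusted from a D-overfitting heuristic.
    if torch.rand(()) >= skip_prob:
        d_loss = F.softplus(D(fake.detach())).mean() + F.softplus(-D(real_batch)).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # Generator update runs every step (non-saturating logistic loss).
    g_loss = F.softplus(-D(fake)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return g_loss.item()
```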

Pretraining Conformer with ASR or ASV for Anti-Spoofing Countermeasure

  • paper_url: http://arxiv.org/abs/2307.01546
  • repo_url: None
  • paper_authors: Yikang Wang, Hiromitsu Nishizaki, Ming Li
  • for: This paper proposes a Multi-scale Feature Aggregation Conformer (MFA-Conformer) structure for audio anti-spoofing countermeasures (CM). MFA-Conformer aggregates global and local information simultaneously, helping the CM system capture the synthetic artifacts hidden in spoofed audio.
  • methods: The paper presents a transfer learning approach in which Conformer models pretrained on ASR or ASV tasks are used to initialize the CM system and improve its robustness, combined with an aggregation scheme that fuses multi-scale features from the encoder (a rough sketch of both ideas follows the abstract below).
  • results: On the FAD clean set, the MFA-Conformer model pretrained on the ASR task achieves an EER of 0.038%, far surpassing the baseline. The transfer learning method also proves effective on pure speech segments obtained after voice activity detection.
    Abstract This paper introduces the Multi-scale Feature Aggregation Conformer (MFA-Conformer) structure for audio anti-spoofing countermeasure (CM). MFA-Conformer combines a convolutional neural network with the Transformer, allowing it to aggregate global and local information. This may benefit the anti-spoofing CM system to capture the synthetic artifacts hidden both locally and globally. In addition, given the excellent performance of MFA-Conformer on automatic speech recognition (ASR) and automatic speaker verification (ASV) tasks, we present a transfer learning method that utilizes pretrained Conformer models on ASR or ASV tasks to enhance the robustness of CM systems. The proposed method is evaluated on both Chinese and English spoofing detection databases. On the FAD clean set, the MFA-Conformer model pretrained on the ASR task achieves an EER of 0.038%, which dramatically outperforms the baseline. Moreover, experimental results demonstrate that the proposed transfer learning method on Conformer is effective on pure speech segments after voice activity detection processing.
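A rough sketch of the two ideas summarized above: aggregating the outputs of every Conformer block before pooling (multi-scale feature aggregation) and initializing the encoder from a Conformer pretrained on ASR or ASV. The module names, the simple mean pooling, and the checkpoint path are placeholders, not the paper's actual architecture.

```python
# Hypothetical MFA-style countermeasure head on top of a stack of Conformer blocks.
import torch
import torch.nn as nn

class MFACountermeasure(nn.Module):
    def __init__(self, encoder, d_model, n_blocks):
        super().__init__()
        self.encoder = encoder                        # Conformer encoder, possibly pretrained on ASR/ASV
        self.proj = nn.Linear(d_model * n_blocks, d_model)
        self.head = nn.Linear(d_model, 2)             # bona fide vs. spoofed

    def forward(self, feats):
        x, outs = feats, []
        for block in self.encoder.blocks:             # collect every block's output (multi-scale)
            x = block(x)
            outs.append(x)
        h = self.proj(torch.cat(outs, dim=-1))        # (B, T, d_model)
        h = h.mean(dim=1)                             # simple mean pooling over time
        return self.head(h)

# Transfer learning: load encoder weights from an ASR/ASV checkpoint (path is a placeholder),
# then fine-tune the whole countermeasure model on spoofing data.
# encoder.load_state_dict(torch.load("conformer_asr_pretrained.pt"), strict=False)
```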

Spatial-temporal Graph Based Multi-channel Speaker Verification With Ad-hoc Microphone Arrays

  • paper_url: http://arxiv.org/abs/2307.01386
  • repo_url: None
  • paper_authors: Yijiang Chen, Chengdong Liang, Xiao-Lei Zhang
  • for: This paper aims to improve multi-channel speaker verification in adverse acoustic environments with strong reverberation and noise, using ad-hoc microphone arrays.
  • methods: The method comprises a feature aggregation block and a channel selection block, both built on graphs. The feature aggregation block fuses speaker features across different time steps and channels with a spatial-temporal graph convolutional network (GCN), while the channel selection block discards noisy channels that may harm the system (a rough sketch of both blocks follows the abstract below).
  • results: Compared with six representative methods, the proposed approach achieves a relative equal error rate (EER) reduction of 15.39% on simulated data and 17.70% on real-world data, and its performance is robust across different signal-to-noise ratios and reverberation times.
    Abstract The performance of speaker verification degrades significantly in adverse acoustic environments with strong reverberation and noise. To address this issue, this paper proposes a spatial-temporal graph convolutional network (GCN) method for the multi-channel speaker verification with ad-hoc microphone arrays. It includes a feature aggregation block and a channel selection block, both of which are built on graphs. The feature aggregation block fuses speaker features among different time and channels by a spatial-temporal GCN. The graph-based channel selection block discards the noisy channels that may contribute negatively to the system. The proposed method is flexible in incorporating various kinds of graphs and prior knowledge. We compared the proposed method with six representative methods in both real-world and simulated environments. Experimental results show that the proposed method achieves a relative equal error rate (EER) reduction of $\mathbf{15.39\%}$ lower than the strongest referenced method in the simulated datasets, and $\mathbf{17.70\%}$ lower than the latter in the real datasets. Moreover, its performance is robust across different signal-to-noise ratios and reverberation time.
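A minimal sketch of the two graph-based blocks described in the abstract: a spatial-temporal graph convolution over (channel, time) nodes and a channel selection step that drops low-scoring channels. The adjacency construction, top-k selection, and module names are placeholders, not the authors' implementation.

```python
# Hypothetical spatial-temporal GCN aggregation and channel selection for an ad-hoc array.
import torch
import torch.nn as nn

class SpatialTemporalGCNLayer(nn.Module):
    """One graph convolution over nodes that are (channel, time) pairs."""
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x, adj):
        # x: (N, d) node features; adj: (N, N) row-normalized adjacency linking a node
        # to its temporal neighbors and to other channels at the same time step.
        return torch.relu(self.lin(adj @ x))

class ChannelSelector(nn.Module):
    """Scores each channel embedding and keeps only the top-k channels."""
    def __init__(self, d, k):
        super().__init__()
        self.score = nn.Linear(d, 1)
        self.k = k

    def forward(self, chan_emb):                      # chan_emb: (C, d), one embedding per channel
        s = self.score(chan_emb).squeeze(-1)          # (C,) channel quality scores
        keep = s.topk(min(self.k, s.numel())).indices
        return chan_emb[keep]                         # noisy channels are discarded

# Usage sketch: run the GCN over all (channel, time) nodes, average over time within each
# channel, select channels, then mean-pool the kept channels into one speaker embedding.
```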

Semantic enrichment towards efficient speech representations

  • paper_url: http://arxiv.org/abs/2307.01323
  • repo_url: None
  • paper_authors: Gaëlle Laperrière, Ha Nguyen, Sahar Ghannay, Bassam Jabaian, Yannick Estève
  • for: The goal of this study is to improve semantic extraction for a challenging spoken language understanding (SLU) task while keeping computation costs in mind.
  • methods: The study applies a specific in-domain semantic enrichment to the SAMU-XLSR model, specializing its multilingual speech representations on a small amount of transcribed data from the downstream task. Same-domain French and Italian benchmarks are used to study low-resource language portability, and the cross-domain capacities of the enriched SAMU-XLSR are also explored (a rough sketch of the specialization step follows the abstract below).
  • results: The study shows that this in-domain semantic enrichment improves semantic extraction on the downstream SLU task and benefits portability to low-resource languages.
    Abstract Over the past few years, self-supervised learned speech representations have emerged as fruitful replacements for conventional surface representations when solving Spoken Language Understanding (SLU) tasks. Simultaneously, multilingual models trained on massive textual data were introduced to encode language agnostic semantics. Recently, the SAMU-XLSR approach introduced a way to make profit from such textual models to enrich multilingual speech representations with language agnostic semantics. By aiming for better semantic extraction on a challenging Spoken Language Understanding task and in consideration with computation costs, this study investigates a specific in-domain semantic enrichment of the SAMU-XLSR model by specializing it on a small amount of transcribed data from the downstream task. In addition, we show the benefits of the use of same-domain French and Italian benchmarks for low-resource language portability and explore cross-domain capacities of the enriched SAMU-XLSR.
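A minimal sketch of what the in-domain specialization step could look like, assuming a SAMU-XLSR-style speech encoder trained to match sentence embeddings of transcripts: a short fine-tuning pass on the small transcribed downstream set with a cosine-similarity objective. The encoder and text-encoder interfaces, the loss, and the hyperparameters are all assumptions, not the paper's recipe.

```python
# Hypothetical in-domain semantic enrichment of a speech encoder on a small transcribed set.
import torch
import torch.nn.functional as F

def enrich(speech_encoder, text_encoder, loader, epochs=3, lr=1e-5):
    opt = torch.optim.AdamW(speech_encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for wav, transcript in loader:                # small in-domain transcribed set
            speech_emb = speech_encoder(wav)          # (B, d) pooled utterance embedding
            with torch.no_grad():
                text_emb = text_encoder(transcript)   # (B, d) language-agnostic sentence embedding
            # Pull the speech embedding toward the transcript's sentence embedding.
            loss = 1.0 - F.cosine_similarity(speech_emb, text_emb).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return speech_encoder
```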