cs.SD - 2023-08-11

Improving Joint Speech-Text Representations Without Alignment

  • paper_url: http://arxiv.org/abs/2308.06125
  • repo_url: None
  • paper_authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho
  • for: Investigates the cross-modal representation space in which text and speech are represented jointly, as applied to joint speech-text encoders for ASR.
  • methods: Joint speech-text encoders that merge the speech and text domains and scale to very large parameter models by training on unpaired speech and text. Prior approaches show promise but require special handling of the sequence-length mismatch between speech and text; here a consistency loss that disregards sequence length is used instead (a minimal sketch follows this entry).
  • results: Provides evidence that joint speech-text encoders naturally achieve consistent representations across modalities when sequence length is disregarded, and shows that such a consistency loss improves downstream WER in both a large-parameter monolingual and a multilingual system.
    Abstract The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
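
A minimal sketch of a length-agnostic consistency loss, assuming mean pooling over time as one way to disregard the speech/text length mismatch (the paper argues the loss can equally assume the best alignment); module names and shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def length_agnostic_consistency_loss(speech_emb: torch.Tensor,
                                     text_emb: torch.Tensor) -> torch.Tensor:
    """Consistency loss between paired speech and text encoder outputs.

    speech_emb: (batch, T_speech, d) frame-level speech encoder states
    text_emb:   (batch, T_text, d)   token-level text encoder states
    Pooling over time removes the sequence-length mismatch, so no up-sampling
    heuristic or explicit alignment model is required.
    """
    speech_vec = speech_emb.mean(dim=1)   # (batch, d)
    text_vec = text_emb.mean(dim=1)       # (batch, d)
    return F.mse_loss(speech_vec, text_vec)

# Hypothetical use inside a joint speech-text training step:
# loss = asr_loss + lam * length_agnostic_consistency_loss(speech_states, text_states)
```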

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

  • paper_url: http://arxiv.org/abs/2308.06112
  • repo_url: None
  • paper_authors: Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah
  • for: Proposes a simple visual speech recognition (VSR) approach that does not rely on labeled data to fully train or finetune the model for the target speech, so that it generalizes better beyond the training set.
  • methods: Learns a prior model that maps the latent representations of the lip sequence, produced by a robust visual speech encoder, to the corresponding latents of the paired audio, which are sufficiently invariant for effective text decoding; the predicted audio representation is then decoded to text with an off-the-shelf ASR model (see the sketch after this entry).
  • results: On the LRS3 dataset the method compares favorably with fully supervised approaches, achieving 26 WER (word error rate). Unlike existing SoTA approaches, it maintains reasonable performance on the VoxCeleb test set.
    Abstract Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
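
A rough sketch of the latent-to-latent idea: a small transformer prior maps visual-speech latents to audio-speech latents, which an off-the-shelf ASR model then decodes. The encoder, prior, ASR names, dimensions, and training loss below are hypothetical stand-ins, not the released Lip2Vec components.

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Maps visual-speech latents to the corresponding audio latents."""
    def __init__(self, dim_visual=512, dim_audio=768, depth=6, heads=8):
        super().__init__()
        self.proj = nn.Linear(dim_visual, dim_audio)
        layer = nn.TransformerEncoderLayer(d_model=dim_audio, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, visual_latents: torch.Tensor) -> torch.Tensor:
        # visual_latents: (batch, T, dim_visual) from a frozen visual speech encoder
        return self.encoder(self.proj(visual_latents))  # (batch, T, dim_audio)

# Hypothetical inference pipeline: lip video -> visual latents -> audio latents -> text
# visual_latents = visual_encoder(lip_frames)       # frozen, pre-trained VSR encoder
# audio_latents  = prior(visual_latents)            # prior trained with e.g. an L1/L2 loss
# transcript     = asr_model.decode(audio_latents)  # off-the-shelf ASR model
```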

An Autoethnographic Exploration of XAI in Algorithmic Composition

  • paper_url: http://arxiv.org/abs/2308.06089
  • repo_url: None
  • paper_authors: Ashley Noel-Hirst, Nick Bryan-Kinns
  • for: Explores how explainable AI (XAI) generative models can support music making in practice.
  • methods: An autoethnographic study of composing with the MeasureVAE generative music XAI model, whose interpretable latent dimensions were trained on Irish folk music (a toy illustration follows this entry).
  • results: The exploratory music-making workflow foregrounded musical features of the training dataset rather than features of the generative model itself, and appropriating the XAI model within an iterative workflow suggests such models can form part of a richer and more complex workflow than they were initially designed for.
    Abstract Machine Learning models are capable of generating complex music across a range of genres from folk to classical music. However, current generative music AI models are typically difficult to understand and control in meaningful ways. Whilst research has started to explore how explainable AI (XAI) generative models might be created for music, no generative XAI models have been studied in music making practice. This paper introduces an autoethnographic study of the use of the MeasureVAE generative music XAI model with interpretable latent dimensions trained on Irish folk music. Findings suggest that the exploratory nature of the music-making workflow foregrounds musical features of the training dataset rather than features of the generative model itself. The appropriation of an XAI model within an iterative workflow highlights the potential of XAI models to form part of a richer and more complex workflow than they were initially designed for.
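
To make "interpretable latent dimensions" concrete, here is a toy sketch of steering a trained VAE decoder by setting one regularized latent dimension (e.g. a note-density-like control); the decoder interface and dimension index are hypothetical, not the actual MeasureVAE API.

```python
import torch

def steer_and_decode(decoder, z: torch.Tensor, dim: int = 0, value: float = 2.0):
    """Decode a bar of music after pushing one interpretable latent dimension.

    decoder: a trained VAE decoder mapping a latent vector z to a one-bar melody.
    dim:     index of a latent dimension regularized to track a musical attribute
             (the kind of control MeasureVAE exposes, e.g. note density).
    """
    z = z.clone()
    z[:, dim] = value              # set the interpretable dimension to a target value
    with torch.no_grad():
        return decoder(z)          # decoded bar reflects the adjusted attribute

# z = torch.randn(1, latent_dim)                             # sample a starting point
# dense_bar  = steer_and_decode(decoder, z, dim=0, value= 2.0)
# sparse_bar = steer_and_decode(decoder, z, dim=0, value=-2.0)
```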

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

  • paper_url: http://arxiv.org/abs/2308.05995
  • repo_url: None
  • paper_authors: Fan Zhang, Naye Ji, Fuxing Gao, Siyuan Zhao, Zhaohan Wang, Shunman Li
  • for: Co-speech gesture generation for digital humans in the field of virtual human creation.
  • methods: "diffmotion-v2", a speech-conditional, non-autoregressive, transformer-based diffusion model; the WavLM pre-trained model extracts low-level and high-level audio information, and an adaptive layer norm architecture learns the relationship between the speech information and the accompanying gestures (a conditioning sketch follows this entry).
  • results: Subjective evaluations on the Trinity, ZEGGS, and BEAT datasets show that the model synthesizes natural, individual, and stylized full-body co-speech gestures from raw speech audio alone.
    Abstract The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classify method to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustic, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional diffusion-based and non-autoregressive transformer-based generative model with WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures only using raw speech audio, eliminating the need for complex multimodal processing and manually annotated. Firstly, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Secondly, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm the WavLM and the model's ability to synthesize natural co-speech gestures with various styles.
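
A minimal sketch of adaptive layer norm conditioning, where the LayerNorm scale and shift inside a transformer block are predicted from speech features (e.g. pooled WavLM embeddings); names and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a conditioning vector."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, T, dim)     gesture/latent tokens inside a transformer block
        # cond: (batch, cond_dim)   pooled speech features, e.g. from WavLM
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Hypothetical use inside each transformer layer of the gesture denoiser:
# h = adaln(h, wavlm_features.mean(dim=1))
```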

Advancing the study of Large-Scale Learning in Overlapped Speech Detection

  • paper_url: http://arxiv.org/abs/2308.05987
  • repo_url: None
  • paper_authors: Zhaohui Yin, Jingguang Tian, Xinhui Hu, Xinkang Xu
  • for: Overlapped speech detection (OSD) is an important part of speech applications that analyze multi-party conversations, but most existing OSD models are trained and evaluated on a specific dataset, which limits their application scenarios.
  • methods: Studies large-scale learning (LSL) for OSD: 522 hours of labeled audio in different languages and styles are collected as the large-scale dataset, and rigorous comparative experiments evaluate the effectiveness of LSL and the performance of OSD models built on different deep neural networks (a frame-classification sketch follows this entry).
  • results: LSL significantly improves the performance and robustness of OSD models, and the Conformer-based CF-OSD model with LSL is currently the best 16K single-channel OSD model, achieving F1-scores of 80.8% and 52.0% on the Alimeeting test set and the DIHARD II evaluation set, respectively.
    Abstract Overlapped Speech Detection (OSD) is an important part of speech applications involving analysis of multi-party conversations. However, Most of the existing OSD models are trained and evaluated on specific dataset, which limits the application scenarios of these models. In order to solve this problem, we conduct a study of large-scale learning (LSL) in OSD and propose a more general 16K single-channel OSD model. In our study, 522 hours of labeled audio in different languages and styles are collected and used as the large-scale dataset. Rigorous comparative experiments are designed and used to evaluate the effectiveness of LSL in OSD task and the performance of OSD models based on different deep neural networks. The results show that LSL can significantly improve the performance and robustness of OSD models, and the OSD model based on Conformer (CF-OSD) with LSL is currently the best 16K single-channel OSD model. Moreover, the CF-OSD with LSL establishes a state-of-the-art performance with a F1-score of 80.8% and 52.0% on the Alimeeting test set and DIHARD II evaluation set, respectively.
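
OSD is commonly posed as frame-level classification (non-speech / single speaker / overlapped speech) on top of an acoustic encoder; the sketch below shows that generic formulation with the encoder left abstract, since the exact CF-OSD Conformer configuration is not given here.

```python
import torch
import torch.nn as nn

class FrameOSDHead(nn.Module):
    """Frame-level OSD: classify each frame as non-speech, single-speaker, or overlap."""
    def __init__(self, encoder: nn.Module, enc_dim: int, num_classes: int = 3):
        super().__init__()
        self.encoder = encoder                 # e.g. a Conformer over 16 kHz features
        self.classifier = nn.Linear(enc_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) acoustic features
        hidden = self.encoder(feats)           # (batch, T, enc_dim)
        return self.classifier(hidden)         # (batch, T, num_classes) frame logits

# Training uses per-frame cross-entropy; evaluation reports F1 on the overlap class,
# as in the Alimeeting / DIHARD II numbers quoted above.
```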

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

  • paper_url: http://arxiv.org/abs/2308.05734
  • repo_url: https://github.com/haoheliu/AudioLDM2
  • paper_authors: Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
  • for: Proposes a framework that generates different types of audio (speech, music, and sound effects) with a single learning method, despite the differing objectives and biases of each audio type.
  • methods: Introduces a general representation of audio, the "language of audio" (LOA), into which any audio can be translated using the self-supervised pre-trained AudioMAE model; a GPT-2 model translates other modalities into LOA, and a latent diffusion model conditioned on LOA performs self-supervised audio generation learning (a pipeline sketch follows this entry).
  • results: Experiments on the major text-to-audio, text-to-music, and text-to-speech benchmarks achieve new state-of-the-art or competitive performance compared with previous approaches. Code and demo are available at https://audioldm.github.io/audioldm2.
    Abstract Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.
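
A schematic sketch of the two-stage generation pipeline described in the abstract; the objects below (gpt2_lm, latent_diffusion, vae_decoder, audiomae) are hypothetical placeholders for the released components at https://github.com/haoheliu/AudioLDM2, not their actual APIs.

```python
import torch

def generate_audio(cond_embedding: torch.Tensor, gpt2_lm, latent_diffusion,
                   vae_decoder, num_steps: int = 200):
    """Conditioning (text/phoneme/audio embedding) -> LOA -> audio.

    Stage 1: an autoregressive GPT-2 model translates the conditioning modality
             into the "language of audio" (LOA), i.e. AudioMAE-style features.
    Stage 2: a latent diffusion model conditioned on the predicted LOA denoises
             audio latents, which a VAE decoder / vocoder turns into a waveform.
    """
    loa = gpt2_lm.generate(cond_embedding)                  # (batch, T_loa, d_loa)
    latents = latent_diffusion.sample(cond=loa, steps=num_steps)
    return vae_decoder(latents)                             # waveform audio

# During training, the target LOA comes from the frozen, self-supervised AudioMAE:
# loa_target = audiomae(mel_spectrogram)
```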