cs.SD - 2023-07-12

B-CLEAN-SC: CLEAN-SC for broadband sources

  • paper_url: http://arxiv.org/abs/2307.06181
  • repo_url: None
  • paper_authors: Armin Goudarzi
  • for: Describes B-CLEAN-SC, a variation of CLEAN-SC suited to broadband sources.
  • methods: Processes frequency intervals rather than deconvolving each frequency individually, and performs the deconvolution iteration at the location of the over-frequency-averaged maximum instead of the per-frequency maximum used in standard CLEAN-SC.
  • results: On synthetic cases and real-world experiments, B-CLEAN-SC improves source reconstruction and suppresses noise, at the cost of additional memory but no additional computational effort.
    Abstract This paper presents B-CLEAN-SC, a variation of CLEAN-SC for broadband sources. Unlike CLEAN-SC, which "deconvolves" the beamforming map at each frequency individually, B-CLEAN-SC processes frequency intervals. Instead of performing a deconvolution iteration at the location of the maximum level, it performs the iteration at the location of the over-frequency-averaged maximum to improve the location estimate. The method is validated and compared to standard CLEAN-SC on synthetic cases and real-world experiments, for broadband and narrowband sources. It improves source reconstruction at low and high frequencies and suppresses noise, while increasing only memory requirements, not computational effort.
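
A minimal sketch of the core loop, assuming precomputed conventional beamforming maps for one frequency interval; the `psf` stand-in, `loop_gain`, and the per-frequency subtraction are illustrative placeholders for CLEAN-SC's actual source-coherence step, which operates on the cross-spectral matrix:

```python
import numpy as np

def psf(n_grid, peak):
    """Toy point-spread function: unit response at the peak, decaying with
    grid distance. A real implementation derives this from steering vectors."""
    return np.exp(-0.5 * np.abs(np.arange(n_grid) - peak))

def b_clean_sc(maps, n_iter=10, loop_gain=0.9):
    """maps: (n_freqs, n_grid) beamforming maps for one frequency interval."""
    n_freqs, n_grid = maps.shape
    clean = np.zeros_like(maps, dtype=float)
    dirty = maps.astype(float).copy()
    for _ in range(n_iter):
        # Key difference from standard CLEAN-SC: the deconvolution location
        # is the peak of the over-frequency-averaged map, shared across the
        # whole interval, rather than a per-frequency maximum.
        peak = int(np.argmax(dirty.mean(axis=0)))
        for f in range(n_freqs):
            amp = loop_gain * dirty[f, peak]
            clean[f, peak] += amp
            dirty[f] -= amp * psf(n_grid, peak)
    return clean, dirty
```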

Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.07421
  • repo_url: None
  • paper_authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
  • for: Improve the efficiency and scalability of speech recognition systems.
  • methods: Propose a linear-time alternative to self-attention that summarises a whole utterance with the mean over the vectors of all time steps and combines this summary with time-specific information.
  • results: Introducing Summary Mixing into state-of-the-art ASR models preserves or exceeds previous speech recognition performance while lowering training and inference time and the memory budget.
    Abstract Modern speech recognition systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but fail to consistently reach the same level of accuracy. In practice, however, the self-attention weights of trained speech recognizers take the form of a global average over time. This paper, therefore, proposes a linear-time alternative to self-attention for speech recognition. It summarises a whole utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "Summary Mixing". Introducing Summary Mixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 27% and reducing the memory budget by a factor of two.
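
A minimal sketch of the idea as described in the abstract; the branch layers (`f_local`, `f_summary`, `f_combine`) and their sizes are assumptions for illustration, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.f_local = nn.Linear(d_model, d_model)    # time-specific branch
        self.f_summary = nn.Linear(d_model, d_model)  # branch fed into the mean
        self.f_combine = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        # x: (batch, time, d_model)
        local = self.f_local(x)
        # One summary vector per utterance: a mean over all time steps, so the
        # cost is linear in utterance length instead of quadratic.
        summary = self.f_summary(x).mean(dim=1, keepdim=True).expand_as(local)
        return self.f_combine(torch.cat([local, summary], dim=-1))
```

A drop-in replacement for a self-attention block would presumably wrap this with the usual residual connection and normalisation.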

Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers

  • paper_url: http://arxiv.org/abs/2307.06090
  • repo_url: None
  • paper_authors: Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, Björn W. Schuller
  • for: Improve the performance of state-of-the-art speech emotion recognition (SER) models.
  • methods: Use large language models (LLMs) to annotate abundant speech data.
  • results: Experiments show that LLMs can aid SER annotation; performance varies across single-shot and few-shot scenarios, and augmenting existing datasets with ChatGPT-annotated samples improves results.
    Abstract Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of the limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state-of-the-art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shot scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
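
A hypothetical sketch of the annotation setup; the abstract does not give the authors' prompts, model choice, or label set, so `LABELS`, the prompt text, and the `gpt-3.5-turbo` call (2023-era `openai` client) are all assumptions for illustration:

```python
import openai  # assumes the 2023-era openai-python client

LABELS = ["angry", "happy", "neutral", "sad"]  # assumed label set

def annotate(transcript, examples=()):
    """Label one utterance transcript; `examples` enables few-shot prompting."""
    messages = [{"role": "system",
                 "content": f"Classify the speaker's emotion as one of {LABELS}. "
                            "Answer with the label only."}]
    for text, label in examples:  # empty tuple -> single-shot mode
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": transcript})
    reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return reply["choices"][0]["message"]["content"].strip().lower()
```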

Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.05956
  • repo_url: None
  • paper_authors: Wenxuan Wang, Guodong Ma, Yuke Li, Binbin Du
  • for: Improve multilingual and code-switching speech recognition while keeping computational complexity low.
  • methods: A Language-Routing Mixture of Experts (LR-MoE) network extracts language-specific representations through a Mixture of Language Experts (MLE), guided by a frame-wise language routing mechanism; a weight-shared frame-level language identification (LID) network is jointly trained as the shared pre-router of each MoE layer.
  • results: Compared with the baseline, the proposed method significantly improves multilingual and code-switching speech recognition performance at comparable computational efficiency.
    Abstract Multilingual speech recognition for both monolingual and code-switching speech is a challenging task. Recently, many works based on the Mixture of Experts (MoE) have made good progress in multilingual and code-switching ASR, but their computational complexity grows sharply with the number of supported languages. In this work, we propose a computation-efficient network named Language-Routing Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. LR-MoE extracts language-specific representations through the Mixture of Language Experts (MLE), which is guided to learn by a frame-wise language routing mechanism. The weight-shared frame-level language identification (LID) network is jointly trained as the shared pre-router of each MoE layer. Experiments show that the proposed method significantly improves multilingual and code-switching speech recognition performance over the baseline with comparable computational efficiency.
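
A minimal sketch of frame-wise language routing with a hard top-1 route per frame; the expert feed-forward blocks are placeholders, and the LID logits are assumed to come from the weight-shared frame-level LID network described above:

```python
import torch
import torch.nn as nn

class LRMoELayer(nn.Module):
    def __init__(self, d_model, n_langs):
        super().__init__()
        # One feed-forward expert per language (placeholder architecture).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_langs))

    def forward(self, x, lid_logits):
        # x: (batch, time, d_model); lid_logits: (batch, time, n_langs)
        # produced by the shared frame-level LID pre-router.
        route = lid_logits.argmax(dim=-1)  # frame-wise language decision
        out = torch.zeros_like(x)
        for lang, expert in enumerate(self.experts):
            mask = route == lang           # frames routed to this expert
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```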

SnakeSynth: New Interactions for Generative Audio Synthesis

  • paper_url: http://arxiv.org/abs/2307.05830
  • repo_url: None
  • paper_authors: Eric Easthope
  • for: Develop a lightweight, browser-based audio synthesizer that creates and controls variable-length generative sounds through real-time two-dimensional (2D) input.
  • methods: Combine audio generated by a deep generative model with real-time continuous 2D interaction gestures, with analogies to strummed, bowed, and plucked instrument controls.
  • results: A high-fidelity synthesizer that runs in the browser with real-time interactivity, in which sound length and intensity are modulated through a programmable 2D coordinate grid.
    Abstract I present "SnakeSynth," a web-based lightweight audio synthesizer that combines audio generated by a deep generative model and real-time continuous two-dimensional (2D) input to create and control variable-length generative sounds through 2D interaction gestures. Interaction gestures are touch- and mobile-compatible, with analogies to strummed, bowed, and plucked musical instrument controls. Point-and-click and drag-and-drop gestures directly control audio playback length, and I show that sound length and intensity are modulated by interactions with a programmable 2D coordinate grid. Leveraging the speed and ubiquity of browser-based audio and hardware acceleration in Google's TensorFlow.js, we generate time-varying high-fidelity sounds with real-time interactivity. SnakeSynth adaptively reproduces and interpolates between sounds encountered during model training, notably without long training times, and I briefly discuss possible futures for deep generative models as an interactive paradigm for musical expression.
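
An illustrative sketch of the 2D-grid mapping described in the abstract, written in Python rather than SnakeSynth's browser/TensorFlow.js setting; the grid size, mapping ranges, and the sine-tone stand-in for model-generated audio are all assumptions:

```python
import numpy as np

GRID = (8, 8)  # assumed programmable 2D coordinate grid

def gesture_to_playback(path, sr=22050):
    """path: sequence of (x, y) grid coordinates traced by a drag gesture."""
    path = np.asarray(path, dtype=float)
    # Horizontal extent of the gesture controls playback length (0.1 s to 2 s).
    span_x = path[:, 0].ptp() / (GRID[0] - 1)
    length_s = 0.1 + 1.9 * span_x
    # Mean vertical position controls intensity (linear gain in [0, 1]).
    gain = path[:, 1].mean() / (GRID[1] - 1)
    n = int(length_s * sr)
    # Stand-in for audio sampled from the deep generative model.
    tone = np.sin(2 * np.pi * 440 * np.arange(n) / sr)
    return gain * tone
```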