paper_authors: Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg
for: This paper provides an overview and evaluation of end-to-end ASR models on long-form audio, with a focus on three categories of models based on their core architecture.
methods: The paper evaluates Word Error Rate, maximum audio length, and real-time factor for each model on several long audio benchmarks, including Earnings-21 and 22, CORAAL, and TED-LIUM3.
results: The model using self-attention with local attention and a global token has the best accuracy, and CTC-based models are more robust and efficient than RNNT on long-form audio.
Abstract
This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length and real-time factor for each model on a variety of long audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long-form audio.
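As a concrete reference for the two evaluation metrics named above, the sketch below computes word error rate via word-level edit distance and the real-time factor as processing time divided by audio duration. It is only an illustration; function names and the toy example are not taken from the paper's evaluation code.

```python
# Minimal sketch of the two evaluation metrics above (illustrative, not the
# paper's code): WER via Levenshtein distance over words, and RTF as
# processing time divided by audio duration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF < 1 means the model transcribes faster than real time.
    return processing_seconds / audio_seconds

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.333... (1 insertion / 3 words)
print(real_time_factor(12.0, 120.0))                       # 0.1
```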
Harmony and Duality: An introduction to Music Theory
results: For these simple two-/three-voice constraints, completeness is the key property: a scale is a maximal set of tones satisfying the constraints, and completeness characterizes the scales commonly used in composition. There is a duality between scales subject to the two-voice constraint and those subject to the three-voice constraint. Finally, combining these constraint ideas yields a classification of chords.
Abstract
We develop aspects of music theory related to harmony, such as scales, chord formation and improvisation from a combinatorial perspective. The goal is to provide a foundation for this subject by deriving the basic structure from a few assumptions, rather than writing down long lists of chords/scales to memorize without an underlying principle. Our approach involves introducing constraints that limit the possible scales we can consider. For example, we may impose the constraint that two voices cannot be only a semitone apart as this is too dissonant. We can then study scales that do not contain notes that are a semitone apart. A more refined constraint avoids three voices colliding by studying scales that do not have three notes separated only by semitones. Additionally, we require that our scales are complete, which roughly means that they are the maximal sets of tones that satisfy these constraints. As it turns out, completeness as applied to these simple two/three voice constraints characterizes the types of scales that are commonly used in music composition. Surprisingly, there is a correspondence between scales subject to the two-voice constraint and those subject to the three-voice constraint. We formulate this correspondence as a duality statement that provides a way to understand scales subject to one type of constraint in terms of scales subject to the other. Finally, we combine these constraint ideas to provide a classification of chords.
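To make the combinatorial flavor concrete, the short sketch below enumerates the maximal subsets of the 12 pitch classes in which no two selected tones are a semitone apart (adjacency taken cyclically). This only illustrates the "constraint plus completeness" idea; it is not the paper's full two-/three-voice formalism.

```python
from itertools import combinations

# Illustrative sketch: treat the 12 pitch classes as Z/12 and forbid any two
# selected tones a semitone apart (cyclic adjacency). "Complete" scales are the
# maximal sets satisfying the constraint: adding any further tone violates it.
# This is only an illustration, not the paper's exact definitions.

N = 12

def ok(tones: frozenset) -> bool:
    # No tone may have its upper semitone neighbour in the set.
    return all((t + 1) % N not in tones for t in tones)

valid = [frozenset(c) for k in range(1, N + 1)
         for c in combinations(range(N), k) if ok(frozenset(c))]

maximal = [s for s in valid
           if all(not ok(s | {t}) for t in range(N) if t not in s)]

for s in sorted(maximal, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```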
Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection
results: Extensive evaluation on multiple datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrates high reliability and spoofing robustness across a wide range of voice applications.
Abstract
Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.
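The abstract does not spell out the SDC formula. Purely as an illustration of what a frame-level spectral-deviation score can look like, the sketch below flags frames whose log-magnitude spectrum departs from a local average; the deviation measure, parameters, and names are assumptions, not the paper's SDC.

```python
import numpy as np

# Illustration only: a generic frame-level spectral deviation score that flags
# frames whose log-magnitude spectrum departs from a local moving average.
# This is NOT the paper's SDC definition, just the general idea of scoring
# frame-level spectral inconsistencies.

def frame_spectral_deviation(wave: np.ndarray, n_fft: int = 512,
                             hop: int = 256, context: int = 5) -> np.ndarray:
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    spec = np.log1p(np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)))
    scores = np.empty(len(spec))
    for t in range(len(spec)):
        lo, hi = max(0, t - context), min(len(spec), t + context + 1)
        local_mean = spec[lo:hi].mean(axis=0)
        scores[t] = np.mean((spec[t] - local_mean) ** 2)
    return scores  # higher = larger deviation from the local spectral context

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)          # 1 s of noise as a stand-in signal
audio[8000:8512] *= 10.0                    # inject a localized spectral anomaly
print(np.argmax(frame_spectral_deviation(audio)))  # frame index near the anomaly
```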
Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
methods: Proposes the SynthAC framework, which leverages existing audio generative models and a commonly available text corpus to create synthetic text-audio pairs and thereby enhance text-audio representation. Specifically, the text-to-audio generation model AudioLDM is used to generate synthetic audio signals from captions of an image captioning dataset.
results: Experiments show that SynthAC benefits audio captioning methods by learning relations within the synthetic text-audio pairs, improving text-audio representation. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
Abstract
Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
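As a rough sketch of the pairing step described above, the snippet below generates audio from image captions with AudioLDM through the Hugging Face diffusers pipeline. The model id, caption list, and generation settings are illustrative assumptions; the paper's exact configuration is not given in the abstract.

```python
import scipy.io.wavfile as wavfile
import torch
from diffusers import AudioLDMPipeline

# Rough sketch of building synthetic text-audio pairs from image captions with
# AudioLDM. Model id, captions, and generation settings are assumptions for
# illustration, not details taken from the paper.

captions = [
    "a dog barking in a park while children play",
    "rain falling on a metal roof at night",
]

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2",
                                        torch_dtype=torch.float16).to("cuda")

pairs = []
for i, caption in enumerate(captions):
    audio = pipe(caption, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
    path = f"synth_{i:05d}.wav"
    wavfile.write(path, rate=16000, data=audio)  # AudioLDM generates 16 kHz audio
    pairs.append({"audio": path, "caption": caption})
```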
Scaling the time and Fourier domains to align periodically and their convolution
for: This note explains how to align a periodic signal with its Fourier transform by means of frequency or time scaling.
methods: Applies frequency or time scaling to the periodic signal to achieve the alignment.
results: The scaling-based alignment may be useful for developing new algorithms, e.g. for pitch estimation.
Abstract
This note shows how to align a periodic signal with its Fourier transform by means of frequency or time scaling. This may be useful in developing new algorithms, e.g. for pitch estimation. This note also convolves the signals; the frequency-time convolution is denoted fxt.
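For reference, such an alignment rests on the standard Fourier scaling identity (background, not a derivation taken from the note itself): if $y(t) = x(at)$ with $a \neq 0$, then

$$Y(f) = \frac{1}{|a|}\,X\!\left(\frac{f}{a}\right),$$

so stretching the time axis by a factor $a$ compresses the frequency axis by the same factor; choosing $a$ appropriately is what allows the signal's period and the spacing of its harmonics to be brought into alignment.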
Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction
results: Validated on masks estimated by three recent deep learning models (DCUnet, DCCRN, and FullSubNet), the method improves the accuracy of the estimated time-frequency masks and, consequently, the overall speech quality, as measured by PESQ improvement. The improvement is consistent across all three DNN models.
Abstract
In this paper, we present a method that further improves the speech enhancement obtained with recently introduced Deep Neural Network (DNN) models. We propose a multi-channel refinement method for time-frequency masks obtained with single-channel DNNs, which consists of an iterative Complex Gaussian Mixture Model (CGMM) based algorithm, followed by optimum spatial filtration. We validate our approach on time-frequency masks estimated with three recent deep learning models, namely DCUnet, DCCRN, and FullSubNet. We show that our method with the proposed mask refinement procedure improves the accuracy of the estimated masks, in terms of the Area Under the ROC Curve (AUC) measure, and as a consequence the overall speech quality of the enhanced speech signal, as measured by PESQ improvement, and that the improvement is consistent across all three DNN models.
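The abstract does not reproduce the CGMM update rules. For orientation, a commonly used CGMM mask-estimation formulation (which the DNN-initialized refinement here presumably resembles, though details may differ) models the $M$-channel STFT observation $\mathbf{y}_{t,f}$ per frequency as a mixture over classes $v \in \{\text{speech},\text{noise}\}$,

$$\mathbf{y}_{t,f} \sim \sum_{v} \alpha_{v,f}\,\mathcal{CN}\!\big(\mathbf{0},\,\phi_{v,t,f}\mathbf{R}_{v,f}\big),$$

and iterates the EM updates

$$\phi_{v,t,f} = \frac{1}{M}\,\mathbf{y}_{t,f}^{\mathsf{H}}\mathbf{R}_{v,f}^{-1}\mathbf{y}_{t,f},\qquad
\lambda_{v,t,f} \propto \alpha_{v,f}\,\frac{\exp\!\big(-\mathbf{y}_{t,f}^{\mathsf{H}}(\phi_{v,t,f}\mathbf{R}_{v,f})^{-1}\mathbf{y}_{t,f}\big)}{\det\!\big(\pi\,\phi_{v,t,f}\mathbf{R}_{v,f}\big)},\qquad
\mathbf{R}_{v,f} = \frac{\sum_{t}\lambda_{v,t,f}\,\mathbf{y}_{t,f}\mathbf{y}_{t,f}^{\mathsf{H}}/\phi_{v,t,f}}{\sum_{t}\lambda_{v,t,f}},$$

where $\lambda_{v,t,f}$ are the posterior time-frequency masks; the single-channel DNN masks can serve as their initialization before the refined masks are passed to the spatial filter.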
Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders
for: To improve the intelligibility of electrolaryngeal (EL) speech.
methods: Uses robust linguistic encoders together with HuBERT output features to resolve the speech-type mismatch and the speaker mismatch.
results: Achieves a 16% improvement in character error rate and a 0.83 improvement in naturalness score over the conventional framework.
Abstract
We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and naturally sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score.
HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond
results: The dataset comprises approximately 56.22 hours of humming recordings, making it the largest known humming dataset to date. It will be released on Hugging Face, together with a GitHub repository containing baseline results and evaluation code.
Abstract
This paper introduces the HumTrans dataset, which is publicly available and primarily designed for humming melody transcription. The dataset can also serve as a foundation for downstream tasks such as humming melody based music generation. It consists of 500 musical compositions of different genres and languages, with each composition divided into multiple segments. In total, the dataset comprises 1000 music segments. To collect this humming dataset, we employed 10 college students, all of whom are either music majors or proficient in playing at least one musical instrument. Each of them hummed every segment twice using the web recording interface provided by our designed website. The humming recordings were sampled at a frequency of 44,100 Hz. During the humming session, the main interface provides a musical score for students to reference, with the melody audio playing simultaneously to aid in capturing both melody and rhythm. The dataset encompasses approximately 56.22 hours of audio, making it the largest known humming dataset to date. The dataset will be released on Hugging Face, and we will provide a GitHub repository containing baseline results and evaluation codes.
Spoofing attack augmentation: can differently-trained attack models improve generalisation?
results: The performance of deep-learning-based detectors can vary substantially under different training conditions, but CM models based on graph attention networks and self-supervised learning remain robust. Training on data generated with different attack algorithms may not be sufficient on its own; spoofing attack augmentation at the algorithm level can be complementary.
Abstract
A reliable deepfake detector or spoofing countermeasure (CM) should be robust in the face of unpredictable spoofing attacks. To encourage the learning of more generalisable artefacts, rather than those specific only to known attacks, CMs are usually exposed to a broad variety of different attacks during training. Even so, the performance of deep-learning-based CM solutions is known to vary, sometimes substantially, when they are retrained with different initialisations, hyper-parameters or training data partitions. We show in this paper that the potency of spoofing attacks, also deep-learning-based, can similarly vary according to training conditions, sometimes resulting in substantial degradations to detection performance. Nevertheless, while a RawNet2 CM model is vulnerable when only modest adjustments are made to the attack algorithm, those based upon graph attention networks and self-supervised learning are reassuringly robust. The focus upon training data generated with different attack algorithms might not be sufficient on its own to ensure generalisability; some form of spoofing attack augmentation at the algorithm level can be complementary.
Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks
results: On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms both SOTA spiking auditory front-ends and conventional real-valued acoustic features in terms of classification accuracy, noise robustness, and encoding efficiency.
Abstract
Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing. However, their performance in speech processing remains limited due to the lack of an effective auditory front-end. To address this limitation, we introduce Spiking-LEAF, a learnable auditory front-end meticulously designed for SNN-based speech processing. Spiking-LEAF combines a learnable filter bank with a novel two-compartment spiking neuron model called IHC-LIF. The IHC-LIF neurons draw inspiration from the structure of inner hair cells (IHC) and they leverage segregated dendritic and somatic compartments to effectively capture multi-scale temporal dynamics of speech signals. Additionally, the IHC-LIF neurons incorporate the lateral feedback mechanism along with spike regularization loss to enhance spike encoding efficiency. On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms both SOTA spiking auditory front-ends and conventional real-valued acoustic features in terms of classification accuracy, noise robustness, and encoding efficiency.
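To make the two-compartment idea concrete, here is a generic two-compartment leaky integrate-and-fire update in which a dendritic compartment integrates the input and drives a somatic compartment with threshold and reset. Parameter names and values are illustrative; this is not the paper's exact IHC-LIF model (it omits, for example, the lateral feedback mechanism).

```python
import numpy as np

# Generic two-compartment LIF sketch: a dendritic compartment integrates the
# input and drives a somatic compartment that spikes and resets. Illustrative
# only -- not the paper's IHC-LIF neuron.

def two_compartment_lif(inputs, dt=1e-3, tau_d=20e-3, tau_s=10e-3,
                        g_c=0.5, v_th=1.0):
    v_d, v_s = 0.0, 0.0
    spikes = np.zeros(len(inputs))
    for t, x in enumerate(inputs):
        # Dendrite: leaky integration of the (filter-bank) input current.
        v_d += dt / tau_d * (-v_d + x)
        # Soma: leaky integration of the coupling current from the dendrite.
        v_s += dt / tau_s * (-v_s + g_c * v_d)
        if v_s >= v_th:          # threshold crossing -> emit spike, reset soma
            spikes[t] = 1.0
            v_s = 0.0
    return spikes

rng = np.random.default_rng(0)
current = 3.0 + rng.standard_normal(1000)   # noisy constant drive, 1 s at 1 kHz
print(int(two_compartment_lif(current).sum()), "spikes")
```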
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
paper_authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma
for: To explain the role of soft prompts in automatic speech recognition (ASR) and how they can be used to improve ASR performance.
methods: Uses soft prompt tuning to improve ASR performance and analyzes how the soft prompts behave under different conditions to deepen understanding of their role.
results: Soft prompts act as zero-shot learners that improve ASR performance and help the model adapt to noisy environments, but they are also vulnerable to malicious modifications. They serve two primary roles, content refinement and noise information enhancement, both of which improve robustness against background noise.
Abstract
Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). Our findings highlight their role as zero-shot learners in improving ASR performance but also make them vulnerable to malicious modifications. Soft prompts aid generalization but are not obligatory for inference. We also identify two primary roles of soft prompts: content refinement and noise information enhancement, which enhances robustness against background noise. Additionally, we propose an effective modification on noise prompts to show that they are capable of zero-shot learning on adapting to out-of-distribution noise environments.
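A minimal sketch of what soft prompt tuning looks like in practice, assuming a frozen speech encoder that consumes a sequence of feature vectors: a small set of learnable vectors is prepended to the input and only those vectors are trained. The prompt length, feature dimension, and placeholder backbone are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Minimal soft-prompt sketch: learnable vectors are prepended to the encoder's
# input sequence, and only those vectors are trained while the pretrained
# backbone stays frozen. Sizes and the placeholder encoder are assumptions.

class SoftPromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, prompt_len: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the pretrained backbone
            p.requires_grad_(False)
        self.prompt = nn.Parameter(torch.randn(prompt_len, feat_dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> prepend the shared soft prompt
        prompt = self.prompt.unsqueeze(0).expand(feats.size(0), -1, -1)
        return self.encoder(torch.cat([prompt, feats], dim=1))

# Placeholder backbone just to make the sketch runnable end to end.
backbone = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
model = SoftPromptedEncoder(backbone, feat_dim=80, prompt_len=16)
out = model(torch.randn(2, 100, 80))
print(out.shape)  # torch.Size([2, 116, 256])
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # only the prompt trains
```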