cs.SD - 2023-10-05

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios

  • paper_url: http://arxiv.org/abs/2310.03938
  • repo_url: None
  • paper_authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe
  • for: Improving the performance of multilingual speech recognition tasks
  • methods: Predicting the features of multiple SSL models from a single SSL model and fusing the predictions into a lightweight framework (see the sketch below)
  • results: Improved the average SUPERB score on the ML-SUPERB benchmark while reducing model parameter size and inference time
    Abstract Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing SSL models could achieve superior performance compared to using one SSL model. However, fusion models have increased model parameter size, leading to longer inference times. In this paper, we propose a novel approach of predicting other SSL models' features from a single SSL model, resulting in a light-weight framework with competitive performance. Our experiments show that SSL feature prediction models outperform individual SSL models in multilingual speech recognition tasks. The leading prediction model achieves an average SUPERB score increase of 135.4 in ML-SUPERB benchmarks. Moreover, our proposed framework offers an efficient solution, as it reduces the resulting model parameter size and inference times compared to previous fusion models.
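To make the approach concrete, here is a minimal PyTorch sketch, assuming a simple MLP head per target model and an MSE prediction objective (illustrative assumptions, not the authors' implementation): run a single SSL encoder, predict the other SSL models' features from it, and feed the concatenation of source and predicted features to the downstream ASR model.

```python
# Hedged sketch of SSL feature prediction (not the authors' code): one
# "source" SSL model runs at inference; small heads predict the features
# of the other SSL models, replacing a heavy multi-encoder fusion.
import torch
import torch.nn as nn

class SSLFeaturePredictor(nn.Module):
    def __init__(self, src_dim, tgt_dims):
        super().__init__()
        # One lightweight prediction head per "virtual" target SSL model.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(src_dim, src_dim), nn.ReLU(),
                          nn.Linear(src_dim, d))
            for d in tgt_dims
        )

    def forward(self, src_feats):
        # src_feats: (batch, time, src_dim) from the single SSL encoder.
        preds = [head(src_feats) for head in self.heads]
        # Fused representation for the downstream ASR encoder.
        return torch.cat([src_feats] + preds, dim=-1)

# Training the heads against real target-model features (assumed MSE loss;
# the downstream ASR objective would be added on top of this).
predictor = SSLFeaturePredictor(src_dim=768, tgt_dims=[768, 1024])
src = torch.randn(4, 100, 768)  # e.g. features of a base-sized SSL model
targets = [torch.randn(4, 100, 768), torch.randn(4, 100, 1024)]
loss = sum(nn.functional.mse_loss(h(src), t)
           for h, t in zip(predictor.heads, targets))
loss.backward()
```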

Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset

  • paper_url: http://arxiv.org/abs/2310.03901
  • repo_url: None
  • paper_authors: Yiwen Shao
  • for: Investigating multi-channel multi-talker speech recognition, where background noise, reverberation, and overlapping speech must be overcome, and how contextual cues can be leveraged to separate the target speaker's speech
  • methods: 3D spatial features computed from the target speaker's location information to improve recognition accuracy (a sketch of one such feature is given below)
  • results: 3D spatial features can reduce or entirely remove intermediate processing steps while improving recognition accuracy; extending the approach to the MISP dataset and validating the models further demonstrates its feasibility and effectiveness
    Abstract Multi-channel multi-talker speech recognition presents formidable challenges in the realm of speech processing, marked by issues such as background noise, reverberation, and overlapping speech. Overcoming these complexities requires leveraging contextual cues to separate target speech from a cacophonous mix, enabling accurate recognition. Among these cues, the 3D spatial feature has emerged as a cutting-edge solution, particularly when equipped with spatial information about the target speaker. Its exceptional ability to discern the target speaker within mixed audio, often rendering intermediate processing redundant, paves the way for the direct training of "All-in-one" ASR models. These models have demonstrated commendable performance on both simulated and real-world data. In this paper, we extend this approach to the MISP dataset to further validate its efficacy. We delve into the challenges encountered and insights gained when applying 3D spatial features to MISP, while also exploring preliminary experiments involving the replacement of these features with more complex input and models.
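As one concrete example of such a cue, the numpy sketch below computes a location-guided spatial feature by comparing the observed inter-channel phase difference with the phase difference predicted by the target speaker's direction; the exact formulation is an assumption for illustration, not necessarily the feature used in the paper.

```python
# A minimal numpy sketch (assumed formulation) of a location-guided spatial
# feature: time-frequency bins where the observed inter-channel phase
# difference (IPD) matches the target direction's predicted IPD score high.
import numpy as np

def angle_feature(stft, mic_pos, target_dir, freqs, c=343.0):
    """stft: (mics, freq, time) complex STFT; mic_pos: (mics, 3) in metres;
    target_dir: (3,) unit vector towards the target; freqs: (freq,) in Hz."""
    ref = 0  # reference microphone
    feats = []
    for m in range(1, stft.shape[0]):
        # Time difference of arrival for this mic pair, given the direction.
        tdoa = (mic_pos[m] - mic_pos[ref]) @ target_dir / c
        expected_ipd = 2 * np.pi * freqs * tdoa                 # (freq,)
        observed_ipd = np.angle(stft[m] * np.conj(stft[ref]))   # (freq, time)
        # Cosine similarity between observed and direction-predicted IPD.
        feats.append(np.cos(observed_ipd - expected_ipd[:, None]))
    return np.mean(feats, axis=0)  # (freq, time) spatial feature map
```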

Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification

  • paper_url: http://arxiv.org/abs/2310.03889
  • repo_url: None
  • paper_authors: Yuanbo Hou, Siyang Song, Chuang Yu, Wenwu Wang, Dick Botteldooren
  • for: Disclosing the relationship between real-life acoustic scenes and semantic embeddings of the most relevant audio events
  • methods: An event-relational graph representation learning (ERGL) framework for acoustic scene classification that also makes explicit which cues are used in classification; each event embedding is treated as a node, and the relationship cues between each pair of nodes are described by multi-dimensional edge features (see the sketch below)
  • results: On a real-life acoustic scene classification dataset, ERGL achieves highly competitive performance using embeddings of only a limited number of audio events, showing that diverse acoustic scenes can be recognized from an audio event-relational graph. Visualizations of the learned graph representations are available at https://github.com/Yuanbo2020/ERGL
    Abstract Most deep learning-based acoustic scene classification (ASC) approaches identify scenes based on acoustic features converted from audio clips containing mixed information entangled by polyphonic audio events (AEs). However, these approaches have difficulties in explaining what cues they use to identify scenes. This paper conducts the first study on disclosing the relationship between real-life acoustic scenes and semantic embeddings from the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC to classify scenes, and simultaneously answer clearly and directly which cues are used in classifying. In the event-relational graph, embeddings of each event are treated as nodes, while relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive performance on ASC by learning embeddings of only a limited number of AEs. The results show the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph. Visualizations of graph representations learned by ERGL are available here (https://github.com/Yuanbo2020/ERGL).
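The sketch below (with assumed layer sizes; the released visualization code lives at the GitHub URL above) illustrates the graph described in the abstract: audio-event embeddings as nodes, an MLP deriving a multi-dimensional edge feature from every node pair, and one round of edge-conditioned message passing feeding a scene classifier.

```python
# Hedged PyTorch sketch of an event-relational graph (not the released ERGL
# code): nodes are audio-event embeddings, edges are multi-dimensional
# features derived from node pairs, messages are aggregated per receiver.
import torch
import torch.nn as nn

class TinyERGL(nn.Module):
    def __init__(self, n_events=25, dim=64, edge_dim=8, n_scenes=10):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(),
                                      nn.Linear(32, edge_dim))
        self.msg = nn.Linear(dim + edge_dim, dim)
        self.cls = nn.Linear(dim, n_scenes)

    def forward(self, nodes):
        # nodes: (batch, n_events, dim) audio-event embeddings.
        b, n, d = nodes.shape
        src = nodes.unsqueeze(2).expand(b, n, n, d)  # sender of each edge
        dst = nodes.unsqueeze(1).expand(b, n, n, d)  # receiver of each edge
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))  # (b,n,n,edge_dim)
        # Aggregate edge-conditioned messages over all senders per receiver.
        msgs = self.msg(torch.cat([src, edges], dim=-1)).mean(dim=1)
        return self.cls((nodes + msgs).mean(dim=1))  # scene logits

logits = TinyERGL()(torch.randn(2, 25, 64))  # -> shape (2, 10)
```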

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

  • paper_url: http://arxiv.org/abs/2310.03856
  • repo_url: None
  • paper_authors: Awais Khan, Khalid Mahmood Malik
  • for: Protecting voice biometric systems from spoofing attacks, particularly logical-access attacks carried out with audio deepfakes
  • methods: One-shot learning and metric learning to detect and identify synthetic attacks with differing statistical distributions, extracting compact and representative temporal embeddings from an effective spectral feature set (a triplet-loss sketch is given below)
  • results: Quick-SpoofNet is evaluated on the ASVspoof 2019 logical access (LA) dataset and tested against various unseen deepfake attacks; the experiments show high attack-detection rates and strong generalization
    Abstract The Automatic Speaker Verification (ASV) system is vulnerable to fraudulent activities using audio deepfakes, also known as logical-access voice spoofing attacks. These deepfakes pose a concerning threat to voice biometrics due to recent advancements in generative AI and speech synthesis technologies. While several deep learning models for speech synthesis detection have been developed, most of them show poor generalizability, especially when the attacks have different statistical distributions from the ones seen. Therefore, this paper presents Quick-SpoofNet, an approach for detecting both seen and unseen synthetic attacks in the ASV system using one-shot learning and metric learning techniques. By using the effective spectral feature set, the proposed method extracts compact and representative temporal embeddings from the voice samples and utilizes metric learning and triplet loss to assess the similarity index and distinguish different embeddings. The system effectively clusters similar speech embeddings, classifying bona fide speeches as the target class and identifying other clusters as spoofing attacks. The proposed system is evaluated using the ASVspoof 2019 logical access (LA) dataset and tested against unseen deepfake attacks from the ASVspoof 2021 dataset. Additionally, its generalization ability towards unseen bona fide speech is assessed using speech data from the VSDC dataset.
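A minimal sketch of the metric-learning core, assuming a toy encoder, margin, and decision threshold (none of these are from the paper): train with triplet loss so bona fide speech clusters together, then classify a test utterance by its distance to a single bona fide reference embedding.

```python
# Hedged sketch (not Quick-SpoofNet itself): triplet-loss embedding training
# plus a one-shot spoofing decision by distance to a bona fide reference.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
triplet = nn.TripletMarginLoss(margin=1.0)  # margin is an assumption

# anchor/positive: bona fide spectral features; negative: spoofed speech.
anchor, positive, negative = (torch.randn(16, 80) for _ in range(3))
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()

# One-shot decision at test time: distance to an enrolled bona fide sample.
with torch.no_grad():
    ref = encoder(torch.randn(1, 80))    # bona fide reference embedding
    test = encoder(torch.randn(1, 80))   # unknown utterance
    is_spoof = torch.dist(ref, test).item() > 1.0  # assumed threshold
```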

Speaker localization using direct path dominance test based on sound field directivity

  • paper_url: http://arxiv.org/abs/2310.03688
  • repo_url: None
  • paper_authors: Boaz Rafaely, Koby Alhaiany
  • for: Developing a direction-of-arrival (DoA) estimation method that is robust to reverberation
  • methods: A direct-path dominance test over the time-frequency representation based on a sound-field directivity measure, requiring neither frequency smoothing nor matrix decomposition (see the sketch below)
  • results: Compared to the previous method, the proposed method retains similar robustness under noise and reverberation while being about four times more computationally efficient
    Abstract Estimation of the direction-of-arrival (DoA) of a speaker in a room is important in many audio signal processing applications. Environments with reverberation that masks the DoA information are particularly challenging. Recently, a DoA estimation method that is robust to reverberation has been developed. This method identifies time-frequency bins dominated by the contribution from the direct path, which carries the correct DoA information. However, its implementation is computationally demanding as it requires frequency smoothing to overcome the effect of coherent early reflections and matrix decomposition to apply the direct-path dominance (DPD) test. In this work, a novel computationally-efficient alternative to the DPD test is proposed, based on the directivity measure for sensor arrays, which requires neither frequency smoothing nor matrix decomposition, and which has been reformulated for sound field directivity with spherical microphone arrays. The paper presents the proposed method and a comparison to previous methods under a range of reverberation and noise conditions. Results demonstrate that the proposed method shows comparable performance to the original method in terms of robustness to reverberation and noise, and is about four times more computationally efficient for the given experiment.
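To convey the flavor of direct-path-dominance testing without matrix decomposition, the sketch below uses a simple steered-power concentration ratio as the per-bin directivity measure; this scalar stand-in is an assumption, not the paper's spherical-array formulation.

```python
# Simplified numpy sketch (assumed formulation): time-frequency bins whose
# steered power concentrates strongly in one candidate direction are treated
# as direct-path dominated and vote for the speaker's DoA.
import numpy as np

def doa_from_directivity(stft, steering, threshold=4.0):
    """stft: (mics, freq, time) complex STFT;
    steering: (dirs, mics, freq) steering vectors for candidate directions."""
    votes = np.zeros(steering.shape[0])
    for f in range(stft.shape[1]):
        for t in range(stft.shape[2]):
            x = stft[:, f, t]
            # Steered power towards each candidate direction.
            power = np.abs(steering[:, :, f].conj() @ x) ** 2  # (dirs,)
            # Directivity-style concentration measure for this bin.
            directivity = power.max() / power.mean()
            if directivity > threshold:  # direct-path-dominated bin
                votes[power.argmax()] += 1
    return votes.argmax()  # index of the estimated direction
```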

Performance and energy balance: a comprehensive study of state-of-the-art sound event detection systems

  • paper_url: http://arxiv.org/abs/2310.03455
  • repo_url: https://github.com/ronfrancesca/sed_carbon_footprint
  • paper_authors: Francesca Ronchini, Romain Serizel
  • for: Examining the trend toward increased complexity and energy consumption in deep-learning-based sound event detection (SED) systems and its environmental impact
  • methods: A comparison and detailed analysis based on submissions to the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge task over the past two years (an energy-logging sketch is given below)
  • results: Over the past two years, the complexity and energy consumption of SED systems have grown, and their environmental impact has grown accordingly
    Abstract In recent years, deep learning systems have shown a concerning trend toward increased complexity and higher energy consumption. As researchers in this domain and organizers of one of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge tasks, we recognize the importance of addressing the environmental impact of data-driven sound event detection (SED) systems. In this paper, we propose an analysis focused on SED systems based on the challenge submissions. This includes a comparison across the past two years and a detailed analysis of this year's SED systems. Through this research, we aim to explore how SED systems are evolving from year to year with respect to their energy-efficiency implications.
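For readers who want to reproduce this kind of energy accounting, a minimal sketch using the codecarbon package is shown below (an illustrative assumption; the paper's exact measurement setup may differ, see the linked repository).

```python
# Hedged sketch: wrap an SED training run with codecarbon to estimate the
# energy use and CO2 emissions that analyses like this paper compare.
from codecarbon import EmissionsTracker

def train_sed_model():
    pass  # placeholder for the actual SED training loop

tracker = EmissionsTracker(project_name="sed_training")
tracker.start()
train_sed_model()
emissions_kg = tracker.stop()  # estimated emissions in kg CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```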

VaSAB: The variable size adaptive information bottleneck for disentanglement on speech and singing voice

  • paper_url: http://arxiv.org/abs/2310.03444
  • repo_url: None
  • paper_authors: Frederik Bous, Axel Roebel
  • for: voice transformation, disentanglement of the F0 parameter
  • methods: dropout-based information bottleneck auto-encoder with an adaptive bottleneck size set through the dropout rate (see the sketch below)
  • results: improved disentanglement of the F0 parameter and improved synthesis quality for both speech and singing voice, including extremely high pitches, yielding a universal voice model that covers both
    Abstract The information bottleneck auto-encoder is a tool for disentanglement commonly used for voice transformation. Successful disentanglement relies on the right choice of bottleneck size. Previous bottleneck auto-encoders created the bottleneck through the dimension of the latent space or through vector quantization, and had no means to change the bottleneck size of a specific model. As the bottleneck removes information from the disentangled representation, the choice of bottleneck size is a trade-off between disentanglement and synthesis quality. We propose to build the information bottleneck using dropout, which allows us to change the bottleneck through the dropout rate, and we investigate adapting the bottleneck size depending on the context. We experimentally explore using the adaptive bottleneck for pitch transformation and demonstrate that it improves disentanglement of the F0 parameter for both speech and singing voice, leading to improved synthesis quality. Using the variable bottleneck size, we were able to achieve disentanglement for singing voice, including extremely high pitches, and to create a universal voice model that works on both speech and singing voice with improved synthesis quality.
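A minimal PyTorch sketch of the variable-size dropout bottleneck idea (architecture and dimensions are illustrative assumptions): the dropout rate on the latent code is chosen per call, so one trained model can trade disentanglement against synthesis quality and adapt the bottleneck to context.

```python
# Hedged sketch (not the authors' model) of a dropout-based information
# bottleneck with a per-call rate: higher drop_rate removes more information
# from the latent code, forcing the decoder to rely on the conditioning F0.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutBottleneckAE(nn.Module):
    def __init__(self, in_dim=80, latent=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, latent)
        # Decoder is conditioned on the (disentangled) F0 value.
        self.dec = nn.Linear(latent + 1, in_dim)

    def forward(self, x, f0, drop_rate):
        z = self.enc(x)
        # Variable-size bottleneck: the rate is an argument, not a fixed
        # property of the model, so it can adapt to the context.
        z = F.dropout(z, p=drop_rate, training=True)
        return self.dec(torch.cat([z, f0], dim=-1))

model = DropoutBottleneckAE()
x, f0 = torch.randn(8, 80), torch.rand(8, 1)
recon = model(x, f0, drop_rate=0.5)  # bottleneck chosen at call time
```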