cs.SD - 2023-09-23

Attention Is All You Need For Blind Room Volume Estimation

  • paper_url: http://arxiv.org/abs/2309.13504
  • repo_url: None
  • paper_authors: Chunxi Wang, Maoshen Jia, Meiran Li, Changchun Bao, Wenyu Jin
  • for: This paper studies the dynamic parameterization of acoustic environments, specifically the blind estimation of room acoustic parameters such as the geometric room volume.
  • methods: Proposes a purely attention-based model for blind room acoustic parameter estimation. The model builds on a Transformer architecture, taking Gammatone magnitude spectral coefficients and phase spectrograms as inputs, and applies cross-modality transfer learning to improve performance on the task-specific dataset.
  • results: Experiments show that the proposed model outperforms traditional CNN models across a wide range of real-world acoustic spaces, with notable gains from the dedicated pretraining and data augmentation schemes.
    Abstract In recent years, dynamic parameterization of acoustic environments has attracted increasing attention in the field of audio processing. One of the key parameters that characterize the local room acoustics in isolation from the orientation and directivity of sources and receivers is the geometric room volume. Convolutional neural networks (CNNs) have been widely selected as the main models for blind room acoustic parameter estimation, which aims to learn a direct mapping from audio spectrograms to the corresponding labels. Following the recent trend of self-attention mechanisms, this paper introduces a purely attention-based model that blindly estimates room volume from single-channel noisy speech signals. We demonstrate the feasibility of eliminating the reliance on CNNs for this task; the proposed Transformer architecture takes Gammatone magnitude spectral coefficients and phase spectrograms as inputs. To enhance model performance given the task-specific dataset, cross-modality transfer learning is also applied. Experimental results demonstrate that the proposed model outperforms traditional CNN models across a wide range of real-world acoustic spaces, especially with the help of the dedicated pretraining and data augmentation schemes.
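    The abstract outlines a general recipe: tokenize time-frequency features, run them through a Transformer encoder, and regress the room volume. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the feature dimensions, frame-level tokenization, layer sizes, and the log-volume regression target are all illustrative assumptions.

```python
# Minimal sketch of a purely attention-based room volume regressor.
# Assumes Gammatone magnitude and phase features stacked per time frame;
# shapes and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

class AttentionVolumeEstimator(nn.Module):
    def __init__(self, n_bands=64, n_frames=96, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # Each time frame (magnitude + phase bands) becomes one token.
        self.embed = nn.Linear(2 * n_bands, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))       # pooling token
        self.pos = nn.Parameter(torch.zeros(1, n_frames + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # e.g. predicts log10 room volume

    def forward(self, mag, phase):
        # mag, phase: (batch, n_bands, n_frames)
        x = torch.cat([mag, phase], dim=1).transpose(1, 2)  # (B, frames, 2*bands)
        x = self.embed(x)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)  # pool via the [CLS] token

model = AttentionVolumeEstimator()
mag, phase = torch.randn(2, 64, 96), torch.randn(2, 64, 96)
print(model(mag, phase).shape)  # torch.Size([2])
```

    In this sketch a learnable [CLS] token pools the sequence, mirroring common practice in audio Transformers. The paper's cross-modality transfer learning would, under this reading, amount to initializing the encoder from weights pretrained on another modality before fine-tuning on the room-volume data.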

Two vs. Four-Channel Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2309.13343
  • repo_url: None
  • paper_authors: Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Ali Abavisani, Juan Pablo Bello
  • for: This study investigates the performance of models from the DCASE 2022 SELD Challenge (Task 3), which operate in a 4-channel setting, and the effect of different audio input representations on SELD performance.
  • methods: Uses the DCASE 2022 SELD baseline model and performs a comparative analysis of three audio input representations (FOA, binaural, and stereo) to evaluate their effect on SELD performance.
  • results: Binaural and stereo (i.e., 2-channel) SELD models can still localize and detect sound sources laterally quite well, although overall performance degrades as less audio information is provided. The analysis is further segmented by scenes with varying degrees of sound source polyphony to better understand the effect of the input representation as scene conditions grow more complex.
    Abstract Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While beneficial to furthering the development of SELD systems, multichannel recording setups such as first-order Ambisonics (FOA) are out of reach for most consumer electronics devices, which are rarely able to record with more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e. 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex.
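    In practical terms, the difference between the three input representations mostly shows up as the number of input channels the front end sees. The sketch below is a simplified stand-in rather than the DCASE 2022 baseline code: a CRNN-style trunk whose first convolution adapts to 4-channel FOA versus 2-channel binaural/stereo features while the recurrent SELD head stays unchanged. The feature type, layer sizes, and ACCDOA-style output are illustrative assumptions.

```python
# Hedged sketch of how the audio representation changes an SELD front end.
# Simplified stand-in for a CRNN baseline; not the DCASE 2022 code.
import torch
import torch.nn as nn

N_MELS, N_FRAMES, N_CLASSES = 64, 100, 13

def make_seld_trunk(n_channels: int) -> nn.Module:
    # Only the first convolution depends on the representation:
    # FOA -> 4 input channels, binaural/stereo -> 2.
    return nn.Sequential(
        nn.Conv2d(n_channels, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, N_FRAMES)),  # collapse the mel axis
        nn.Flatten(1, 2),                     # (batch, 64, N_FRAMES)
    )

class SeldHead(nn.Module):
    """GRU + linear head emitting per-frame, class-wise DOA vectors."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, 3 * N_CLASSES)  # (x, y, z) per class

    def forward(self, feats):
        h, _ = self.rnn(feats.transpose(1, 2))  # (batch, frames, 256)
        return torch.tanh(self.fc(h))           # ACCDOA-style activations

head = SeldHead()
for name, n_ch in [("FOA", 4), ("binaural", 2), ("stereo", 2)]:
    trunk = make_seld_trunk(n_ch)
    spec = torch.randn(1, n_ch, N_MELS, N_FRAMES)  # per-channel log-mel frames
    print(name, tuple(head(trunk(spec)).shape))    # (1, N_FRAMES, 3 * N_CLASSES)
```

    The design point this illustrates is that halving the input channels leaves the rest of the architecture intact, which is why the paper can compare the same baseline across representations and attribute performance differences to the input alone.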

Contrastive Speaker Embedding With Sequential Disentanglement

  • paper_url: http://arxiv.org/abs/2309.13253
  • repo_url: None
  • paper_authors: Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
  • for: This paper proposes a contrastive learning approach to speaker embedding that uses a disentangled sequential variational autoencoder (DSVAE) to remove linguistic content, so that only speaker factors are used to construct the contrastive loss objective.
  • methods: Incorporates a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework to remove linguistic content, and learns speaker features via contrastive learning.
  • results: Experiments on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR, suggesting that applying sequential disentanglement is beneficial for learning speaker-discriminative embeddings.
    Abstract Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that only the speaker factors are used for constructing a contrastive loss objective. Because content factors have been removed from the contrastive learning, the resulting speaker embeddings will be content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR. This suggests that applying sequential disentanglement is beneficial to learning speaker-discriminative embeddings.
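    The key step is that the SimCLR objective is computed on speaker factors only, after the DSVAE has split each segment embedding into speaker and content parts. Below is a minimal NT-Xent loss sketch in PyTorch illustrating that step; the DSVAE itself is mocked by an arbitrary split of a random embedding, and all dimensions are illustrative assumptions.

```python
# Minimal NT-Xent (SimCLR) loss applied to speaker factors only.
# The DSVAE encoder is mocked: we simply split an embedding in half and
# pretend one half holds speaker factors, the other content factors.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) speaker factors from two views of the same segments."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, dim), unit norm
    sim = z @ z.t() / temperature                       # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for index i is its other view: i + n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Mock disentangled embeddings for two augmented views of 8 segments;
# only the speaker halves enter the contrastive objective.
emb_view1, emb_view2 = torch.randn(8, 128), torch.randn(8, 128)
spk1, _content1 = emb_view1.chunk(2, dim=1)
spk2, _content2 = emb_view2.chunk(2, dim=1)
print(nt_xent(spk1, spk2))  # scalar loss
```

    Because the content halves never enter the loss, gradients only shape the speaker factors, which is the mechanism behind the content-invariance the abstract claims for the resulting embeddings.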