paper_authors: Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg
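results: The paper presents a multi-channel, multi-speaker speech recognition system built on the NeMo toolkit and submitted to the 7th CHiME Challenge. The reported optimization process significantly improved the system's performance.
Abstract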
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system comprises three main modules: the Speaker Diarization Module, the Multi-channel Audio Front-End Processing Module, and the ASR Module. Together these components form a cascaded system that processes multi-channel, multi-speaker audio input. Moreover, this paper highlights the comprehensive optimization process that significantly enhanced our system's performance. Our team's submission is largely based on the NeMo toolkit and will be publicly available.
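Summary
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) task. The work focuses on a multi-channel, multi-speaker speech recognition system for speech recorded by distributed microphones and microphone arrays. The system consists mainly of three modules: a speaker diarization module, a multi-channel audio front-end processing module, and an ASR module. Together they form a cascaded system that processes multi-channel, multi-speaker audio input. The paper also describes the comprehensive optimization process that improved the system's performance. The submission is based on the NeMo toolkit and will be made publicly available.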
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
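results: The paper demonstrates that the simulator can generate large-scale speech mixture datasets with realistic statistical properties, and that these datasets can be used to train speaker diarization and voice activity detection models.
Abstract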
We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization and voice activity detection. The acquisition of substantial datasets for speaker diarization often presents a significant challenge, particularly in multi-speaker scenarios. Furthermore, the precise time stamp annotation of speech data is a critical factor for training both speaker diarization and voice activity detection. Our proposed multi-speaker simulator tackles these problems by generating large-scale audio mixtures that maintain statistical properties closely aligned with the input parameters. We demonstrate that the proposed multi-speaker simulator generates audio mixtures with statistical properties that closely align with the input parameters derived from real-world statistics. Additionally, we present the effectiveness of speaker diarization and voice activity detection models, which have been trained exclusively on the generated simulated datasets.
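Summary
We introduce an advanced multi-speaker speech data simulator designed to generate multi-speaker speech recordings. One notable feature of the simulator is that the distribution of silence and overlap can be controlled by adjusting statistical parameters. This provides a tailored training environment for developing neural models for speaker diarization and voice activity detection. Obtaining large speaker diarization datasets is often a major challenge in multi-speaker scenarios, and precise time stamp annotation of speech data is essential for training these models. The proposed simulator addresses these problems by generating large-scale audio mixtures whose statistical properties closely match the input parameters. We also show the effectiveness of speaker diarization and voice activity detection models trained exclusively on the generated simulated data.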
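The idea of steering silence and overlap through statistical parameters can be illustrated with a small timeline generator. The following is a minimal sketch under stated assumptions, not the simulator described in the abstract: the function names, the exponential distributions, and the parameters `mean_silence`, `mean_overlap`, and `overlap_prob` are all hypothetical choices made for illustration, and only segment timestamps are produced (no audio is mixed).

```python
import random


def simulate_session(num_utts, num_speakers, mean_silence, mean_overlap,
                     overlap_prob, mean_utt_len=3.0, seed=0):
    """Return (speaker, start, end) segments in seconds (hypothetical sketch)."""
    if num_speakers < 2:
        raise ValueError("this sketch assumes at least two speakers")
    rng = random.Random(seed)
    segments = []
    prev_end = 0.0
    prev_spk = None
    for _ in range(num_utts):
        # Always switch speaker so that any overlap is between different speakers.
        spk = rng.choice([s for s in range(num_speakers) if s != prev_spk])
        dur = 0.5 + rng.expovariate(1.0 / mean_utt_len)
        if segments and rng.random() < overlap_prob:
            # Overlap case: start before the previous utterance has ended.
            prev_dur = segments[-1][2] - segments[-1][1]
            ov = min(rng.expovariate(1.0 / mean_overlap), prev_dur, dur)
            start = prev_end - ov
        else:
            # Silence case: leave a gap after the previous utterance.
            start = prev_end + rng.expovariate(1.0 / mean_silence)
        end = start + dur
        segments.append((spk, start, end))
        prev_end = max(prev_end, end)
        prev_spk = spk
    return segments


def timeline_stats(segments):
    """Measure the fraction of session time with no speaker and with 2+ speakers."""
    events = []
    for _, start, end in segments:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, ends (-1) are processed before starts (+1)
    silence = overlap = 0.0
    active = 0
    prev_t = events[0][0]
    for t, delta in events:
        span = t - prev_t
        if active == 0:
            silence += span
        elif active >= 2:
            overlap += span
        active += delta
        prev_t = t
    total = events[-1][0] - events[0][0]
    return {"silence_ratio": silence / total, "overlap_ratio": overlap / total}


if __name__ == "__main__":
    session = simulate_session(500, 3, mean_silence=1.0, mean_overlap=0.8,
                               overlap_prob=0.3)
    print(timeline_stats(session))
```

Comparing the measured ratios against the requested parameters is the kind of check the abstract refers to when it says the generated mixtures align with the input statistics; the exact parameterization used by the authors may differ.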
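methods: The study trains a number of end-to-end models using several toolkits. It relies heavily on Guided Source Separation (GSS) to convert multi-channel audio to a single channel. The ASR uses speech representations from models pre-trained with self-supervised learning, and several ASR systems are fused.
results: The systems use oracle segmentation and are reported to achieve good results on Distant Automatic Speech Recognition (DASR).
Abstract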
This paper describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow and University of Buenos Aires on the development of Automatic Speech Recognition systems for the CHiME-7 Challenge. We train and evaluate various end-to-end models with several toolkits. We relied heavily on Guided Source Separation (GSS) to convert multi-channel audio to single channel. The ASR leverages speech representations from models pre-trained by self-supervised learning, and we fuse several ASR systems. In addition, we modified external data from the LibriSpeech corpus to bring it closer to the target domain and added it to the training. Our efforts were focused on the far-field acoustic robustness sub-track of Task 1 - Distant Automatic Speech Recognition (DASR). Our systems use oracle segmentation.
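Summary
This report describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow (AGH), and the University of Buenos Aires (UBA) to develop automatic speech recognition systems for the CHiME-7 Challenge. We used Guided Source Separation (GSS) to convert multi-channel audio to a single channel, and we used speech representations from models pre-trained with self-supervised learning. We also modified an external dataset to bring it closer to the target domain and added it to the training data. Our efforts focused on the far-field acoustic robustness sub-track of Task 1 - Distant Automatic Speech Recognition (DASR), and our systems use oracle segmentation.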
Physics-informed Neural Network for Acoustic Resonance Analysis
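results: Resonance in a one-dimensional acoustic tube was analyzed, and the effectiveness of the proposed method was validated through forward and inverse analyses.
Abstract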
This study proposes the physics-informed neural network (PINN) framework to solve the wave equation for acoustic resonance analysis. ResoNet, the analytical model proposed in this study, minimizes the loss function for periodic solutions, in addition to conventional PINN loss functions, thereby effectively using the function approximation capability of neural networks, while performing resonance analysis. Additionally, it can be easily applied to inverse problems. Herein, the resonance in a one-dimensional acoustic tube was analyzed. The effectiveness of the proposed method was validated through the forward and inverse analyses of the wave equation with energy-loss terms. In the forward analysis, the applicability of PINN to the resonance problem was evaluated by comparison with the finite-difference method. The inverse analysis, which included the identification of the energy loss term in the wave equation and design optimization of the acoustic tube, was performed with good accuracy.
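Summary
This study proposes a physics-informed neural network (PINN) framework for solving the wave equation in acoustic resonance analysis. ResoNet, the analytical model proposed in the study, minimizes a loss function for periodic solutions in addition to the conventional PINN loss functions, thereby making effective use of the function approximation capability of neural networks while performing resonance analysis. It can also be applied easily to inverse problems. Here, resonance in a one-dimensional acoustic tube was analyzed. Forward and inverse analyses were carried out; in the forward analysis the PINN was compared with the finite-difference method, and the results support the reliability and accuracy of the PINN approach.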
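To make the role of the periodic-solution loss concrete, the objective can be written out for a generic damped wave equation. This is a sketch under assumptions, not the formulation from the paper: the damping form with coefficient $\alpha$, the equal weighting of the terms, and the collocation-point notation are all hypothetical.

```latex
% Assumed 1-D wave equation with a simple energy-loss term (alpha is hypothetical)
\begin{align}
  \frac{\partial^2 p}{\partial t^2}
    + \alpha \frac{\partial p}{\partial t}
    - c^2 \frac{\partial^2 p}{\partial x^2} &= 0 \\
% Composite objective for a network \hat{p}_\theta(x, t)
  \mathcal{L}(\theta) &= \mathcal{L}_{\mathrm{PDE}}
    + \mathcal{L}_{\mathrm{BC}} + \mathcal{L}_{\mathrm{per}} \\
% PDE residual evaluated at N collocation points (x_i, t_i)
  \mathcal{L}_{\mathrm{PDE}} &= \frac{1}{N} \sum_{i=1}^{N}
    \left| \frac{\partial^2 \hat{p}_\theta}{\partial t^2}
    + \alpha \frac{\partial \hat{p}_\theta}{\partial t}
    - c^2 \frac{\partial^2 \hat{p}_\theta}{\partial x^2}
    \right|^2_{(x_i, t_i)} \\
% Periodicity penalty with period T at M sample points (x_j, t_j)
  \mathcal{L}_{\mathrm{per}} &= \frac{1}{M} \sum_{j=1}^{M}
    \left| \hat{p}_\theta(x_j, t_j + T) - \hat{p}_\theta(x_j, t_j) \right|^2
\end{align}
```

Here $\mathcal{L}_{\mathrm{BC}}$ stands for the usual penalty on boundary and forcing conditions at the tube ends. In an inverse setting, a quantity such as $\alpha$ would be treated as a trainable variable alongside $\theta$.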
Blind estimation of audio effects using an auto-encoder approach and differentiable signal processing
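results: The auto-encoder approach yields better estimates of audio quality than the traditional parameter-based approach, without requiring knowledge of the exact AFX implementation.
Abstract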
Blind Estimation of Audio Effects (BE-AFX) aims at estimating the Audio Effects (AFXs) applied to an original, unprocessed audio sample solely based on the processed audio sample. To train such a system, traditional approaches optimize a loss between ground truth and estimated AFX parameters. This involves knowing the exact implementation of the AFXs used for the process. In this work, we propose an alternative solution that eliminates the requirement for knowing this implementation. Instead, we introduce an auto-encoder approach, which optimizes an audio quality metric. We explore, suggest, and compare various implementations of commonly used mastering AFXs, using differentiable signal processing or neural approximations. Our findings demonstrate that our auto-encoder approach yields superior estimates of the audio quality produced by a chain of AFXs, compared to the traditional parameter-based approach, even if the latter provides a more accurate parameter estimation.
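Summary
Blind Estimation of Audio Effects (BE-AFX) aims to estimate the audio effects (AFXs) applied to an original, unprocessed audio sample based solely on the processed sample. Traditional approaches learn AFX parameters by optimizing a loss between ground-truth and estimated parameters, which requires knowing the exact implementation of the AFXs. In this work, we propose an alternative: an auto-encoder approach that optimizes an audio quality metric. We explore, suggest, and compare implementations of commonly used mastering AFXs, using differentiable signal processing or neural approximations. Our findings show that the auto-encoder approach yields better estimates of audio quality without knowledge of the AFX implementation details, compared with the traditional parameter-based approach, even though the latter can estimate the parameters more accurately.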
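The difference between a parameter-domain loss and an audio-domain loss can be shown with a toy effect chain. The sketch below is an illustration under assumptions, not the paper's method: the two-stage gain chain, the function names, and the use of mean squared error in place of an audio quality metric are all hypothetical.

```python
import math


def apply_chain(signal, params):
    """Toy effect chain: two gain stages applied in series (hypothetical AFX)."""
    g1, g2 = params
    return [g2 * (g1 * x) for x in signal]


def parameter_loss(est, true):
    """Mean squared error between estimated and ground-truth parameters."""
    return sum((e - t) ** 2 for e, t in zip(est, true)) / len(true)


def audio_loss(signal, est, true):
    """Mean squared error between the two processed signals.

    Stands in for an audio quality metric; it only needs the rendered audio,
    not agreement on the individual parameter values.
    """
    y_est = apply_chain(signal, est)
    y_true = apply_chain(signal, true)
    return sum((a - b) ** 2 for a, b in zip(y_est, y_true)) / len(signal)


# A short sine wave as the dry (unprocessed) signal.
dry = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
true_params = (0.5, 2.0)
est_a = (2.0, 0.5)  # far from the truth in parameter space, same overall gain
est_b = (0.6, 2.1)  # close in parameter space, different overall gain

if __name__ == "__main__":
    for name, est in (("est_a", est_a), ("est_b", est_b)):
        print(name, parameter_loss(est, true_params),
              audio_loss(dry, est, true_params))
```

In this toy case the estimate with the larger parameter error reproduces the target audio, while the estimate with the smaller parameter error does not, which mirrors the trade-off the abstract describes between parameter accuracy and audio quality.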
EchoScan: Scanning Complex Indoor Geometries via Acoustic Echoes
results: Compared with vision-based methods, EchoScan performs well in rooms of various shapes and can accurately estimate room geometry.
Abstract
Accurate estimation of indoor space geometries is vital for constructing precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This study introduces EchoScan, a deep neural network model that utilizes acoustic echoes to perform room geometry inference. Conventional sound-based techniques rely on estimating geometry-related room parameters such as wall position and room size, thereby limiting the diversity of inferable room geometries. Contrarily, EchoScan overcomes this limitation by directly inferring room floorplans and heights, thereby enabling it to handle rooms with arbitrary shapes, including curved walls. The key innovation of EchoScan is its ability to analyze the complex relationship between low- and high-order reflections in room impulse responses (RIRs) using a multi-aggregation module. The analysis of high-order reflections also enables it to infer complex room shapes when echoes are unobservable from the position of an audio device. Herein, EchoScan was trained and evaluated using RIRs synthesized from complex environments, including the Manhattan and Atlanta layouts, employing a practical audio device configuration compatible with commercial, off-the-shelf devices. Compared with vision-based methods, EchoScan demonstrated outstanding geometry estimation performance in rooms with various shapes.
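Summary
Accurate estimation of indoor space geometry is key to building precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This study introduces EchoScan, a deep neural network model that uses acoustic echoes to infer room geometry. Conventional sound-based techniques estimate geometry-related room parameters such as wall position and room size, which limits the range of room geometries that can be inferred. In contrast, EchoScan directly infers room floorplans and heights, so it can handle rooms of arbitrary shape, including curved walls. Its key innovation is a multi-aggregation module that analyzes the complex relationship between low- and high-order reflections in room impulse responses (RIRs). By analyzing high-order reflections, EchoScan can infer complex room shapes even when echoes are unobservable from the position of the audio device. EchoScan was trained and evaluated on RIRs from complex environments, using a practical audio device configuration compatible with commercial, off-the-shelf devices. Compared with vision-based methods, EchoScan performed well in rooms of various shapes.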
Experimental Results of Underwater Sound Speed Profile Inversion by Few-shot Multi-task Learning
methods: state-of-the-art SSP inversion methods include frameworks of matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN), among which the FNN shows better real-time performance while maintaining the same level of accuracy.
results: MTL outperforms the state-of-the-art methods in terms of accuracy for SSP inversion, while inheriting the real-time advantage of FNN during the inversion stage.
Abstract
The underwater Sound Speed Profile (SSP) distribution has great influence on the propagation mode of acoustic signals, so fast and accurate estimation of the SSP is of great importance in building underwater observation systems. The state-of-the-art SSP inversion methods include frameworks of matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN), among which the FNN shows better real-time performance while maintaining the same level of accuracy. However, training an FNN needs a large number of historical SSP samples, a requirement that is difficult to satisfy in many ocean areas. This situation is called few-shot learning. To tackle this issue, we propose a multi-task learning (MTL) model with partial parameter sharing among different training tasks. With MTL, common features can be extracted, thus accelerating the learning process on given tasks and reducing the demand for reference samples, so as to enhance the generalization ability in few-shot learning. To verify the feasibility and effectiveness of MTL, a deep-ocean experiment was held in April 2023 in the South China Sea. Results show that MTL outperforms the state-of-the-art methods in terms of accuracy for SSP inversion, while inheriting the real-time advantage of FNN during the inversion stage.
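Summary
The underwater sound speed profile (SSP) distribution strongly influences the propagation mode of acoustic signals, so fast and accurate SSP estimation is key to building underwater observation systems. State-of-the-art SSP inversion methods include matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN); among these, the FNN offers better real-time performance while maintaining the same level of accuracy. However, training an FNN requires many historical SSP samples, which are difficult to obtain in many ocean areas. This situation is called few-shot learning. To address it, we propose a multi-task learning (MTL) model with partial parameter sharing among training tasks. MTL extracts common features, which accelerates learning, reduces the demand for reference samples, and improves generalization in few-shot learning. To verify the feasibility and effectiveness of MTL, a deep-ocean experiment was conducted in the South China Sea in April 2023. The results show that MTL outperforms state-of-the-art methods in SSP inversion accuracy while inheriting the real-time advantage of the FNN in the inversion stage.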
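Partial parameter sharing among tasks can be sketched as a network with one shared hidden layer and a separate output head per task. This is a minimal forward-pass illustration under assumptions, not the model from the paper: the class name, layer sizes, task names, and the reading of inputs as measured acoustic features and outputs as sound speeds at a few depths are all hypothetical, and no training loop is shown.

```python
import math
import random


class SharedTrunkMTL:
    """One hidden layer shared by all tasks plus one linear head per task."""

    def __init__(self, in_dim, hidden_dim, out_dim, task_names, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        # Shared weights: updated by every task, so they capture common features.
        self.shared = init(hidden_dim, in_dim)
        # Task-specific weights: each task keeps its own output head.
        self.heads = {name: init(out_dim, hidden_dim) for name in task_names}

    @staticmethod
    def _matvec(weights, x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    def forward(self, task, x):
        hidden = [math.tanh(v) for v in self._matvec(self.shared, x)]
        return self._matvec(self.heads[task], hidden)

    def num_parameters(self):
        shared = sum(len(row) for row in self.shared)
        heads = sum(len(row) for head in self.heads.values() for row in head)
        return shared + heads


if __name__ == "__main__":
    # Hypothetical sizes: 4 input features, 8 hidden units, 5 depth points.
    model = SharedTrunkMTL(4, 8, 5, ["region_a", "region_b"])
    features = [0.1, 0.2, 0.3, 0.4]
    print(model.forward("region_a", features))
    print(model.num_parameters())
```

Because the hidden layer is stored once, two tasks need fewer parameters than two independent networks of the same shape, which is one way to see why sharing can reduce the number of reference samples each task requires.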