eess.AS - 2023-07-17

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

  • paper_url: http://arxiv.org/abs/2307.08239
  • repo_url: None
  • paper_authors: Siwei Huang, Jianfeng Chen, Jisheng Bai, Yafei Jia, Dongzhe Zhang
  • for: 这篇论文的目的是提出一种高效的声事件地理位置检测和检测系统,用于真实的空间声场。
  • methods: 该系统使用动态核心 convolution 模块来适应不同的感知范围,以及 SELDnet 和 EINv2 框架。此外,在训练阶段,还引入了两种场景专门的策略以提高系统在真实空间声场中的通用性。
  • results: 实验结果表明,提出的系统在 Sony-TAu 真实空间声场 dataset 上的表现出色,并超过了 fixes-kernel convolution SELD 系统。此外,该系统在 DCASE SELD 任务中获得了0.348的 SELD 分数,超过了 State-of-the-Art 方法。
    Abstract DNN-based methods have shown high performance in sound event localization and detection (SELD). However, in real spatial sound scenes, reverberation and the imbalanced presence of various sound events increase the complexity of the SELD task. In this paper, we propose an effective SELD system in real spatial scenes. In our approach, a dynamic kernel convolution module is introduced after the convolution blocks to adaptively model the channel-wise features with different receptive fields. Secondly, we incorporate the SELDnet and EINv2 framework into the proposed SELD system with multi-track ACCDOA. Moreover, two scene-dedicated strategies are introduced into the training stage to improve the generalization of the system in realistic spatial sound scenes. Finally, we apply data augmentation methods to extend the dataset using channel rotation and spatial data synthesis. Four joint metrics are used to evaluate the performance of the SELD system on the Sony-TAu Realistic Spatial Soundscapes 2022 dataset. Experimental results show that the proposed systems outperform the fixed-kernel convolution SELD systems. In addition, the proposed system achieved an SELD score of 0.348 in the DCASE SELD task and surpassed the SOTA methods.
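The paper summary does not include code; below is a minimal PyTorch sketch of what a dynamic kernel convolution block of this kind could look like: several depthwise branches with different kernel sizes (i.e., different receptive fields) whose outputs are fused with channel-wise weights predicted from global context. The kernel sizes, reduction ratio, and layer layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicKernelConv(nn.Module):
    """Fuses branches with different receptive fields via channel-wise weights."""

    def __init__(self, channels: int, kernel_sizes=(3, 5), reduction: int = 4):
        super().__init__()
        # One depthwise branch per kernel size, i.e. per receptive field (assumed sizes).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2,
                          groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        hidden = max(channels // reduction, 8)
        # Bottleneck that turns global context into per-branch, per-channel weights.
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.select = nn.Linear(hidden, channels * len(kernel_sizes))

    def forward(self, x):                                           # x: (B, C, T, F)
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, T, F)
        context = feats.sum(dim=1).mean(dim=(2, 3))                 # (B, C) global pooling
        weights = self.select(self.fc(context))                     # (B, K*C)
        weights = weights.view(x.size(0), len(self.branches), -1)   # (B, K, C)
        weights = torch.softmax(weights, dim=1)[..., None, None]    # softmax over branches
        return (feats * weights).sum(dim=1)                         # adaptive fusion, (B, C, T, F)
```

Such a block would sit after the ordinary convolution blocks of a SELDnet/EINv2-style backbone, letting each channel pick the receptive field that suits it per input.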
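The abstract also mentions channel-rotation data augmentation. Below is a minimal sketch of one such rotation for first-order Ambisonics (FOA) input, assuming ACN channel ordering (W, Y, Z, X) and Cartesian DOA labels; the specific set of rotations used by the authors is not given in this summary.

```python
import numpy as np

def rotate_foa_90deg(audio: np.ndarray, doa_xyz: np.ndarray):
    """Rotate an FOA clip and its DOA labels by +90 degrees around the z axis.

    audio   : (4, num_samples) in assumed ACN order (W, Y, Z, X)
    doa_xyz : (num_frames, 3) Cartesian DOA labels (x, y, z)
    """
    w, y, z, x = audio
    # A +90 degree azimuth rotation maps (x, y) -> (-y, x); W and Z are unchanged.
    rotated_audio = np.stack([w, x, z, -y])
    rotated_doa = np.stack([-doa_xyz[:, 1], doa_xyz[:, 0], doa_xyz[:, 2]], axis=1)
    return rotated_audio, rotated_doa
```

Applying a handful of such rotations and reflections multiplies the training data without new recordings, and the matching label transform keeps the DOA targets consistent.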

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.08234
  • repo_url: https://github.com/openai/whisper
  • paper_authors: Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng
  • for: This paper aims to improve the performance of end-to-end automatic speech recognition (E2E ASR) models.
  • methods: The paper adapts pretrained large language models (LLMs) to speech, using either an encoder-decoder or a decoder-only structure (see the sketch after the abstract below).
  • results: The approach effectively leverages pretrained LLMs to produce more readable ASR transcriptions. On fully formatted E2E ASR transcription tasks across various domains, the model surpasses strong ASR models such as Whisper in terms of recognition error rate.
    Abstract Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits due to the mismatches between text-based LLMs and those used in E2E ASR. In this paper, we explore an alternative approach by adapting a pretrained LLM to speech. Our experiments on fully-formatted E2E ASR transcription tasks across various domains demonstrate that our approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is based on pretrained large language models with either an encoder-decoder or decoder-only structure, surpasses strong ASR models such as Whisper in terms of recognition error rate, considering formats like punctuation and capitalization as well.
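As a rough illustration of the decoder-only variant described above, the sketch below prepends projected speech-encoder frames to the text embeddings of a pretrained causal LM and fine-tunes it to emit fully formatted transcripts. The speech encoder, adapter shape, and the "gpt2" checkpoint are placeholders; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class SpeechAdaptedLM(nn.Module):
    def __init__(self, lm_name: str = "gpt2", speech_dim: int = 512):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        hidden = self.lm.get_input_embeddings().embedding_dim
        # Lightweight adapter mapping acoustic-encoder frames into the LM embedding space.
        self.adapter = nn.Sequential(nn.Linear(speech_dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, hidden))

    def forward(self, speech_feats, text_ids):
        # speech_feats: (B, S, speech_dim) from any pretrained acoustic encoder (assumed)
        # text_ids    : (B, T) token ids of the formatted reference transcript
        speech_emb = self.adapter(speech_feats)                 # (B, S, H)
        text_emb = self.lm.get_input_embeddings()(text_ids)     # (B, T, H)
        inputs = torch.cat([speech_emb, text_emb], dim=1)       # speech prefix + text
        # Ignore the loss on the speech prefix; supervise only the text tokens.
        ignore = torch.full(speech_feats.shape[:2], -100,
                            dtype=torch.long, device=text_ids.device)
        labels = torch.cat([ignore, text_ids], dim=1)
        return self.lm(inputs_embeds=inputs, labels=labels).loss
```

In practice the acoustic encoder would itself be a pretrained speech model and the LM far larger than GPT-2; the point of the sketch is only that the LM keeps its text-side strengths (punctuation, capitalization) while being conditioned on speech.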