cs.SD - 2023-12-07

DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

  • paper_url: http://arxiv.org/abs/2312.04324
  • repo_url: None
  • paper_authors: Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget
  • for: Proposes a new end-to-end speaker diarization model aimed at improving diarization performance and efficiency.
  • methods: Replaces the EDA module of EEND-EDA with a Perceiver-based attractor module to improve accuracy and efficiency (a minimal sketch of the idea follows this entry).
  • results: On the widely studied Callhome dataset, the model diarizes speakers better and runs inference on long recordings substantially faster; in comparisons with other methods, the proposed model, DiaPer, reaches strong performance with a very lightweight design.
    Abstract Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA, namely obtaining better performance on the widely studied Callhome dataset, finding the number of speakers in a conversation more accurately, and running inference in almost half the time on long recordings. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. In addition, we compare with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
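To make the architectural change concrete, below is a minimal PyTorch sketch of a Perceiver-style attractor decoder in the EEND setting: a fixed set of learned latent queries cross-attends to the encoder's frame embeddings to produce one attractor per potential speaker, plus an existence probability used to count speakers. All class names, dimensions, and hyperparameters here are illustrative assumptions, not DiaPer's released code.

```python
# Hypothetical sketch of a Perceiver-based attractor decoder for EEND.
# Not DiaPer's actual implementation; names and sizes are assumptions.
import torch
import torch.nn as nn


class PerceiverAttractorDecoder(nn.Module):
    """Latent queries cross-attend to frame embeddings to form attractors."""

    def __init__(self, dim=256, max_speakers=4, heads=4, blocks=2):
        super().__init__()
        # Fixed set of learned latent queries, one per potential speaker;
        # this stands in for EEND-EDA's sequential LSTM encoder-decoder.
        self.latents = nn.Parameter(torch.randn(max_speakers, dim))
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(blocks)
        )
        self.exist = nn.Linear(dim, 1)  # existence logit per attractor

    def forward(self, frames):  # frames: (B, T, D) from the EEND encoder
        b = frames.size(0)
        attractors = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.blocks:
            # Latents attend to all frames in parallel (no recurrence over T),
            # which is where the speed-up on long recordings would come from.
            update, _ = attn(attractors, frames, frames)
            attractors = attractors + update
        # Existence probabilities let the model count speakers: attractors
        # with low probability are discarded at inference time.
        existence = torch.sigmoid(self.exist(attractors)).squeeze(-1)  # (B, S)
        # Frame-level speaker activities as frame-attractor similarities.
        activities = torch.sigmoid(frames @ attractors.transpose(1, 2))  # (B, T, S)
        return activities, existence


# Usage: 500 frames of 256-dim embeddings -> activities (1, 500, 4), existence (1, 4)
dec = PerceiverAttractorDecoder()
acts, ex = dec(torch.randn(1, 500, 256))
```

Replacing the recurrent attractor extraction with parallel cross-attention is consistent with the abstract's two headline claims: better speaker counting (via per-attractor existence probabilities) and roughly halved inference time on long recordings.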

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

  • paper_url: http://arxiv.org/abs/2312.04131
  • repo_url: None
  • paper_authors: Huan Zhao, Li Zhang, Yue Li, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie
  • for: Improve the performance of audio-visual speaker diarization systems so that they better exploit both each speaker's speech and video.
  • methods: Leverages pre-trained supervised (ResNet, ECAPA-TDNN) and self-supervised (WavLM, HuBERT) speech models as embedding extractors for audio-visual speaker diarization, and explores Transformer, Conformer, and cross-attention architectures in the audio-visual decoder (see the sketch after this entry).
  • results: Experiments on the MISP dataset show that the proposed method achieves superior performance; it obtained third place in the MISP Challenge 2022.
    Abstract The scarcity of labeled audio-visual datasets is a constraint on training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. We then explore the effectiveness of different frameworks, including Transformer, Conformer, and a cross-attention mechanism, in the audio-visual decoder. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in the MISP Challenge 2022.
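As a rough illustration of the cross-attention variant of the audio-visual decoder, the sketch below lets time-aligned visual embeddings query audio embeddings and vice versa before predicting per-frame speech activity. Shapes, names, and the fusion head are assumptions for illustration only; in the setup the abstract describes, the `audio` input would come from a pre-trained extractor such as WavLM or HuBERT and the `video` input from a per-speaker face/lip embedding track.

```python
# Hypothetical sketch of a cross-attention audio-visual decoder for AVSD.
# Not the paper's implementation; names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossAttentionAVDecoder(nn.Module):
    """Fuses audio and visual streams with bidirectional cross-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)  # per-frame speech-activity logit

    def forward(self, audio, video):
        # audio: (B, T, D) frame embeddings, e.g. from WavLM/HuBERT;
        # video: (B, T, D) time-aligned embeddings of one candidate speaker's
        # face/lip track (the decoder is applied once per visible speaker).
        av, _ = self.v2a(video, audio, audio)  # visual queries attend to audio
        va, _ = self.a2v(audio, video, video)  # audio queries attend to video
        fused = torch.cat([av, va], dim=-1)    # (B, T, 2*D)
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (B, T) activity
```

Consistent with the paper's joint-training finding, the pre-trained audio encoder, the speaker encoder, and a decoder like this one would be optimized together in a single training loop, rather than freezing the extractors and training the decoder in isolation.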