eess.AS - 2023-08-15

GIST-AiTeR Speaker Diarization System for VoxCeleb Speaker Recognition Challenge (VoxSRC) 2023

  • paper_url: http://arxiv.org/abs/2308.07788
  • repo_url: None
  • paper_authors: Dongkeon Park, Ji Won Kim, Kang Ryeol Kim, Do Hyun Lee, Hong Kook Kim
  • for: This report describes the GIST-AiTeR team's submission system for Track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23).
  • methods: The submission system integrates diverse speaker diarization (SD) techniques, including ResNet293 and MFA-Conformer run with different combinations of segment and hop length, which are then combined into an ensemble model (a windowing sketch follows this entry).
  • results: The ResNet293 and MFA-Conformer models achieved diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively. The submitted ensemble model reached a DER of 3.50% on VAL46 and, consequently, 4.88% on the VoxSRC-23 test set.
    Abstract This report describes the submission system by the GIST-AiTeR team for the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23) Track 4. Our submission system focuses on implementing diverse speaker diarization (SD) techniques, including ResNet293 and MFA-Conformer with different combinations of segment and hop length. Then, those models are combined into an ensemble model. The ResNet293 and MFA-Conformer models exhibited the diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively. The submitted ensemble model provided a DER of 3.50% on VAL46, and consequently, it achieved a DER of 4.88% on the VoxSRC-23 test set.
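Both speaker models embed fixed-length audio segments cut by a sliding window, and the ensemble varies the segment and hop lengths. Below is a minimal Python sketch of that windowing step; the `segment_waveform` helper and the (segment, hop) grid are illustrative assumptions, not the team's released code.

```python
import numpy as np

def segment_waveform(wav, sr, seg_len, hop_len):
    """Slice a mono waveform into fixed-length, overlapping segments.

    Each segment would then be embedded by a speaker model such as
    ResNet293 or MFA-Conformer before being clustered into speakers.
    """
    seg, hop = int(seg_len * sr), int(hop_len * sr)
    segments, onsets = [], []
    for start in range(0, max(len(wav) - seg, 0) + 1, hop):
        segments.append(wav[start:start + seg])
        onsets.append(start / sr)  # segment onset in seconds
    return segments, onsets

sr = 16000
wav = np.zeros(10 * sr, dtype=np.float32)  # 10 s of dummy audio
# Hypothetical (segment, hop) combinations in seconds; the paper ensembles
# models run with different settings, but these exact values are made up.
for seg_len, hop_len in [(1.5, 0.75), (2.0, 1.0), (3.0, 1.5)]:
    segs, onsets = segment_waveform(wav, sr, seg_len, hop_len)
    print(f"seg={seg_len}s hop={hop_len}s -> {len(segs)} segments")
```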

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

  • paper_url: http://arxiv.org/abs/2308.07595
  • repo_url: None
  • paper_authors: Ming Cheng, Weiqing Wang, Xiaoyi Qin, Yuke Lin, Ning Jiang, Guoqing Zhao, Ming Li
  • for: This paper describes the DKU-MSXF submission to Track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23).
  • methods: The system pipeline comprises voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each stage fuses the outputs of 3 sub-models.
  • results: Fusing the different clustering-based and TSVAD-based diarization systems with DOVER-Lap achieves a 4.30% diarization error rate (DER), ranking first on the Track 4 challenge leaderboard (a simplified fusion sketch follows this entry).
    Abstract This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each procedure has a fused output from 3 sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve the 4.30% diarization error rate (DER), which ranks first place on track 4 of the challenge leaderboard.
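DOVER-Lap combines multiple diarization hypotheses by mapping their speaker labels into a common namespace and then voting over time. The sketch below implements only a simplified frame-level majority vote on hypotheses whose labels are assumed to be pre-aligned; the real DOVER-Lap additionally performs label mapping and rank-weighted voting, so everything here (turn format, frame size, majority rule) is an illustrative assumption.

```python
from collections import Counter

def frame_vote(hypotheses, frame=0.01):
    """Fuse diarization hypotheses by per-frame majority vote.

    hypotheses: list of outputs, each a list of (start_sec, end_sec, speaker)
    turns whose speaker labels are already mapped to a common namespace.
    Returns a list of (time, speaker) frames kept by the vote.
    """
    total = max(end for hyp in hypotheses for _, end, _ in hyp)
    n_frames = int(round(total / frame))
    fused = []
    for i in range(n_frames):
        t = i * frame
        votes = Counter(
            spk for hyp in hypotheses for s, e, spk in hyp if s <= t < e
        )
        if votes:
            label, count = votes.most_common(1)[0]
            # keep the frame only if a majority of systems agree on a speaker
            if count >= (len(hypotheses) + 1) // 2:
                fused.append((round(t, 2), label))
    return fused

# Three toy hypotheses with pre-mapped labels, disagreeing near the boundary.
h1 = [(0.0, 2.0, "A"), (2.0, 4.0, "B")]
h2 = [(0.0, 1.8, "A"), (1.8, 4.0, "B")]
h3 = [(0.0, 2.2, "A"), (2.2, 4.0, "B")]
fused = frame_vote([h1, h2, h3])
print(fused[:3], "...", fused[-3:])
```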

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

  • paper_url: http://arxiv.org/abs/2308.07593
  • repo_url: None
  • paper_authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
  • for: This paper proposes an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that uses the audio modality to complement the insufficient speech information in the visual modality.
  • methods: The proposed AKVSR encodes rich audio knowledge with a large-scale pretrained audio model, stores only its linguistic content in a compact audio memory via quantization, and uses an Audio Bridging Module to retrieve the best-matched audio features from that memory, so training requires no audio input once the memory is built (a retrieval sketch follows this entry).
  • results: Extensive experiments validate the effectiveness of the proposed method, which achieves new state-of-the-art performance on the widely used LRS2 and LRS3 datasets.
    Abstract Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs, once after the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used datasets, LRS2 and LRS3.
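The retrieval step can be pictured as visual features attending over a frozen bank of quantized audio codewords. The PyTorch sketch below is a minimal stand-in for the Audio Bridging Module: the memory size, feature dimension, single-head dot-product attention, and residual fusion are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

class AudioBridgingSketch(torch.nn.Module):
    """Toy memory retrieval in the spirit of AKVSR's Audio Bridging Module.

    Visual features attend over a fixed, compact audio memory (standing in
    for quantized codewords from a pretrained audio model) and retrieve the
    best-matched audio features, so no audio input is needed at this stage.
    """

    def __init__(self, dim=256, mem_size=128):
        super().__init__()
        # Stand-in for the compact audio memory built offline by quantizing
        # features of a large-scale pretrained audio model.
        self.memory = torch.nn.Parameter(torch.randn(mem_size, dim))
        self.query = torch.nn.Linear(dim, dim)

    def forward(self, visual):  # visual: (batch, time, dim)
        q = self.query(visual)  # project visual features into query space
        # Scaled dot-product attention over the memory slots.
        attn = F.softmax(q @ self.memory.t() / q.shape[-1] ** 0.5, dim=-1)
        retrieved = attn @ self.memory  # (batch, time, dim) audio features
        return visual + retrieved       # simple residual fusion (assumption)

module = AudioBridgingSketch()
out = module(torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```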