results: Tested on the IEMOCAP dataset, the proposed system outperforms two baseline systems on all three tasks (AER, ASR and SD), and two metrics based on time-weighted emotion and speaker classification errors are proposed to evaluate AER performance with automatic segmentation.
Abstract
Although automatic emotion recognition (AER) has recently drawn significant research interest, most current AER studies use manually segmented utterances, which are usually unavailable for dialogue systems. This paper proposes integrating AER with automatic speech recognition (ASR) and speaker diarisation (SD) in a jointly-trained system. Distinct output layers are built for four sub-tasks including AER, ASR, voice activity detection and speaker classification based on a shared encoder. Taking the audio of a conversation as input, the integrated system finds all speech segments and transcribes the corresponding emotion classes, word sequences, and speaker identities. Two metrics are proposed to evaluate AER performance with automatic segmentation based on time-weighted emotion and speaker classification errors. Results on the IEMOCAP dataset show that the proposed system consistently outperforms two baselines with separately trained single-task systems on AER, ASR and SD.
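The abstract describes a single shared encoder feeding four task-specific output layers. Below is a minimal sketch of that kind of multi-task head arrangement, assuming a generic recurrent encoder; the hidden size, vocabulary size, emotion classes and speaker count are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class JointAerAsrSdModel(nn.Module):
    """Shared encoder with one output head per sub-task (AER, ASR,
    voice activity detection, speaker classification). All sizes below
    are illustrative guesses, not values reported in the paper."""

    def __init__(self, feat_dim=80, hidden=256, vocab=5000,
                 n_emotions=4, n_speakers=10):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               batch_first=True, bidirectional=True)
        enc_out = 2 * hidden
        self.aer_head = nn.Linear(enc_out, n_emotions)  # emotion class per frame/segment
        self.asr_head = nn.Linear(enc_out, vocab)       # e.g. CTC over word units
        self.vad_head = nn.Linear(enc_out, 2)           # speech / non-speech
        self.spk_head = nn.Linear(enc_out, n_speakers)  # speaker identity

    def forward(self, feats):
        enc, _ = self.encoder(feats)                    # (batch, time, 2*hidden)
        return {
            "aer": self.aer_head(enc),
            "asr": self.asr_head(enc),
            "vad": self.vad_head(enc),
            "spk": self.spk_head(enc),
        }

# Example: a batch of two 300-frame feature sequences.
outputs = JointAerAsrSdModel()(torch.randn(2, 300, 80))
print({k: v.shape for k, v in outputs.items()})
```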
VoxBlink: X-Large Speaker Verification Dataset on Camera
results: Experiments show that adding VoxBlink-Clean to VoxCeleb2 for training yields a 13%-30% performance improvement across different backbone architectures.
Abstract
In this paper, we contribute a novel and extensive dataset for speaker verification, which contains a noisy set of 38k identities/1.45M utterances (VoxBlink) and a relatively clean set of 18k identities/1.02M utterances (VoxBlink-Clean) for training. Firstly, we accumulate a list of 60k+ users with their avatars and download their short videos from YouTube. We then establish an automatic and scalable pipeline to extract relevant speech and video segments from these videos. To our knowledge, the VoxBlink dataset is one of the largest speaker recognition datasets available. Secondly, we conduct a series of experiments with different backbones trained on a mix of VoxCeleb2 and VoxBlink-Clean. Our findings highlight a notable performance improvement, ranging from 13% to 30%, across different backbone architectures upon integrating our dataset for training. The dataset will be made publicly available shortly.
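As a rough illustration of how the two corpora might be combined for backbone training, the sketch below concatenates and shuffles two hypothetical utterance manifests; the manifest paths and format are assumptions for illustration, not the released dataset layout.

```python
import random

def load_utterance_list(path):
    """Read a manifest of (speaker_id, wav_path) pairs, one per line.
    This manifest format is an assumption for illustration only."""
    with open(path) as f:
        return [tuple(line.strip().split(maxsplit=1)) for line in f if line.strip()]

# Hypothetical manifest paths; the released dataset layout may differ.
voxceleb2 = load_utterance_list("voxceleb2_train.txt")
voxblink_clean = load_utterance_list("voxblink_clean_train.txt")

# The experiments train backbones on a mix of both corpora; here the mix
# is a simple concatenation followed by shuffling.
train_set = voxceleb2 + voxblink_clean
random.shuffle(train_set)
print(f"{len(train_set)} utterances from "
      f"{len({spk for spk, _ in train_set})} speakers")
```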
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
results: Experiments on the MISP2021-AVSR dataset demonstrate the effectiveness of the two proposed techniques; using only a relatively small amount of training data, the final system outperforms state-of-the-art systems.
Abstract
In recent research, only slight performance improvements have been observed when moving from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Mismatched convergence rates and specialized input representations between the audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. This enables accurate alignment of video and audio streams during visual model pre-training and cross-modal fusion. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performances than state-of-the-art systems with more complex front-ends and back-ends.
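The audio-guided cross-modal fusion idea can be illustrated with a single attention layer in which audio features act as queries over visual keys and values. This is a minimal sketch assuming standard multi-head attention; the dimensions, layer count and residual arrangement of the paper's CMFE may differ.

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """One audio-guided cross-modal attention layer: audio features are the
    queries, visual features the keys/values. Sizes are illustrative only."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, audio, video):
        fused, _ = self.cross_attn(query=audio, key=video, value=video)
        x = self.norm1(audio + fused)   # residual over the audio stream
        return self.norm2(x + self.ffn(x))

# Example: 100 audio frames fused with 25 video frames; attention does not
# require the two streams to have the same length.
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 25, 256)
out = AudioGuidedFusionLayer()(audio, video)   # (2, 100, 256)
```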
The Sound Demixing Challenge 2023 – Cinematic Demixing Track
paper_authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji
results: The paper provides insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best system trained exclusively on the simulated Divide and Remaster (DnR) dataset improves SDR by 1.8 dB, while the top system on the open leaderboard, where any data may be used for training, improves SDR by 5.7 dB.
Abstract
This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8dB in SDR whereas the top performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7dB.
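The reported gains are in signal-to-distortion ratio (SDR). A plain global-SDR computation looks like the sketch below; the challenge's official scoring may use a windowed or per-stem variant, so this is only illustrative.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    """Global signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10 * np.log10((num + eps) / (den + eps))

# Toy check: a 1.8 dB SDR gain corresponds to the residual energy dropping
# by a factor of about 10**(1.8 / 10) ~= 1.51 relative to the baseline.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.randn(t.size)
print(round(sdr(clean, noisy), 2), "dB")
```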
The Sound Demixing Challenge 2023 – Music Demixing Track
results: The paper describes the highest-scoring methods in the competition and compares them with the previous edition (the Music Demixing Challenge 2021), showing an improvement of over 1.6 dB in signal-to-distortion ratio; perceptual quality was also assessed through a listening test.
Abstract
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system under the standard MSS formulation achieved an improvement of over 1.6dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems and report here the results. Finally, we provide our insights into the organization of the competition and our prospects for future editions.
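The two new datasets simulate labelling errors and bleeding between stems in the training data. The toy sketch below only illustrates those two error types on random waveforms; SDXDB23_LabelNoise and SDXDB23_Bleeding1 are constructed differently, and the leak fraction and swap probability used here are arbitrary.

```python
import random
import numpy as np

def add_bleeding(stems, leak=0.1):
    """Simulate 'bleeding': each stem picks up a small fraction of the others.
    `stems` maps stem name -> waveform of equal length; `leak` is illustrative."""
    names = list(stems)
    bled = {}
    for name in names:
        others = sum(stems[o] for o in names if o != name)
        bled[name] = stems[name] + leak * others
    return bled

def add_label_noise(stems, p=0.1):
    """Simulate label noise: with probability p, swap the labels of two stems."""
    stems = dict(stems)
    if random.random() < p:
        a, b = random.sample(list(stems), 2)
        stems[a], stems[b] = stems[b], stems[a]
    return stems

stems = {n: np.random.randn(16000) for n in ("vocals", "drums", "bass", "other")}
noisy_stems = add_label_noise(add_bleeding(stems), p=0.1)
```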