results: We train and evaluate on the sim2spk, sim3spk and sim4spk datasets and show that our framework accurately localizes target speech events. The framework is also versatile, performing well on three diarization-related tasks: target speaker voice activity detection, overlapped speech detection and gender diarization.
Abstract
We introduce a novel task named `target speech diarization', which seeks to determine `when target event occurred' within an audio signal. We devise a neural architecture called Prompt-driven Target Speech Diarization (PTSD) that works with diverse prompts specifying the target speech events of interest. We train and evaluate PTSD on the sim2spk, sim3spk and sim4spk datasets, which are derived from LibriSpeech. We show that the proposed framework accurately localizes target speech events. Furthermore, the framework is versatile, performing well on three diarization-related tasks: target speaker voice activity detection, overlapped speech detection and gender diarization. In particular, PTSD achieves performance comparable to specialized models across these tasks on both real and simulated data. This work serves as a reference benchmark and provides valuable insights into prompt-driven target speech processing.
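To make the task definition concrete, the sketch below derives frame-level reference labels for the three events named above from a per-speaker activity matrix, then converts them into `when' segments. It is a minimal illustration under stated assumptions, not the PTSD model: the function names `target_labels` and `to_segments`, the prompt strings, and the frame-based input layout are all hypothetical and do not come from the paper.

```python
def target_labels(activity, genders, prompt, target_speaker=None):
    """Return one 0/1 label per frame for the event selected by `prompt`.

    activity: list of per-speaker lists, 1 where that speaker is talking.
    genders:  list of "female"/"male" strings, one per speaker.
    prompt:   hypothetical prompt name: "target_speaker", "overlap",
              "female" or "male".
    """
    num_frames = len(activity[0])
    if prompt == "target_speaker":
        if target_speaker is None:
            raise ValueError("target_speaker index is required")
        chosen, min_active = [activity[target_speaker]], 1
    elif prompt == "overlap":
        # Overlapped speech: at least two speakers active in the same frame.
        chosen, min_active = activity, 2
    elif prompt in ("female", "male"):
        # Gender diarization: any speaker of the requested gender is active.
        chosen = [row for row, g in zip(activity, genders) if g == prompt]
        min_active = 1
    else:
        raise ValueError(f"unknown prompt: {prompt}")
    return [
        int(sum(row[t] for row in chosen) >= min_active)
        for t in range(num_frames)
    ]


def to_segments(labels):
    """Convert 0/1 frame labels to half-open (start, end) frame segments."""
    segments, start = [], None
    for t, active in enumerate(labels):
        if active and start is None:
            start = t
        elif not active and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


if __name__ == "__main__":
    activity = [[1, 1, 0, 0, 1],   # speaker 0
                [0, 1, 1, 0, 0]]   # speaker 1
    genders = ["female", "male"]
    for prompt in ("overlap", "female", "male"):
        print(prompt, to_segments(target_labels(activity, genders, prompt)))
```

Under this framing, all three tasks share one output format (a binary activity track per prompt), which is what allows a single prompt-conditioned model to be compared against task-specific systems.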