cs.SD - 2023-10-12

End-to-end Online Speaker Diarization with Target Speaker Tracking

  • paper_url: http://arxiv.org/abs/2310.08696
  • repo_url: None
  • paper_authors: Weiqing Wang, Ming Li
  • for: This paper proposes an online target-speaker voice activity detection system for speaker diarization that does not require a priori knowledge of the target-speaker embeddings from a clustering-based diarization system.
  • methods: The system adapts conventional target-speaker voice activity detection for real-time operation. During inference, a front-end model extracts frame-level speaker embeddings for each incoming block of the signal; the detection state of each speaker is predicted from these frame-level embeddings together with the previously estimated target-speaker embeddings, and the target-speaker embeddings are then updated according to the predictions in the current block (a minimal sketch of this block-wise loop follows this entry).
  • results: Experimental results show that the proposed method outperforms the offline clustering-based diarization system on the DIHARD III and AliMeeting datasets. The method is further extended to multi-channel data, where it achieves performance comparable to state-of-the-art offline diarization systems.
    Abstract This paper proposes an online target speaker voice activity detection system for speaker diarization tasks, which does not require a priori knowledge from the clustering-based diarization system to obtain the target speaker embeddings. By adapting the conventional target speaker voice activity detection for real-time operation, this framework can identify speaker activities using self-generated embeddings, resulting in consistent performance without permutation inconsistencies in the inference phase. During the inference process, we employ a front-end model to extract the frame-level speaker embeddings for each coming block of a signal. Next, we predict the detection state of each speaker based on these frame-level speaker embeddings and the previously estimated target speaker embedding. Then, the target speaker embeddings are updated by aggregating these frame-level speaker embeddings according to the predictions in the current block. Our model predicts the results for each block and updates the target speakers' embeddings until reaching the end of the signal. Experimental results show that the proposed method outperforms the offline clustering-based diarization system on the DIHARD III and AliMeeting datasets. The proposed method is further extended to multi-channel data, which achieves similar performance with the state-of-the-art offline diarization systems.
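A minimal sketch of the block-wise online inference loop described above, assuming hypothetical `frontend` and `detector` interfaces (the paper lists no repository, so this is an illustration rather than the authors' implementation):

```python
import numpy as np

def online_diarize(blocks, frontend, detector, max_speakers=4, dim=256):
    """Block-wise online target-speaker VAD sketch (hypothetical interfaces).

    frontend(block)            -> (T, dim) frame-level speaker embeddings
    detector(frames, targets)  -> (T, max_speakers) speech activities in [0, 1]
    """
    targets = np.zeros((max_speakers, dim))  # running target-speaker embeddings
    counts = np.zeros(max_speakers)          # frames aggregated per target so far
    results = []

    for block in blocks:
        frames = frontend(block)              # frame-level embeddings for this block
        activity = detector(frames, targets)  # per-frame, per-speaker predictions
        results.append(activity)

        # Update each target embedding with the frames predicted active for it.
        for spk in range(max_speakers):
            active = activity[:, spk] > 0.5
            if active.any():
                total = targets[spk] * counts[spk] + frames[active].sum(axis=0)
                counts[spk] += active.sum()
                targets[spk] = total / counts[spk]

    return np.concatenate(results, axis=0)
```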

Crowdsourced and Automatic Speech Prominence Estimation

  • paper_url: http://arxiv.org/abs/2310.08464
  • repo_url: https://github.com/reseval/reseval
  • paper_authors: Max Morrison, Pranav Pawar, Nathan Pruyne, Jennifer Cole, Bryan Pardo
  • for: The paper develops an automated system for speech prominence estimation, which is useful for linguistic analysis and for training automated systems for emphasis-controlled text-to-speech and emotion recognition.
  • methods: The paper uses crowdsourced annotations of a portion of the LibriTTS dataset to train a neural speech prominence estimator, and investigates the impact of dataset size and the number of annotations per utterance on the accuracy of the estimator.
  • results: The paper achieves high accuracy on unseen speakers, datasets, and speaking styles, and provides insights into the design decisions for neural prominence estimation and how annotation cost affects the performance of the estimator.
    Abstract The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.
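As an illustration of how several annotations per utterance can be aggregated into per-word ground truth and used to fit an estimator, here is a small sketch; the features and the ridge regressor are placeholders, not the paper's neural architecture or the linked reseval tooling:

```python
import numpy as np
from sklearn.linear_model import Ridge

def aggregate_annotations(annotations):
    """Average several per-word prominence annotations (one row per annotator)
    into a single ground-truth value per word of the utterance."""
    return np.asarray(annotations, dtype=float).mean(axis=0)

# Toy example: 3 annotators rate a 5-word utterance. The features are
# placeholders (e.g. per-word acoustic statistics), not the paper's inputs.
annotations = [[0, 1, 0, 2, 0],
               [0, 2, 0, 2, 1],
               [1, 1, 0, 3, 0]]
targets = aggregate_annotations(annotations)         # shape (5,)
features = np.random.default_rng(0).random((5, 16))  # one 16-dim vector per word

estimator = Ridge(alpha=1.0).fit(features, targets)  # stand-in for the neural model
print(estimator.predict(features))
```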

A cry for help: Early detection of brain injury in newborns

  • paper_url: http://arxiv.org/abs/2310.08338
  • repo_url: None
  • paper_authors: Charles C. Onu, Samantha Latremouille, Arsenii Gorin, Junhao Wang, Uchenna Ekwochi, Peter O. Ubuane, Omolara A. Kehinde, Muhammad A. Salisu, Datonye Briggs, Yoshua Bengio, Doina Precup
  • for: The goal of this study is to use an AI algorithm to detect brain injury in newborns from their cries, providing a reliable diagnostic tool in settings where an accurate diagnosis is otherwise unavailable.
  • methods: The study develops a new training methodology for audio-based pathology detection models and evaluates it on a large database of newborn cry recordings collected at 5 hospitals across 3 continents. The system extracts interpretable acoustic biomarkers and accurately detects neonatal brain injury with an AUC of 92.5% (88.7% sensitivity at 80% specificity).
  • results: The study finds that cry-based detection of neonatal brain injury offers a low-cost, easy-to-use, non-invasive screening tool, particularly in developing countries where most births are not attended by a trained physician. Such a system could reduce the need to subject newborns to physically exhausting or radiation-exposing assessments such as brain CT scans. The work sets the stage for treating the infant cry as a vital sign and indicates the potential of AI-driven sound monitoring for affordable healthcare.
    Abstract Since the 1960s, neonatal clinicians have known that newborns suffering from certain neurological conditions exhibit altered crying patterns such as the high-pitched cry in birth asphyxia. Despite an annual burden of over 1.5 million infant deaths and disabilities, early detection of neonatal brain injuries due to asphyxia remains a challenge, particularly in developing countries where the majority of births are not attended by a trained physician. Here we report on the first inter-continental clinical study to demonstrate that neonatal brain injury can be reliably determined from recorded infant cries using an AI algorithm we call Roseline. Previous and recent work has been limited by the lack of a large, high-quality clinical database of cry recordings, constraining the application of state-of-the-art machine learning. We develop a new training methodology for audio-based pathology detection models and evaluate this system on a large database of newborn cry sounds acquired from geographically diverse settings -- 5 hospitals across 3 continents. Our system extracts interpretable acoustic biomarkers that support clinical decisions and is able to accurately detect neurological injury from newborns' cries with an AUC of 92.5% (88.7% sensitivity at 80% specificity). Cry-based neurological monitoring opens the door for low-cost, easy-to-use, non-invasive and contact-free screening of at-risk babies, especially when integrated into simple devices like smartphones or neonatal ICU monitors. This would provide a reliable tool where there are no alternatives, but also curtail the need to regularly exert newborns to physically-exhausting or radiation-exposing assessments such as brain CT scans. This work sets the stage for embracing the infant cry as a vital sign and indicates the potential of AI-driven sound monitoring for the future of affordable healthcare.
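For context on the reported operating point (AUC of 92.5%, 88.7% sensitivity at 80% specificity), a short sketch of how such numbers are read off a ROC curve; the scores below are synthetic, not the study's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic scores: a higher score means the (toy) classifier suspects injury.
labels = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

auc = roc_auc_score(labels, scores)
fpr, tpr, _ = roc_curve(labels, scores)

# Sensitivity (true-positive rate) at 80% specificity, i.e. FPR <= 0.2.
sensitivity_at_80_spec = tpr[fpr <= 0.2].max()
print(f"AUC = {auc:.3f}, sensitivity at 80% specificity = {sensitivity_at_80_spec:.3f}")
```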

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

  • paper_url: http://arxiv.org/abs/2310.08277
  • repo_url: None
  • paper_authors: Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa
  • for: This paper proposes a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.
  • methods: Two modules are integrated into an SE model: 1) an internal separation module that jointly performs speaker counting and separation, and 2) a TSE module that extracts the target speech from the internal separation outputs using target-speaker cues. The model performs TSE when a target-speaker cue is given and SS otherwise (a minimal sketch of this conditional routing follows this entry).
  • results: Evaluation results show that the proposed MUSE model can successfully handle multiple tasks with a single model, which had not been accomplished before.
    Abstract We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that does both speaker counting and separation; and 2) a TSE module that extracts the target speech from the internal separation outputs using target speaker cues. The model is trained to perform TSE if the target speaker cue is given and SS otherwise. By training the model to remove noise and reverberation, we allow the model to tackle the five tasks mentioned above with a single model, which has not been accomplished yet. Evaluation results demonstrate that the proposed MUSE model can successfully handle multiple tasks with a single model.
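A minimal sketch of the conditional routing described above (SS when no cue is given, TSE when a target-speaker cue is provided); the module internals are toy placeholders, not the MUSE architecture:

```python
import torch
import torch.nn as nn

class MuseSketch(nn.Module):
    """Toy stand-in for MUSE-style routing: encode the mixture, produce
    per-speaker masks (separation), and optionally extract one target
    stream when a speaker cue is provided."""

    def __init__(self, dim=128, max_speakers=4):
        super().__init__()
        self.encoder = nn.Linear(1, dim)               # placeholder front-end
        self.separator = nn.Linear(dim, max_speakers)  # placeholder separation head
        self.tse_head = nn.Linear(dim * 2, 1)          # combines features and cue

    def forward(self, mixture, speaker_cue=None):
        # mixture: (batch, time, 1); speaker_cue: (batch, dim) or None
        feats = self.encoder(mixture)                  # (batch, time, dim)
        masks = torch.sigmoid(self.separator(feats))   # (batch, time, max_speakers)
        separated = mixture * masks                    # per-speaker streams

        if speaker_cue is None:
            return separated                           # speech-separation path
        cue = speaker_cue.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.tse_head(torch.cat([feats, cue], dim=-1))  # target-extraction path

model = MuseSketch()
mix = torch.randn(2, 100, 1)
print(model(mix).shape)                        # SS output:  (2, 100, 4)
print(model(mix, torch.randn(2, 128)).shape)   # TSE output: (2, 100, 1)
```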