cs.SD - 2023-07-07

The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

paper_url: http://arxiv.org/abs/2307.03533
repo_url: None
paper_authors: Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer, John R. Hershey
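for: This paper introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task, which uses real-world noisy speech recordings to adapt speech enhancement models.
methods: The task relies on unsupervised domain adaptation: unlabeled noisy recordings from the target test domain (CHiME-5) are used to adapt speech enhancement models trained on synthetic mixtures.
results: The paper describes the data, the task, and a baseline system for removing background noise from conversational speech.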

Abstract
Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. In this paper, we introduce the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target test domain for unsupervised domain adaptation of speech enhancement models. The target test domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is not available. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.

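Summary
Supervised speech enhancement models are usually trained on artificially generated mixtures of clean speech and noise, which may not match real recording conditions, so performance can degrade at test time. The paper presents the UDASE task of the 7th CHiME challenge, whose goal is to use real-world noisy speech recordings for unsupervised adaptation of speech enhancement models. The target test domain is the multi-speaker, reverberant conversational speech of the CHiME-5 dataset, for which no clean speech reference exists. Given a CHiME-5 recording, the task is to estimate the clean, possibly multi-speaker, reverberant speech by removing the additive background noise. The paper covers the motivation, the data, the task, and the baseline system.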

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

paper_url: http://arxiv.org/abs/2307.03354
repo_url: None
paper_authors: Sara Papi, Peidong Wan, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
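for: This paper aims to improve streaming speech recognition and translation by generating both ASR and ST outputs with a single decoder.
methods: The approach uses a joint token-level serialized output training method that interleaves source and target words with the help of an off-the-shelf textual aligner.
results: Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings show that the approach achieves the best quality-latency balance. Compared to separate ASR and ST models, output quality does not degrade and even improves, by an average of 1.1 WER and 0.4 BLEU in the multilingual case.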

Abstract
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.

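Summary
Users often need both a translation and a transcription of speech, especially in streaming scenarios where output must be generated incrementally. The paper presents a streaming Transformer-Transducer that produces ASR and ST outputs jointly with a single decoder. To generate both with minimal latency, it proposes a joint token-level serialized output training method that interleaves source and target words using an off-the-shelf textual aligner. In monolingual (it-en) and multilingual ({de,es,it}-en) experiments, the method achieves the best quality-latency balance. At an average ASR latency of 1s and ST latency of 1.3s, output quality matches or exceeds that of separate ASR and ST models, with an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.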
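
To make the idea of interleaving source and target words more concrete, here is a minimal sketch of how a serialized training target could be built from a word alignment. It is not the authors' algorithm: the release rule (emit a target word once the last source word it is aligned to has been emitted), the `asr`/`st` labels, the function name `interleave`, and the example alignment are all assumptions made for illustration.

```python
from typing import List, Tuple


def interleave(
    src: List[str],
    tgt: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Interleave source (ASR) and target (ST) words into one sequence.

    `alignment` holds (source_index, target_index) pairs, as a textual
    aligner might produce. The release rule below is an assumption.
    """
    # For each target word, the last source position it is aligned to
    # (-1 means the target word is unaligned).
    last_src = [-1] * len(tgt)
    for s, t in alignment:
        last_src[t] = max(last_src[t], s)

    # Make release positions non-decreasing so target word order is kept;
    # unaligned target words inherit the position of the previous one.
    release = []
    running = 0
    for pos in last_src:
        running = max(running, pos)
        release.append(running)

    out: List[Tuple[str, str]] = []
    t = 0
    for s, word in enumerate(src):
        out.append(("asr", word))
        # Emit every target word whose aligned source words are all out.
        while t < len(tgt) and release[t] <= s:
            out.append(("st", tgt[t]))
            t += 1
    # Safety net for target words that were never released.
    out.extend(("st", w) for w in tgt[t:])
    return out


if __name__ == "__main__":
    source = ["il", "gatto", "nero", "dorme"]
    target = ["the", "black", "cat", "sleeps"]
    pairs = [(0, 0), (1, 2), (2, 1), (3, 3)]  # hypothetical alignment
    for label, word in interleave(source, target, pairs):
        print(label, word)
```

In this toy example the reordered pair "gatto nero" / "black cat" delays both target words until "nero" has been emitted, which illustrates why the interleaving policy affects translation latency.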

Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

paper_url: http://arxiv.org/abs/2307.03296
repo_url: https://github.com/areffarhadi/gammatonegram_cnn_dysarthric_speech
paper_authors: Aref Farhadipour, Hadi Veisi
for: This paper aims to develop a system for speech recognition, speaker identification, and intelligibility assessment for individuals with dysarthria.
methods: The proposed system uses the gammatonegram to represent audio files with discriminative details; the resulting images are fed into a convolutional neural network (CNN) for recognition. The CNN is built by transfer learning from a pre-trained AlexNet.
results: On the UA dataset, the system achieved 91.29% accuracy for speech recognition in speaker-dependent mode, 87.74% accuracy for speaker identification in text-dependent mode, and 96.47% accuracy for intelligibility assessment in two-class mode. Additionally, the multi-network speech recognition system achieved a WRR of 92.3%.

Abstract
Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, conventional speech processing systems cannot work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce the gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is based on transfer learning from a pre-trained AlexNet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system achieved 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is arranged in cascade with the two-class intelligibility assessment system, whose output activates one of the speech recognition networks. This architecture achieves a WRR of 92.3%. The source code of this paper is available.

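Summary
Dysarthria is a disorder of the human speech system that reduces the quality and intelligibility of speech, so conventional speech processing systems do not work properly on it. Because the condition usually comes with physical disabilities, a system that carries out tasks from voice commands in a smart home would be a significant achievement. The paper introduces the gammatonegram as an effective way to represent audio files: each speech file is converted into an image, and an image recognition system classifies the speech in different scenarios. The CNN is based on transfer learning from a pre-trained AlexNet. The system is evaluated on speech recognition, speaker identification, and intelligibility assessment. On the UA dataset, speech recognition reaches 91.29% accuracy in speaker-dependent mode, speaker identification 87.74% in text-dependent mode, and intelligibility assessment 96.47% in two-class mode. Finally, a multi-network speech recognition system, cascaded with the two-class intelligibility assessment system that activates the speech recognition networks, reaches a WRR of 92.3%. The source code is available.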

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

paper_url: http://arxiv.org/abs/2307.04765
repo_url: None
paper_authors: Oyku Berfin Mercan, Sercan Cepni, Davut Emre Tasar, Sukru Ozan
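for: This study examines the performance of two pre-trained multilingual models (Whisper-Small and Wav2Vec2-XLS-R-300M) on Turkish speech-to-text.
methods: Both models were fine-tuned on the Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset.
results: The WER values are 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. The models were also evaluated on test data built from call center recordings that were not part of the training and validation sets.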

Abstract
In this study, the performance of Whisper-Small and Wav2Vec2-XLS-R-300M, two pre-trained multilingual speech-to-text models, was examined for the Turkish language. Mozilla Common Voice version 11.0, an open-source dataset prepared in Turkish, was used in the study. The multilingual models Whisper-Small and Wav2Vec2-XLS-R-300M were fine-tuned with this dataset, which contains a small amount of data. The speech-to-text performance of the two models was compared. WER values are calculated as 0.28 and 0.16 for the Wav2Vec2-XLS-R-300M and the Whisper-Small models, respectively. In addition, the performance of the models was examined with test data prepared from call center records that were not included in the training and validation datasets.

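Summary
The study compares two pre-trained multilingual models, Whisper-Small and Wav2Vec2-XLS-R-300M, on Turkish speech recognition. Both were fine-tuned and tested on the Turkish Mozilla Common Voice version 11.0 dataset, which contains only a small amount of data. The resulting WER is 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. The two models were also tested on data that was not included in the training and validation sets.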
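
Since the comparison above rests entirely on WER, a minimal sketch of the standard word error rate computation may help: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The abstract does not say how the authors normalized text or aggregated over utterances, so the whitespace tokenization and per-utterance scoring here are assumptions, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("bu bir deneme kaydi", "bu deneme kaydi"))
```

On this scale, a WER of 0.16 corresponds to roughly 16 word errors per 100 reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.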

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

paper_url: http://arxiv.org/abs/2307.03183
repo_url: https://github.com/YuanGongND/whisper-at
paper_authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
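for: This paper focuses on Whisper, a speech recognition model trained on a massive labeled speech corpus, and on how it behaves under different acoustic conditions.
methods: The paper first shows that Whisper is very robust to real-world background sounds, yet its audio representation is not noise-invariant and is instead highly correlated with non-speech sounds. Building on this finding, it creates Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it.
results: Whisper-AT performs speech recognition and audio tagging in a single forward pass, with less than 1% extra computational cost.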

Abstract
In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.

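Summary
The paper studies Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. It reports that, although Whisper is very robust to real-world background sounds such as music, its audio representation is not noise-invariant but highly correlated with non-speech sounds, which indicates that Whisper recognizes speech conditioned on the noise type. Based on this finding, the authors build Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it. With less than 1% extra computational cost, Whisper-AT recognizes audio events in addition to spoken text in a single forward pass.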