cs.SD - 2023-09-29

Toward Universal Speech Enhancement for Diverse Input Conditions

paper_url: http://arxiv.org/abs/2309.17384
repo_url: https://github.com/Emrys365/Universal-SE-demo
paper_authors: Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian
for: 本研究旨在开拓一种能够 universally 处理多种输入条件的数据驱动抖音提升技术。
methods: 我们提出了一种独特的抖音提升模型，该模型独立于麦克风通道、信号长度和采样频率。
results: 我们在各种数据集上进行了广泛的实验，结果表明，提出的单一模型可以成功地处理多种输入条件，并且表现出色。

Abstract
The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance.

摘要
过去一个十年，数据驱动的语音提升（SE）技术因深度学习而经历了重大的发展。现有的方法在某些常用的数据集上表现出色，但大多数都只适用于单一的条件（例如单通道、多通道或固定采样频率）或只考虑单一任务（例如干声除或抗雷声）。当前没有一种通用的SE方法，可以有效地处理多个输入条件，使用单个模型。在这篇论文中，我们首次进行了这一研究。我们首先设计了独立于 Microphone 通道、信号长度和采样频率的单一 SE 模型。然后，我们设计了一个通用的 SE 测试准则，将现有的公共数据集合并多个条件。我们在各种数据集上进行了广泛的实验，发现我们提议的单一模型可以成功地处理多个条件，表现出色。

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

paper_url: http://arxiv.org/abs/2309.17352
repo_url: None
paper_authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe
for: 提高 seq2seq 自动音频描述系统的性能
methods: 利用预训练模型和大语言模型（LLM），并使用 BEATs 提取细腻的音频特征，以及 Instructor LLM 提取文本描述的语言特征，并通过 auxiliary InfoNCE 损失函数将语言特征注入到 BEATs 音频特征中。此外，我们也提出了一种新的数据增强方法，使用 ChatGPT 生成caption mix-ups，以增加训练数据的多样性和复杂性。
results: 我们的模型在 Clotho 评分分区上取得了新的状态之冠32.6 SPIDEr-FL 分数，并在 2023 年 DCASE AAC 挑战中赢得了比赛。

Abstract
Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of training data. During inference, we propose to employ nucleus sampling and a hybrid reranking algorithm, which has not been explored in AAC research. Combining our efforts, our model achieves a new state-of-the-art 32.6 SPIDEr-FL score on the Clotho evaluation split, and wins the 2023 DCASE AAC challenge.

摘要

LRPD: Large Replay Parallel Dataset

paper_url: http://arxiv.org/abs/2309.17298
repo_url: https://github.com/idrnd/lrpd-paper-code
paper_authors: Ivan Yakovlev, Mikhail Melnikov, Nikita Bukhal, Rostislav Makarov, Alexander Alenin, Nikita Torgashov, Anton Okhotnikov
For: 本研究旨在提高声音反馈攻击检测（VAS）领域的深度神经网络（DNN）性能，并提供大量数据集来促进神经网络系统的进步。* Methods: 本研究使用了大量的批处理并行数据集（LRPD），该数据集包含了19个录音设备在17个不同环境中收集的超过100万个语音示例。* Results: 基于LRPD数据集的模型在完全未知的条件下表现了一致性，其EER在评估subset中为0.28%，并在ASVpoof 2017评估集上达到11.91%的水平。这些结果表明模型在LRPD数据集上训练后在未知条件下具有良好的表现。

Abstract
The latest research in the field of voice anti-spoofing (VAS) shows that deep neural networks (DNN) outperform classic approaches like GMM in the task of presentation attack detection. However, DNNs require a lot of data to converge, and still lack generalization ability. In order to foster the progress of neural network systems, we introduce a Large Replay Parallel Dataset (LRPD) aimed for a detection of replay attacks. LRPD contains more than 1M utterances collected by 19 recording devices in 17 various environments. We also provide an example training pipeline in PyTorch [1] and a baseline system, that achieves 0.28% Equal Error Rate (EER) on evaluation subset of LRPD and 11.91% EER on publicly available ASVpoof 2017 [2] eval set. These results show that model trained with LRPD dataset has a consistent performance on the fully unknown conditions. Our dataset is free for research purposes and hosted on GDrive. Baseline code and pre-trained models are available at GitHub.

摘要
最新的声音反模仿（VAS）研究显示，深度神经网络（DNN）在声音攻击探测任务中表现出色，超过了经典方法如GMM。然而，DNN需要很多数据来融合，并且还缺乏泛化能力。为推动神经网络系统的进步，我们介绍了一个大量重复并行数据集（LRPD），用于检测重复攻击。LRPD包含了超过100万个语音样本，由19个录音设备在17个不同的环境中采集。我们还提供了一个PyTorch中的训练管道示例和一个基线系统，其在LRPD评估子集上达到0.28%的相同错误率（EER），并在公开可用的ASVpoof 2017评估集上达到11.91%的EER。这些结果表明，使用LRPD数据集训练的模型在完全未知的条件下具有一致的表现。我们的数据集是免费用于研究目的，并在GDrive上hosts。基线代码和预训练模型可以在GitHub上获取。

ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech

paper_url: http://arxiv.org/abs/2309.17056
repo_url: None
paper_authors: Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, Qingyang Hong
for: 这个研究旨在提出一种基于流函数的高质量语音合成方法，以替代传统的数据生成模型。
methods: 我们的方法使用了一种简单的Ordinary Differential Equation（ODE）模型，将Gaussian分布运输到真实的Mel-spectrogram分布中，使用直线路径以最大化运输程度。
results: 我们的实验结果显示，我们的ReFlow-TTS方法在LJSpeechDataset上实现了最高的性能，并且单步抽样比较高质量的语音合成模型。

Abstract
The diffusion models including Denoising Diffusion Probabilistic Models (DDPM) and score-based generative models have demonstrated excellent performance in speech synthesis tasks. However, its effectiveness comes at the cost of numerous sampling steps, resulting in prolonged sampling time required to synthesize high-quality speech. This drawback hinders its practical applicability in real-world scenarios. In this paper, we introduce ReFlow-TTS, a novel rectified flow based method for speech synthesis with high-fidelity. Specifically, our ReFlow-TTS is simply an Ordinary Differential Equation (ODE) model that transports Gaussian distribution to the ground-truth Mel-spectrogram distribution by straight line paths as much as possible. Furthermore, our proposed approach enables high-quality speech synthesis with a single sampling step and eliminates the need for training a teacher model. Our experiments on LJSpeech Dataset show that our ReFlow-TTS method achieves the best performance compared with other diffusion based models. And the ReFlow-TTS with one step sampling achieves competitive performance compared with existing one-step TTS models.

摘要
Diffusion模型，包括Denosing Diffusion Probabilistic Models (DDPM)和分数基本生成模型，在语音合成任务中表现出色，但是其效果带来许多抽样步骤，导致高质量语音合成需要长时间抽样。这种缺点限制了它在实际场景中的实用性。在这篇论文中，我们介绍ReFlow-TTS，一种基于流函数的新方法，用于高质量语音合成。具体来说，ReFlow-TTS是一个Ordinary Differential Equation (ODE)模型，可以将 Gaussian 分布transport到真实的 Mel-spectrogram 分布，以最大程度地保持直线路径。此外，我们提出的方法可以在一步抽样下实现高质量语音合成，并减少了培训教师模型的需求。我们在LJSpeech Dataset上进行了实验，并证明了ReFlow-TTS方法在 diffusion 基于模型中表现最佳，并且ReFlow-TTS 一步抽样方法与现有的一步 TTS 模型相匹配。

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

paper_url: http://arxiv.org/abs/2309.17020
repo_url: https://github.com/lenguyensonnguyen/CS2309_ChuyenDeThiGiacMayTinh_LeNguyenSonNguyen_CH1702039_GiuaKy
paper_authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed
for: 提高低资源自动语音识别的性能
methods: 使用合成语音增强低资源预训练 corpus
results: 成功降低了90%的语音数据需求，只有轻微性能下降

Abstract
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90\% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.

摘要
自我指导学习（SSL）技术在各种语音处理任务中获得了卓越的结果。然而，尚存在很大的挑战是减少预训练中需要庞大量的语音数据的依赖。本文提议使用生成的 sintetic speech 来增强低资源预训练集。我们使用 SSL 特征构建了高质量的 text-to-speech（TTS）系统，生成了大量的 sintetic 训练集，并实验结果表明，我们的提议方法可以将语音数据需求减少90%，只带来轻微的性能下降。这是我们知道的首次尝试增强低资源自我指导学习在语音处理中。

Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features

paper_url: http://arxiv.org/abs/2309.16954
repo_url: None
paper_authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Wenchao Wang, Pengyuan Zhang
for: 本研究旨在提高现代语音识别（SSD）方法的Robustness和可解性，并解决现有方法在某些数据集上的问题。
methods: 本研究使用了分析文本读音（TTS）过程中的发音特征缺陷，并根据发音特征的时间一致性和间隔特征的分布，提出了一种基于时间一致性和发音特征分布的SSD方法。
results: 该方法在跨数据集和静音修剪场景中都表现出了低计算复杂性和良好的性能，能够提高SSD方法的Robustness和可解性。

Abstract
Current synthetic speech detection (SSD) methods perform well on certain datasets but still face issues of robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech. In this paper, the flaws of the speaker features inherent in the text-to-speech (TTS) process are analyzed. Differences in the temporal consistency of intra-utterance speaker features arise due to the lack of fine-grained control over speaker features in TTS. Since the speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features differs between synthetic and bonafide speech. Based on these analyzes, an SSD method based on temporal consistency and distribution of speaker features is proposed. On one hand, modeling the temporal consistency of intra-utterance speaker features can aid speech anti-spoofing. On the other hand, distribution differences in inter-utterance speaker features can be utilized for SSD. The proposed method offers low computational complexity and performs well in both cross-dataset and silence trimming scenarios.

摘要
当前的语音合成检测（SSD）方法在某些数据集上表现良好，但仍面临 robustness 和可读性的问题。一个可能的原因是这些方法不进行真实语音的缺陷分析。在这篇论文中，文本到语音（TTS）过程中的 speaker 特征的缺陷被分析了。因为 TTS 中 speaker 特征的时间一致性不够细致，因此 intra-utterance 中的 speaker 特征存在差异。此外，基于 encoder 提取的 speaker 嵌入有所不同，因此 inter-utterance 中的 speaker 特征的分布不同。根据这些分析，一种基于时间一致性和 speaker 特征分布的 SSD 方法被提议。在一个手上，模拟 intra-utterance 中 speaker 特征的时间一致性可以帮助语音防 spoofing。在另一个手上，inter-utterance 中 speaker 特征的分布差异可以被利用于 SSD。提议的方法具有低计算复杂度，在cross-dataset 和沉寂截取场景中表现良好。

Enhancing Code-switching Speech Recognition with Interactive Language Biases

paper_url: http://arxiv.org/abs/2309.16953
repo_url: None
paper_authors: Hexin Liu, Leibny Paola Garcia, Xiangyu Zhang, Andy W. H. Khong, Sanjeev Khudanpur
for: 本研究旨在提高多语言交互 circumstance 下的自动话语识别（ASR）性能，通过对混合 CTC/注意力 ASR 模型进行多级语言信息偏好。
methods: 本研究使用了 hybrid CTC/注意力 ASR 模型，并通过在不同层次语言信息上进行偏好来提高模型性能。
results: 对 ASRU 2019 多语言交互挑战 dataset 进行了实验，并与基eline 进行了比较。结果表明，提出的交互语言偏好（ILB）方法可以提高模型性能，并且对各种语言偏好和它们之间的交互进行了深入的分析。

Abstract
Languages usually switch within a multilingual speech signal, especially in a bilingual society. This phenomenon is referred to as code-switching (CS), making automatic speech recognition (ASR) challenging under a multilingual scenario. We propose to improve CS-ASR by biasing the hybrid CTC/attention ASR model with multi-level language information comprising frame- and token-level language posteriors. The interaction between various resolutions of language biases is subsequently explored in this work. We conducted experiments on datasets from the ASRU 2019 code-switching challenge. Compared to the baseline, the proposed interactive language biases (ILB) method achieves higher performance and ablation studies highlight the effects of different language biases and their interactions. In addition, the results presented indicate that language bias implicitly enhances internal language modeling, leading to performance degradation after employing an external language model.

摘要
语言通常在多语言交流中切换，特别在双语社会中。这种现象被称为代码切换（CS），使自动听说识别（ASR）在多语言场景下变得更加困难。我们提出了改进CS-ASR方法，使用多级语言信息（frame和token）来偏袋混合CTC/注意力ASR模型。我们在这里探索不同分辨率的语言偏袋之间的交互作用。我们在ASRU 2019代码切换挑战 datasets上进行了实验，相比基eline，我们的互动语言偏袋（ILB）方法实现了更高的性能。归根结底，我们的研究表明，语言偏袋隐式地增强了内部语言模型，导致性能下降 después使用外部语言模型。