results: Achieves an improvement of almost 0.84 points in PESQ and 1% in STOI over unprocessed noisy speech, roughly matching the baseline methods while greatly reducing computational cost.
Abstract
Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented, using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it over colored spectrograms of speech to denoise them. After denoising, the colors of the spectrograms are translated to magnitudes of the short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are then combined with the noisy phases to obtain the enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1% in short-term objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to that achieved by the baseline methods used for comparison, but at a much reduced computational cost. The proposed solution offers a comparable PESQ score at almost 10 times lower computational cost than a similar baseline model trained on grayscale spectrograms that yields the highest PESQ score, and only a 1% deficit in STOI at 28 times lower computational cost compared to another baseline based on a convolutional neural network GAN (CNN-GAN) that produces the most intelligible speech.
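As a rough illustration of the pipeline described in the abstract, the sketch below reconstructs enhanced speech by rendering the noisy STFT magnitude as a colored image, denoising it, mapping colors back to magnitudes, and inverting with the noisy phase. The STFT parameters, the viridis colormap, and the callables `denoise_spectrogram` (the pix2pix-style generator) and `colors_to_magnitude` (the shallow regressor) are placeholders assumed here, not details taken from the paper.

```python
# Minimal sketch of the inference pipeline, assuming trained placeholder models
# `denoise_spectrogram` (colored spectrogram -> colored spectrogram) and
# `colors_to_magnitude` (RGB pixels -> STFT magnitude). Not the authors' code.
import numpy as np
import librosa
import matplotlib.cm as cm

N_FFT, HOP = 512, 128  # assumed STFT settings

def enhance(noisy_wav, sr, denoise_spectrogram, colors_to_magnitude):
    # 1. STFT of the noisy speech; keep the noisy phase for reconstruction.
    stft = librosa.stft(noisy_wav, n_fft=N_FFT, hop_length=HOP)
    noisy_mag, noisy_phase = np.abs(stft), np.angle(stft)

    # 2. Render the magnitude as a colored (RGB) spectrogram image.
    log_mag = librosa.amplitude_to_db(noisy_mag, ref=np.max)
    norm = (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min() + 1e-8)
    colored = cm.viridis(norm)[..., :3]               # (freq, time, 3) in [0, 1]

    # 3. Denoise the image with the pix2pix-style generator (assumed trained).
    denoised_colored = denoise_spectrogram(colored)

    # 4. Map RGB pixels back to STFT magnitudes with the shallow regressor.
    est_mag = colors_to_magnitude(denoised_colored)   # (freq, time)

    # 5. Combine estimated magnitudes with the *noisy* phase and invert.
    enhanced_stft = est_mag * np.exp(1j * noisy_phase)
    return librosa.istft(enhanced_stft, hop_length=HOP, length=len(noisy_wav))
```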
Real-time Neonatal Chest Sound Separation using Deep Learning
results: On an artificial dataset, the model outperforms previous methods by 2.01 dB to 5.06 dB in objective distortion measures while improving computation time by at least 17 times, so it can serve as a preprocessing step for any chest auscultation monitoring system.
Abstract
Auscultation for neonates is a simple and non-invasive method of diagnosing cardiovascular and respiratory disease. Such diagnosis often requires high-quality heart and lung sounds to be captured during auscultation. However, in most cases, obtaining such high-quality sounds is non-trivial because the recorded chest sounds contain a mixture of heart, lung, and noise sounds. As such, additional preprocessing is needed to separate the chest sounds into heart and lung sounds. This paper proposes a novel deep-learning approach to separate such chest sounds into heart and lung sounds. Inspired by the Conv-TasNet model, the proposed model has an encoder, a decoder, and a mask generator. The encoder consists of a 1D convolution and the decoder of a transposed 1D convolution, while the mask generator is constructed from stacked 1D convolutions and transformers. The proposed model outperforms previous methods in terms of objective distortion measures by 2.01 dB to 5.06 dB on the artificial dataset, as well as in computation time, with at least a 17-fold improvement. Therefore, our proposed model could be a suitable preprocessing step for any phonocardiogram-based health monitoring system.
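A minimal PyTorch sketch of the encoder / mask-generator / decoder layout described in the abstract follows. All layer sizes, the number of stacked convolutions, and the transformer depth are illustrative assumptions rather than the paper's reported configuration.

```python
# Sketch of a Conv-TasNet-style separator with a 1D conv encoder, a transposed
# 1D conv decoder, and a mask generator built from stacked 1D convs + transformers.
# Hyperparameters are guesses for illustration only.
import torch
import torch.nn as nn

class ChestSoundSeparator(nn.Module):
    def __init__(self, n_filters=256, kernel=16, stride=8, n_sources=2):
        super().__init__()
        self.n_sources = n_sources
        # Encoder: 1D convolution mapping the waveform to a latent representation.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=kernel, stride=stride, bias=False)
        # Mask generator: stacked 1D convolutions followed by a transformer encoder.
        self.conv_stack = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=n_filters, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mask_head = nn.Conv1d(n_filters, n_filters * n_sources, kernel_size=1)
        # Decoder: transposed 1D convolution mapping masked latents back to waveforms.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=kernel, stride=stride, bias=False)

    def forward(self, mixture):                       # mixture: (batch, 1, time)
        latent = self.encoder(mixture)                # (batch, F, frames)
        h = self.conv_stack(latent)
        h = self.transformer(h.transpose(1, 2)).transpose(1, 2)
        masks = torch.sigmoid(self.mask_head(h))      # (batch, F * sources, frames)
        masks = masks.view(mixture.size(0), self.n_sources, -1, masks.size(-1))
        # Apply one mask per source to the shared latent and decode heart / lung estimates.
        sources = [self.decoder(latent * masks[:, i]) for i in range(self.n_sources)]
        return torch.stack(sources, dim=1)            # (batch, sources, 1, time)
```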
Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning
results: Semi-supervised training on multi-domain data improves the VITS model, enabling it to synthesize speech in a variety of styles and emotions.
Abstract
This paper aims to build an expressive TTS system for multiple speakers, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, we construct positive-negative sample pairs at both the utterance and category (such as emotion-happy or style-poet or speaker A) levels and leverage contrastive learning to better extract disentangled style, emotion, and speaker representations from speech. Furthermore, we introduce a semi-supervised training strategy to the proposed approach to effectively leverage multi-domain data, including style-labeled data, emotion-labeled data, and unlabeled data. We integrate the learned representations into an improved VITS model, enabling it to synthesize expressive speech with diverse styles and emotions for a target speaker. Experiments on multi-domain data demonstrate the effectiveness of our design.
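The sketch below shows one way the category-level positive-negative pairs could be turned into a contrastive objective over style/emotion embeddings, using a generic supervised-contrastive formulation. The function name, temperature, and batching assumptions are illustrative; the paper's exact loss and sampling scheme may differ.

```python
# Generic category-level contrastive loss over style/emotion embeddings, in the
# spirit of the positive-negative pairs described above. Assumed shapes and
# hyperparameters; not the authors' exact formulation.
import torch
import torch.nn.functional as F

def category_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (batch, dim) style/emotion embeddings; labels: (batch,) category ids
    (e.g. emotion-happy, style-poet, speaker A). Pairs sharing a label are positives."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))             # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: same-category pairs, excluding the anchor itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    # Average log-probability of the positives for each anchor.
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / n_pos
    # Only anchors that actually have a positive contribute to the loss
    # (assumes each batch contains at least one same-category pair).
    return per_anchor[pos_mask.any(dim=1)].mean()
```

In the semi-supervised setting described in the abstract, such a category-level term would presumably be applied only to the labeled portion of each batch, with unlabeled utterances contributing through utterance-level pairs and the TTS training objective.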