paper_authors: Weiran Wang, Rohit Prabhavalkar, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li, James Qin, Xingyu Cai, Adam Stooke, Zhong Meng, CJ Zheng, Yanzhang He, Tara Sainath, Pedro Moreno Mengibar
for: investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries.
methods: use the neural architecture of Google’s universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference.
results: despite the speculation that larger CTC models can perform as well as RNN-T models, the authors observe that a 900M RNN-T model outperforms a 1.8B CTC model and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
Abstract
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and the models' generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
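As a rough illustration of the time-reduction idea, a funnel-style pooling layer can be sketched as strided pooling over the time axis placed between encoder blocks. The module below is a minimal sketch assuming PyTorch and simple stride-2 average pooling; the paper's actual funnel pooling configuration and reduction factors are not specified here.

```python
import torch
import torch.nn as nn

class TimeReductionPool(nn.Module):
    """Illustrative funnel-style pooling: halves the encoder frame rate by
    average-pooling along the time axis (assumption: stride-2 average pooling;
    the paper's exact reduction may differ)."""
    def __init__(self, stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features) -> pool over the time dimension
        x = x.transpose(1, 2)          # (batch, features, time)
        x = self.pool(x)               # (batch, features, time // stride)
        return x.transpose(1, 2)       # (batch, time // stride, features)

# Stacking two such layers reduces the frame rate by 4x, shortening the
# sequence that the CTC/RNN-T decoder has to process.
frames = torch.randn(8, 400, 512)      # ~4 s of 10 ms frames (hypothetical shape)
reduced = TimeReductionPool()(TimeReductionPool()(frames))
print(reduced.shape)                   # torch.Size([8, 100, 512])
```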
VIC-KD: Variance-Invariance-Covariance Knowledge Distillation to Make Keyword Spotting More Robust Against Adversarial Attacks
results: experimental results show that the proposed method improves robust accuracy over current state-of-the-art robust distillation methods (ARD and RSLAD) by 12% and 8%, respectively.
Abstract
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
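The variance-invariance-covariance terms in the title follow the VICReg family of objectives. Below is a minimal sketch of such a geometric regularizer applied to teacher and student latent embeddings, assuming PyTorch and standard VICReg-style terms; the loss weights and how this term is combined with the distillation objective are placeholders, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def vic_regularizer(student_z: torch.Tensor, teacher_z: torch.Tensor, eps: float = 1e-4):
    """VICReg-style geometric prior on latent representations (sketch).
    student_z, teacher_z: (batch, dim) embeddings."""
    # Invariance: student embeddings should match the teacher's.
    invariance = F.mse_loss(student_z, teacher_z)

    # Variance: keep each embedding dimension's std above 1 (hinge loss).
    std = torch.sqrt(student_z.var(dim=0) + eps)
    variance = torch.mean(F.relu(1.0 - std))

    # Covariance: decorrelate embedding dimensions (penalize off-diagonals).
    z = student_z - student_z.mean(dim=0)
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    covariance = (off_diag ** 2).sum() / d

    # Weighting coefficients are illustrative placeholders, not the paper's values.
    return 25.0 * invariance + 25.0 * variance + 1.0 * covariance
```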
DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis
results: experimental results show that the proposed expressive TTS model outperforms state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
Abstract
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure is adopted in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model. Meanwhile, the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are incorporated into the frame-level encoders to improve the modeling of expressiveness. A denoiser combining a denoising diffusion probabilistic model (DDPM) for mel-spectrograms with SAIN modules is employed to further improve synthetic speech quality and expressiveness. Experimental results show that the proposed expressive TTS model achieves better performance than state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
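Style-Adaptive Instance Normalization can be pictured as instance normalization whose scale and shift are predicted from a style embedding, in the spirit of AdaIN. The sketch below assumes PyTorch and illustrative layer sizes; it is not the exact SAIN module used in DurIAN-E.

```python
import torch
import torch.nn as nn

class StyleAdaptiveInstanceNorm(nn.Module):
    """Sketch of a SAIN-style layer: instance-normalize frame-level features,
    then re-scale and re-shift them with gain/bias predicted from a style
    embedding (assumption: AdaIN-like formulation; sizes are illustrative)."""
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)  # predicts (gamma, beta)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        x = self.norm(x)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
```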
A Study on Incorporating Whisper for Robust Speech Assessment
methods: the first part of this study investigates the correlation between Whisper's embedding features and those of two self-supervised learning (SSL) models with subjective quality and intelligibility scores; the second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model; the third part analyzes the possibility of combining representations from Whisper and SSL models within MOSA-Net+.
results: experimental results show that Whisper's embedding features correlate more strongly with subjective quality and intelligibility scores, improving the prediction performance of MOSA-Net+, while combining the representations of Whisper and SSL models yields only marginal improvement. Compared with MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ achieves notable improvements in estimating subjective quality and intelligibility scores, and it obtained the top-ranked performance on Track 3 of the VoiceMOS Challenge 2023.
Abstract
This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, by leveraging the acoustic features from large pre-trained weakly supervised models, namely Whisper, to create embedding features. The first part of this study investigates the correlation between the embedding features of Whisper and two self-supervised learning (SSL) models with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than the SSL models' embedding features, contributing to more accurate prediction performance achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and SSL models only leads to marginal improvement. As compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
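As a rough picture of how utterance-level embeddings feed a speech assessment model, the sketch below assumes Whisper encoder states have already been extracted and mean-pooled per utterance, and uses a small hypothetical regression head with two outputs; it is not MOSA-Net+'s actual architecture.

```python
import torch
import torch.nn as nn

class SpeechAssessmentHead(nn.Module):
    """Predicts subjective quality and intelligibility scores from
    utterance-level embeddings (hypothetical head, not MOSA-Net+ itself)."""
    def __init__(self, embed_dim: int = 1280):  # 1280 = Whisper large encoder width
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Dropout(0.3)
        )
        self.quality_head = nn.Linear(256, 1)          # e.g. MOS-style quality
        self.intelligibility_head = nn.Linear(256, 1)  # e.g. intelligibility score

    def forward(self, emb: torch.Tensor):
        h = self.backbone(emb)
        return self.quality_head(h), self.intelligibility_head(h)

# emb: (batch, embed_dim) mean-pooled Whisper encoder states (assumed precomputed)
emb = torch.randn(4, 1280)
quality, intelligibility = SpeechAssessmentHead()(emb)
```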
CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers
results: experimental results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switch cases.
Abstract
It is challenging to build a multi-singer high-fidelity singing voice synthesis system with cross-lingual ability by only using monolingual singers in the training stage. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we utilize the International Phonetic Alphabet to unify the representation for all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model for better pronunciation when singers meet unseen languages. Additionally, a gradient reversal layer (GRL) is utilized to remove singer biases included in the lyrics: since all singers are monolingual, a singer's identity is implicitly associated with the text. The experiments are conducted on a combination of three singing voice datasets: the Japanese Kiritan dataset, the English NUS-48E dataset, and an internal Chinese dataset. The results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switch cases.
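The gradient reversal layer is a standard construction: identity in the forward pass, negated (and optionally scaled) gradients in the backward pass, so that an auxiliary singer classifier pushes the text encoder to discard singer identity. A minimal PyTorch sketch, with the downstream classifier and text features as hypothetical placeholders:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal: identity on the forward pass, negates (and scales)
    gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage sketch: a singer classifier on top of reversed text features makes the
# text encoder adversarial to singer identification, removing singer bias.
# (The linear classifier and feature shapes below are hypothetical.)
text_features = torch.randn(2, 80, 256, requires_grad=True)
singer_logits = torch.nn.Linear(256, 4)(grad_reverse(text_features))
```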
NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization
results: the system was included in NTT's submission to the distant automatic speech recognition task of the CHiME-7 challenge, achieving 65% and 62% relative improvements on the development and evaluation sets over the organizer-provided VC-based baseline system.
Abstract
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
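DOVER-LAP performs label mapping and weighted voting across the channel-wise diarization hypotheses. As a much-simplified, runnable illustration of the voting idea only (it omits the label-mapping and hypothesis weighting that DOVER-LAP actually performs), frame-level speaker-activity matrices from each channel can be fused by majority vote:

```python
import numpy as np

def majority_vote_fusion(channel_activities):
    """channel_activities: list of (frames, speakers) binary matrices, one per
    channel, already mapped to a common speaker order (a strong simplification;
    DOVER-LAP itself estimates this mapping and weights each hypothesis)."""
    stacked = np.stack(channel_activities)             # (channels, frames, speakers)
    votes = stacked.sum(axis=0)                        # per-frame, per-speaker vote count
    return (votes > stacked.shape[0] / 2).astype(int)  # active if most channels agree

# Toy example: three channels, 5 frames, 2 speakers
channels = [np.random.randint(0, 2, size=(5, 2)) for _ in range(3)]
fused = majority_vote_fusion(channels)
```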
SPGM: Prioritizing Local Features for enhanced speech separation performance
paper_authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma
for: improve the performance of speech separation models (e.g., Sepformer) while reducing the number of parameters.
methods: replace the inter-blocks with a Single-Path Global Modulation (SPGM) block, which consists of a parameter-free global pooling module followed by a modulation module accounting for only 2% of the model's total parameters.
results: SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding Sepformer and matching recent SOTA models with up to 8 times fewer parameters.
Abstract
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters.
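The abstract does not spell out the SPGM internals beyond a parameter-free global pooling module followed by a small modulation module, so the block below is a plausible sketch under those constraints, assuming PyTorch, mean pooling over time, and a sigmoid-gated channel modulation; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class GlobalModulationBlock(nn.Module):
    """Sketch of an SPGM-style block: a parameter-free global pooling step
    summarizes the whole sequence, and a small modulation MLP rescales the
    local features with that summary (gating and sizes are assumptions)."""
    def __init__(self, channels: int):
        super().__init__()
        self.modulation = nn.Sequential(
            nn.Linear(channels, channels), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) local features from the intra-blocks
        global_summary = x.mean(dim=1)           # parameter-free global pooling
        gate = self.modulation(global_summary)   # (batch, channels)
        return x * gate.unsqueeze(1)             # inject global context into every frame
```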
paper_authors: Ross Cutler, Ando Saabas, Tanel Parnamaa, Marju Purin, Evgenii Indenbom, Nicolae-Catalin Ristea, Jegor Gužvin, Hannes Gamper, Sebastian Braun, Robert Aichner
for: stimulate research in acoustic echo cancellation (AEC), an important area of speech enhancement that remains a top issue in audio communication.
methods: the fourth AEC challenge adds a second track for personalized acoustic echo cancellation, reduces the algorithmic + buffering latency to 20 ms, and includes a full-band version of AECMOS.
results: the challenge open-sources two large training datasets, consisting of recordings from more than 10,000 real audio devices and human speakers in real environments as well as a synthetic dataset; winners were selected based on the average mean opinion score (MOS) across all scenarios and the word accuracy (WAcc) rate.
Abstract
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.
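The abstract states that winners were ranked by the average MOS across scenarios together with WAcc, but does not give the exact combination. The snippet below is a purely illustrative ranking score under an assumed equal weighting and MOS normalization; it is not the challenge's official formula.

```python
def challenge_score(scenario_mos, wacc, mos_weight=0.5):
    """Illustrative ranking score: average MOS across scenarios blended with
    word accuracy. The equal weighting and normalization here are assumptions,
    NOT the official ICASSP 2023 AEC Challenge formula."""
    avg_mos = sum(scenario_mos) / len(scenario_mos)
    mos_normalized = (avg_mos - 1.0) / 4.0       # map MOS in [1, 5] to [0, 1]
    return mos_weight * mos_normalized + (1 - mos_weight) * wacc

# Example: MOS per scenario (single talk, double talk, ...) and a WAcc of 0.82
print(challenge_score([4.2, 3.9, 4.0], wacc=0.82))
```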