results: Experiments show that on the AliMeeting dataset our model reduces the relative speaker-dependent character error rate (SD-CER) by 6.1% compared with the cascaded SA-ASR model, and achieves an SD-CER comparable to the SOTA joint AR SA-ASR model (34.8%) with only 1/10 of its RTF.
Abstract
Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although these approaches can obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one by one and results in a large real-time factor (RTF). To speed up inference, we introduce the recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining performance comparable to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.
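The single-step decoding idea can be illustrated with a small sketch. Below is a minimal, hypothetical PyTorch example; the module names, dimensions, and the simple linear length predictor are placeholders (not Paraformer's actual CIF predictor or the authors' SA-ASR architecture). The point it shows is that every output position is produced in one parallel forward pass rather than token by token.

```python
# Minimal sketch of single-step (non-autoregressive) decoding, contrasted with
# token-by-token AR decoding. Shapes and module names are illustrative only.
import torch
import torch.nn as nn

class SingleStepDecoder(nn.Module):
    """Predicts all output tokens in one forward pass (Paraformer-style idea)."""
    def __init__(self, d_model=256, vocab=4000, max_len=200):
        super().__init__()
        self.length_predictor = nn.Linear(d_model, 1)      # stand-in for a CIF-like predictor
        self.queries = nn.Parameter(torch.randn(max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, enc):                                 # enc: (B, T, d_model)
        # Estimate how many tokens to emit from the encoder output.
        n_tokens = self.length_predictor(enc).squeeze(-1).sigmoid().sum(dim=1)  # (B,)
        n = int(n_tokens.round().clamp(1, self.queries.size(0)).max().item())
        q = self.queries[:n].unsqueeze(0).expand(enc.size(0), -1, -1)
        # One cross-attention pass produces every token position in parallel,
        # so inference cost does not grow with output length as in AR decoding.
        h, _ = self.attn(q, enc, enc)
        return self.out(h).argmax(-1)                       # (B, n) token ids

enc = torch.randn(2, 120, 256)                              # fake encoder states
print(SingleStepDecoder()(enc).shape)
```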
methods: This paper proposes Envelope Learning, a new approach to Tone Transfer modeling that maps musical events using a training objective at the synthesis parameter level.
results: The approach improves articulation, phrasing ability, and sound diversity in real-time performance settings, and accurately captures the beginnings and endings of musical events.
Abstract
Tone Transfer is a novel deep-learning technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while keeping their musical form content. Due to its good audio quality results and continuous controllability, it has been recently applied in several audio processing tools. Nevertheless, it still presents several shortcomings related to poor sound diversity, and limited transient and dynamic rendering, which we believe hinder its possibilities of articulation and phrasing in a real-time performance context. In this work, we present a discussion on current Tone Transfer architectures for the task of controlling synthetic audio with musical instruments and discuss their challenges in allowing expressive performances. Next, we introduce Envelope Learning, a novel method for designing Tone Transfer architectures that map musical events using a training objective at the synthesis parameter level. Our technique can render note beginnings and endings accurately and for a variety of sounds; these are essential steps for improving musical articulation, phrasing, and sound diversity with Tone Transfer. Finally, we implement a VST plugin for real-time live use and discuss possibilities for improvement.
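To make the "training objective at the synthesis parameter level" idea concrete, here is a minimal sketch, assuming per-frame amplitude and pitch envelopes as the synthesis parameters; the network, feature dimensions, and L1 loss are illustrative placeholders rather than the paper's architecture.

```python
# Minimal sketch of a loss defined at the synthesis-parameter level: the network
# predicts per-frame envelopes and is trained against target envelopes directly,
# rather than against reconstructed audio. All names and sizes are illustrative.
import torch
import torch.nn as nn

class EnvelopePredictor(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_params=2):   # e.g. amplitude + f0 per frame
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_params)

    def forward(self, feats):                                 # feats: (B, T, in_dim)
        h, _ = self.rnn(feats)
        return self.head(h)                                   # (B, T, n_params)

model = EnvelopePredictor()
feats = torch.randn(4, 100, 64)                               # input control features
target_env = torch.randn(4, 100, 2)                           # target synthesis parameters
pred_env = model(feats)
# Supervising envelopes directly makes note onsets/offsets explicit in the objective.
loss = nn.functional.l1_loss(pred_env, target_env)
loss.backward()
print(float(loss))
```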
Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification
results: The proposed Multi-objective Progressive Clustering (MoPC) method achieves 4.95% EER on the evaluation set of VoxSRC 2023 track 3, ranking first. Additional experiments on the FFSVC dataset also yield promising results.
Abstract
Utilizing the pseudo-labeling algorithm with large-scale unlabeled data becomes crucial for semi-supervised domain adaptation in speaker verification tasks. In this paper, we propose a novel pseudo-labeling method named Multi-objective Progressive Clustering (MoPC), specifically designed for semi-supervised domain adaptation. Firstly, we utilize limited labeled data from the target domain to derive domain-specific descriptors based on multiple distinct objectives, namely within-graph denoising, intra-class denoising and inter-class denoising. Then, the Infomap algorithm is adopted for embedding clustering, and the descriptors are leveraged to further refine the target domain's pseudo-labels. Moreover, to further improve the quality of pseudo labels, we introduce the subcenter-purification and progressive-merging strategy for label denoising. Our proposed MoPC method achieves 4.95% EER and ranked the 1$^{st}$ place on the evaluation set of VoxSRC 2023 track 3. We also conduct additional experiments on the FFSVC dataset and yield promising results.
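A rough sketch of the pseudo-labeling-with-purification idea follows. It substitutes k-means for the paper's Infomap clustering and a simple centroid-distance filter for the multi-objective denoising and subcenter purification, so every threshold, shape, and algorithm choice here is an assumption rather than the authors' pipeline.

```python
# Rough sketch: cluster unlabeled speaker embeddings, then keep only samples
# close to their cluster centroid as a stand-in for label purification.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 192))                     # unlabeled target-domain embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)
labels = km.labels_

# Purification: drop samples far from their centroid (low-confidence pseudo-labels).
centroids = km.cluster_centers_[labels]
dist = np.linalg.norm(embeddings - centroids, axis=1)
keep = dist < np.percentile(dist, 80)                         # keep the closest 80%

pseudo_x, pseudo_y = embeddings[keep], labels[keep]
print(pseudo_x.shape, np.unique(pseudo_y).size)
```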
An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation
results: Experimental results show that the task-decoupled model outperforms a single joint network, and that, given the task-decoupling sequence, optimized training strategies further improve performance.
Abstract
Deep learning based techniques have been popularly adopted in acoustic echo cancellation (AEC). Utilization of speaker representation has extended the frontier of AEC, thus attracting many researchers' interest in personalized acoustic echo cancellation (PAEC). Meanwhile, task-decoupling strategies are widely adopted in speech enhancement. To further explore the task-decoupling approach, we propose to use a two-stage task-decoupling post-filter (TDPF) in PAEC. Furthermore, a multi-scale local-global speaker representation is applied to improve speaker extraction in PAEC. Experimental results indicate that the task-decoupling model can yield better performance than a single joint network. The optimal approach is to decouple the echo cancellation from noise and interference speech suppression. Based on the task-decoupling sequence, optimal training strategies for the two-stage model are explored afterwards.
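The task-decoupling idea can be sketched as two cascaded mask estimators: the first conditioned on the far-end reference (echo cancellation), the second on a target-speaker embedding (noise and interfering-speech suppression). The tiny feed-forward stages and feature sizes below are placeholders, not the paper's post-filter.

```python
# Minimal sketch of a two-stage, task-decoupled post-filter. Architectures and
# feature shapes are assumptions used only to show the decoupled data flow.
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, in_dim, feat_dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim), nn.Sigmoid())
    def forward(self, x):                                     # x: (B, T, in_dim)
        return self.net(x)                                    # per-frame magnitude mask

feat_dim, spk_dim = 257, 192
echo_stage = Stage(in_dim=2 * feat_dim)                       # mic features + far-end reference
noise_stage = Stage(in_dim=feat_dim + spk_dim)                # stage-1 output + speaker embedding

mic, far_end = torch.rand(1, 100, feat_dim), torch.rand(1, 100, feat_dim)
spk = torch.rand(1, 100, spk_dim)                             # speaker embedding repeated per frame

# Stage 1: echo cancellation only.
mask1 = echo_stage(torch.cat([mic, far_end], dim=-1))
after_aec = mic * mask1
# Stage 2: personalized noise/interference suppression on the echo-free signal.
mask2 = noise_stage(torch.cat([after_aec, spk], dim=-1))
enhanced = after_aec * mask2
print(enhanced.shape)
```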
Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition
paper_authors: Kaixun Huang, Ao Zhang, Binbin Zhang, Tianyi Xu, Xingchen Song, Lei Xie
for: Improving the recognition performance of automatic speech recognition (ASR) systems on contextual phrases
methods: An attention-based deep contextual biasing method that simultaneously supports explicit and implicit bias
results: Achieves significant improvements on contextual phrases (a 32.0% relative CER reduction) and can be cascaded with shallow fusion methods for better results
Abstract
The attention-based deep contextual biasing method has been demonstrated to effectively improve the recognition performance of end-to-end automatic speech recognition (ASR) systems on given contextual phrases. However, unlike shallow fusion methods that directly bias the posterior of the ASR model, deep biasing methods implicitly integrate contextual information, making it challenging to control the degree of bias. In this study, we introduce a spike-triggered deep biasing method that simultaneously supports both explicit and implicit bias. Moreover, both bias approaches exhibit significant improvements and can be cascaded with shallow fusion methods for better results. Furthermore, we propose a context sampling enhancement strategy and improve the contextual phrase filtering algorithm. Experiments on the public WenetSpeech Mandarin biased-word dataset show a 32.0% relative CER reduction compared to the baseline model, with an impressive 68.6% relative CER reduction on contextual phrases.
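A toy sketch of the spike-triggered mechanism: frames where the CTC blank probability drops (the "spikes") query an attention over encoded contextual phrases, and only those frames receive the bias vector. The dimensions, the 0.5 threshold, and the attention module below are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch: CTC spike detection gates which frames are biased toward the
# contextual-phrase list. All sizes and thresholds are placeholders.
import torch
import torch.nn as nn

d_model, vocab, blank = 256, 4000, 0
enc = torch.randn(1, 50, d_model)                             # encoder states (B, T, D)
ctc_head = nn.Linear(d_model, vocab)
phrase_emb = torch.randn(1, 20, d_model)                      # encoded contextual phrases
bias_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

post = ctc_head(enc).softmax(-1)                              # CTC posteriors
spike = post[..., blank] < 0.5                                # non-blank ("spike") frames, (B, T)

# Bias only the spiking frames with context attended from the phrase list.
bias, _ = bias_attn(enc, phrase_emb, phrase_emb)
biased_enc = torch.where(spike.unsqueeze(-1), enc + bias, enc)
print(int(spike.sum()), biased_enc.shape)
```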
Neural2Speech: A Transfer Learning Framework for Neural-Driven Speech Reconstruction
results: The proposed Neural2Speech reconstructs speech from neural recordings even with only 20 minutes of intracranial data, and significantly outperforms previous baseline methods in terms of speech fidelity and intelligibility.
Abstract
Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have explored the conversion of neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, which is resource-intensive under regular clinical constraints. However, achieving satisfactory performance in reconstructing speech from limited-scale neural recordings has been challenging, mainly due to the complexity of speech representations and the neural data constraints. To overcome these challenges, we propose a novel transfer learning framework for neural-driven speech reconstruction, called Neural2Speech, which consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from the encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align the neural activity and the speech representation for decoding. Remarkably, our proposed Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction even with only 20 minutes of intracranial data, which significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.
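The two training phases can be outlined schematically: pre-train a speech autoencoder on abundant audio, then freeze it and fit a lightweight adaptor that maps neural activity into the autoencoder's latent space. The linear modules and dimensions below are placeholders, assuming mel-like speech features and a fixed-size neural feature vector; this is a sketch of the idea, not the paper's model.

```python
# Schematic of the two-phase transfer learning recipe: (1) speech autoencoder
# pre-training, (2) frozen autoencoder + small neural-to-latent adaptor.
import torch
import torch.nn as nn

latent_dim = 64
encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, latent_dim))  # speech feats -> latent
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 80))  # latent -> speech feats
adaptor = nn.Linear(128, latent_dim)                           # neural activity -> latent (lightweight)

# Phase 1: pre-train the autoencoder on readily available speech frames.
speech = torch.randn(32, 80)
recon_loss = nn.functional.mse_loss(decoder(encoder(speech)), speech)
recon_loss.backward()

# Phase 2: freeze the autoencoder, train only the adaptor on scarce paired data.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)

neural = torch.randn(8, 128)                                   # e.g. a few frames of intracranial features
paired_speech = torch.randn(8, 80)                             # time-aligned speech features
target_latent = encoder(paired_speech).detach()                # latent targets from the frozen encoder
align_loss = nn.functional.mse_loss(adaptor(neural), target_latent)
align_loss.backward()                                          # only the adaptor receives gradients
print(float(recon_loss), float(align_loss))
# At inference, decoder(adaptor(neural)) maps neural activity back to speech features.
```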