results: The method achieves a 3% relative improvement on the accent domain expansion task and, compared with the baseline model, a 37% relative improvement on the same test set. It also achieves a 4% relative improvement on a large amount of data.
Abstract
Spoken language shows significant variation between Mandarin and accented speech. Despite the high performance of Mandarin automatic speech recognition (ASR), accented ASR remains a challenging task. In this paper, we introduce meta-learning techniques for fast accent domain expansion in Mandarin speech recognition, which extend coverage to new accents without degrading Mandarin ASR performance. Meta-learning, or learning-to-learn, captures relations that are general across multiple domains rather than overfitting to a single domain, which is why we adopt it for the domain expansion task; this more general learning improves performance on accent domain extension. We combine meta-learning with freezing of model parameters, which makes recognition performance more stable across conditions and speeds up training by about 20%. Our approach outperforms other methods by about 3% relative on the accent domain expansion task. Compared to the baseline model, it achieves a 37% relative improvement while performance on the Mandarin test set remains unchanged. The method also proves effective on a large amount of data, with a 4% relative improvement on the accent test set.
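To make the training recipe concrete, below is a minimal first-order meta-learning sketch (Reptile-style) in PyTorch that adapts the model on one batch per accent domain while keeping a low-level parameter group frozen. It illustrates the general technique only, not the authors' implementation; `loss_fn`, `frozen_prefix`, and the learning rates are all assumptions.

```python
# Hypothetical sketch: first-order meta-learning across accent domains with
# partial parameter freezing. Not the paper's code; all names are illustrative.
import torch

def meta_step(model, domain_batches, loss_fn, inner_lr=1e-3, meta_lr=1e-2,
              inner_steps=3, frozen_prefix="encoder.feature_extractor"):
    """One meta-update: adapt the model on each accent domain in turn, then
    move the shared weights toward the average of the adapted weights."""
    meta_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    deltas = {n: torch.zeros_like(p) for n, p in meta_params.items()}

    for batch in domain_batches:                       # one batch per domain
        with torch.no_grad():                          # reset to meta-parameters
            for n, p in model.named_parameters():
                p.copy_(meta_params[n])
        trainable = [p for n, p in model.named_parameters()
                     if not n.startswith(frozen_prefix)]  # freeze low layers
        opt = torch.optim.SGD(trainable, lr=inner_lr)
        for _ in range(inner_steps):                   # inner-loop adaptation
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        with torch.no_grad():
            for n, p in model.named_parameters():      # accumulate parameter shift
                deltas[n] += p - meta_params[n]

    with torch.no_grad():                              # outer (meta) update
        for n, p in model.named_parameters():
            p.copy_(meta_params[n] + meta_lr * deltas[n] / len(domain_batches))
```

Frozen parameters never move in the inner loop, so their accumulated shift is zero and they stay fixed across meta-updates, which is the stability-and-speed trade-off the abstract describes.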
results: The platform enables the collection of large amounts of dialectal Arabic speech, offers city/country-level fine-grained dialect selection, and supports collaboration among multiple parties to pool diverse Arabic speech data.
Abstract
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. The platform offers an opportunity to design large dialectal speech datasets and make them publicly available. MyVoice allows contributors to select a city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributor and annotator. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback, which is then reviewed by administrators. Furthermore, the platform gives admin roles the flexibility to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors, enabling collaborative efforts in gathering diverse and large Arabic speech data.
Signal Reconstruction from Mel-spectrogram Based on Bi-level Consistency of Full-band Magnitude and Phase
results: The method achieves good reconstruction quality on speech, music, and environmental signals.
Abstract
We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied to reconstruct a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it relies only on the redundancy of STFT and is applicable to various audio signals. In this paper, we jointly reconstruct the full-band magnitude and phase by considering the bi-level relationships among the time-domain signal, its STFT coefficients, and its mel-spectrogram. The proposed method is formulated as a rigorous optimization problem and estimates the full-band magnitude based on the criterion used in GLA. Our experiments demonstrate the effectiveness of the proposed method on speech, music, and environmental signals.
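For reference, here is a minimal sketch of the conventional two-stage baseline that the proposed joint optimization improves on: approximate the full-band STFT magnitude from the mel-spectrogram via the mel filterbank pseudo-inverse, then run GLA to reconstruct the phase. It uses librosa's public API; the file path and STFT parameters are placeholders, and it does not implement the paper's bi-level formulation.

```python
# Baseline sketch (not the proposed method): mel -> full-band magnitude -> GLA.
import librosa
import soundfile as sf

y, sr = librosa.load("example.wav", sr=22050)      # placeholder input file
n_fft, hop = 1024, 256

# Low-dimensional representation: an 80-band mel magnitude spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=80, power=1.0)

# Stage 1: pseudo-inverse of the mel filterbank -> full-band STFT magnitude.
mag = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft, power=1.0)

# Stage 2: Griffin-Lim phase reconstruction from the estimated magnitude.
y_hat = librosa.griffinlim(mag, n_iter=60, n_fft=n_fft, hop_length=hop)
sf.write("reconstructed.wav", y_hat, sr)
```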
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
results: The authors adopt recent self-supervised learning representations (SSLR) in place of filterbank features and integrate speech separation and recognition through a carefully designed training strategy. The proposed integration achieves a 2.5% word error rate on the reverberant WHAMR! test set, compared with 28.9% for an existing integration of mask-based MVDR beamforming and filterbank features, significantly improving multi-speaker recognition performance.
Abstract
Neural speech separation has made remarkable progress, and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance over the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%).
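As an illustration of the SSLR front-end idea, the sketch below extracts WavLM features with torchaudio's public WAVLM_BASE checkpoint and combines the per-layer outputs with a weighted sum (fixed here, typically learnable). The checkpoint, layer weighting, and input file are assumptions and may differ from the authors' setup.

```python
# Hedged sketch: WavLM-based SSLR as ASR front-end features, replacing
# filterbank features. Uses torchaudio's public WAVLM_BASE checkpoint.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAVLM_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("separated_speech.wav")   # e.g. a separator output
if sr != bundle.sample_rate:                             # WavLM expects 16 kHz
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # One (batch x frames x dim) tensor per transformer layer; a weighted sum
    # over layers is a common choice for a downstream ASR back-end.
    layer_feats, _ = model.extract_features(waveform)
weights = torch.softmax(torch.zeros(len(layer_feats)), dim=0)  # learnable in practice
sslr = sum(w * f for w, f in zip(weights, layer_feats))        # ASR back-end input
```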
Backdoor Attacks against Voice Recognition Systems: A Survey
for: This paper aims to provide a comprehensive survey on backdoor attacks against Voice Recognition Systems (VRSs) and to discuss the feasibility of deploying classic backdoor defense methods and generic audio defense techniques on VRSs.
methods: The paper develops a comprehensive taxonomy of backdoor attacks against VRSs from different perspectives and analyzes the characteristics of each category. It also reviews existing attack methods and classic backdoor defense methods, and discusses the feasibility of deploying them on VRSs.
results: The paper provides a thorough review of backdoor attacks against VRSs and discusses open issues and future research directions in the field. It also offers a comprehensive understanding of the vulnerability of VRSs to backdoor attacks and of potential solutions to mitigate these attacks.
Abstract
Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in various real-world applications, from intelligent voice assistants to telephony surveillance and biometric authentication. However, prior research has revealed the vulnerability of VRSs to backdoor attacks, which pose a significant threat to their security and privacy. Unfortunately, the existing literature lacks a thorough review of this topic. This paper fills that research gap by conducting a comprehensive survey on backdoor attacks against VRSs. We first present an overview of VRSs and backdoor attacks, covering the necessary background. Then we propose a set of evaluation criteria to assess the performance of backdoor attack methods. Next, we present a comprehensive taxonomy of backdoor attacks against VRSs from different perspectives and analyze the characteristics of each category. After that, we comprehensively review existing attack methods and analyze their pros and cons based on the proposed criteria. Furthermore, we review classic backdoor defense methods and generic audio defense techniques, and discuss the feasibility of deploying them on VRSs. Finally, we identify several open issues and suggest future research directions to motivate further research on VRS security.
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
methods: We propose a novel E2E SLU system that fuses audio and text representations based on the estimated modality confidence of ASR hypotheses. It relies on two novel techniques: 1) an effective method to encode the quality of ASR hypotheses, and 2) an effective approach to integrate them into the E2E SLU model.
results: Experiments on the STOP dataset show accuracy improvements, and our analysis demonstrates the effectiveness of the approach.
Abstract
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on STOP dataset and share the analysis to demonstrate the effectiveness of our approach.
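A toy PyTorch sketch of confidence-based fusion of the two modalities is given below; the module layout, dimensions, and the way the confidence is estimated are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch: fuse audio and text embeddings with a scalar confidence
# in the text (ASR-derived) branch. All names and sizes are assumptions.
import torch
import torch.nn as nn

class ConfidenceAwareFusion(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden=256):
        super().__init__()
        self.proj_audio = nn.Linear(audio_dim, hidden)
        self.proj_text = nn.Linear(text_dim, hidden)
        # Estimates how much to trust the text branch, e.g. from the pooled
        # audio/text embeddings (ASR scores could be concatenated in too).
        self.confidence = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, audio_emb, text_emb):
        a = self.proj_audio(audio_emb)                  # (batch, hidden)
        t = self.proj_text(text_emb)                    # (batch, hidden)
        c = self.confidence(torch.cat([a, t], dim=-1))  # (batch, 1) in [0, 1]
        return c * t + (1 - c) * a   # lean on audio when the ASR text is poor

fusion = ConfidenceAwareFusion()
fused = fusion(torch.randn(4, 768), torch.randn(4, 768))  # -> (4, 256)
```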
Estimating speaker direction on a humanoid robot with binaural acoustic signals
results: Experiments and analysis show that the method yields effective location estimates in real-time application scenarios and adapts to different interaction scenarios; it also reduces latency to support real-time dialogue.
Abstract
To achieve human-like behaviour during speech interactions, a humanoid robot needs to estimate the location of a human talker. Here, we present a method to optimize the parameters used for direction of arrival (DOA) estimation, while also considering the real-time requirements of human-robot interaction scenarios. The method is applied to a binaural sound source localization framework on a humanoid robotic head. Real data is collected and annotated for this work. Optimizations are performed via a brute-force method and a Bayesian model-based method; the results are validated and discussed, and effects on latency for real-time use are also explored.
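As background on the kind of estimator whose parameters such a framework tunes, here is a minimal binaural DOA sketch based on GCC-PHAT: estimate the interaural time difference between the two microphones and convert it to an azimuth under a free-field assumption. The microphone spacing and all parameters are illustrative; the paper's actual pipeline may differ.

```python
# Classic binaural DOA sketch via GCC-PHAT; parameters are illustrative.
import numpy as np

def gcc_phat_doa(left, right, fs, mic_distance=0.15, c=343.0):
    """Azimuth in radians; positive means the left microphone hears the
    source first (source on the left)."""
    n = 2 * max(len(left), len(right))
    L = np.fft.rfft(left, n)
    R = np.fft.rfft(right, n)
    cross = R * np.conj(L)                  # peak lag = delay of right vs. left
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # PHAT weighting
    max_lag = int(fs * mic_distance / c)    # physically plausible lag range
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    tdoa = (np.argmax(cc) - max_lag) / fs   # interaural time difference (s)
    return np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0))

# Toy usage: broadband noise, right channel delayed by 3 samples -> source left.
fs = 16000
sig = np.random.randn(fs)
print(np.degrees(gcc_phat_doa(sig, np.roll(sig, 3), fs)))  # roughly +25 degrees
```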