cs.SD - 2023-07-24

An objective evaluation of Hearing Aids and DNN-based speech enhancement in complex acoustic scenes

  • paper_url: http://arxiv.org/abs/2307.12888
  • repo_url: https://github.com/enricguso/guso_waspaa23
  • paper_authors: Enric Gusó, Joanna Luberadzka, Martí Baig, Umut Sayin Saraç, Xavier Serra
  • for: To objectively evaluate five high-end commercially available Hearing Aid (HA) devices and compare them against DNN-based speech enhancement algorithms in complex acoustic environments.
  • methods: Measure the HRTFs of a single HA device and use them to synthesize a binaural dataset for training two state-of-the-art causal and non-causal DNN enhancement models; then generate an evaluation set of realistic speech-in-noise scenes with an Ambisonics loudspeaker setup and record it with a KU100 dummy head wearing each HA device, both with and without the conventional HA algorithms (a minimal synthesis sketch follows this entry).
  • results: The DNN-based enhancement outperforms the conventional HA algorithms in terms of noise suppression and objective intelligibility metrics.
    Abstract We investigate the objective performance of five high-end commercially available Hearing Aid (HA) devices compared to DNN-based speech enhancement algorithms in complex acoustic environments. To this end, we measure the HRTFs of a single HA device to synthesize a binaural dataset for training two state-of-the-art causal and non-causal DNN enhancement models. We then generate an evaluation set of realistic speech-in-noise situations using an Ambisonics loudspeaker setup and record with a KU100 dummy head wearing each of the HA devices, both with and without the conventional HA algorithms, applying the DNN enhancers to the latter. We find that the DNN-based enhancement outperforms the HA algorithms in terms of noise suppression and objective intelligibility metrics.
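
A minimal sketch of how such a binaural training set could be synthesized from the measured HRTFs, assuming the HRIRs are available as time-domain filters per source direction; the function names and the SNR-mixing heuristic below are illustrative, not the authors' code:

```python
# Binaural data synthesis by HRIR convolution (illustrative sketch).
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_l, hrir_r):
    """Convolve a mono signal with the left/right HRIRs measured on the HA device."""
    return np.stack([fftconvolve(mono, hrir_l), fftconvolve(mono, hrir_r)])

def mix_at_snr(target, noise, snr_db):
    """Scale the binaural noise so the mixture reaches the requested SNR.
    Assumes the noise is at least as long as the target."""
    noise = noise[:, :target.shape[1]]
    gain = np.sqrt(np.sum(target ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    return target + gain * noise

# speech, noise: mono numpy arrays; hrirs: dict mapping direction -> (left, right) HRIR
# binaural_speech = spatialize(speech, *hrirs[(0, 0)])    # frontal target talker
# binaural_noise  = spatialize(noise,  *hrirs[(90, 0)])   # lateral interferer
# mixture = mix_at_snr(binaural_speech, binaural_noise, snr_db=0)
```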

Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

  • paper_url: http://arxiv.org/abs/2307.13012
  • repo_url: None
  • paper_authors: Martin Lebourdais, Théo Mariotte, Marie Tahon, Anthony Larcher, Antoine Laurent, Silvio Montresor, Sylvain Meignier, Jean-Hugh Thomas
  • for: To provide a complete, new benchmark of voice activity detection (VAD) and overlapped speech detection (OSD) models across single/multi-channel audio setups and multiple speech domains.
  • methods: A multi-class classification model that trains VAD and OSD jointly, combining a Temporal Convolutional Network with speech representations adapted to each setup (see the sketch after this entry).
  • results: The joint 2/3-class systems outperform state-of-the-art results across domains and channel setups, matching the F1-scores of two dedicated VAD and OSD systems while reducing the training cost.
    Abstract Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performances in terms of F1-score to two dedicated VAD and OSD systems while reducing the training cost. This unique architecture can also be used for single and multichannel speech processing.
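
A minimal PyTorch sketch of the joint VAD/OSD idea as a 3-class frame classifier on top of a dilated temporal convolutional stack; the layer sizes and feature dimensions are placeholders, not the paper's exact architecture:

```python
# Joint VAD/OSD as a 3-class frame classifier over a TCN-style stack (sketch).
import torch
import torch.nn as nn

class JointVadOsd(nn.Module):
    """Predicts {non-speech, speech, overlapped speech} per frame."""
    def __init__(self, feat_dim=80, channels=128, n_blocks=4, n_classes=3):
        super().__init__()
        layers, in_ch = [], feat_dim
        for b in range(n_blocks):                       # dilated 1-D convolutions
            layers += [nn.Conv1d(in_ch, channels, 3, padding=2 ** b, dilation=2 ** b),
                       nn.BatchNorm1d(channels), nn.ReLU()]
            in_ch = channels
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_classes, 1)   # frame-wise class logits

    def forward(self, feats):                           # feats: (batch, feat_dim, frames)
        return self.head(self.tcn(feats))               # (batch, 3, frames)

# logits = JointVadOsd()(torch.randn(8, 80, 500))
# VAD = speech OR overlap; OSD = the overlap class only, recovered from the same logits.
```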

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.12767
  • repo_url: None
  • paper_authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
  • for: To improve the accuracy and robustness of streaming automatic speech recognition, particularly in out-of-domain situations.
  • methods: Integrates frame-based models (CTC/transducers) with a label-based attention encoder-decoder by alternating frame-synchronous (F-Sync) and label-synchronous (L-Sync) decoding within a single beam-search scheme; F-Sync decoding leads block-wise processing, while L-Sync decoding prioritizes hypotheses using look-ahead future frames within a block (a simplified sketch follows this entry).
  • results: The proposed search algorithm achieves lower error rates than other search methods while remaining robust in out-of-domain situations.
    Abstract Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention encoder-decoder mitigates this issue using soft attention to the input, while it tends to overestimate labels biased towards its training domain, unlike CTC. We exploit these complementary attributes and propose to integrate the frame- and label-synchronous (F-/L-Sync) decoding alternately performed within a single beam-search scheme. F-Sync decoding leads the decoding for block-wise processing, while L-Sync decoding provides the prioritized hypotheses using look-ahead future frames within a block. We maintain the hypotheses from both decoding methods to perform effective pruning. Experiments demonstrate that the proposed search algorithm achieves lower error rates compared to the other search methods, while being robust against out-of-domain situations.
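
A simplified sketch of the alternating frame-/label-synchronous beam search; `ctc_prefix_scores` and `attention_rescore` are placeholder scorers standing in for the frame-based model and the attention decoder, and the pruning logic is reduced to a single merged top-k step:

```python
# Alternating F-Sync / L-Sync beam search within one scheme (control-flow sketch).
import heapq

def integrated_beam_search(blocks, ctc_prefix_scores, attention_rescore, beam=8, sos=1):
    hyps = {(sos,): 0.0}                                   # prefix (tuple of tokens) -> log-score
    for block in blocks:                                   # block-wise streaming input
        # F-Sync step: expand prefixes frame-synchronously with the frame-based scorer.
        expanded = {}
        for prefix, score in hyps.items():
            for token, tok_score in ctc_prefix_scores(prefix, block):
                cand = prefix + (token,)
                expanded[cand] = max(expanded.get(cand, float("-inf")), score + tok_score)
        # L-Sync step: rescore surviving prefixes with the attention decoder, which can
        # look ahead to future frames inside the current block.
        rescored = {p: s + attention_rescore(p, block) for p, s in expanded.items()}
        # Keep hypotheses from both scorers so neither view prunes the other's best paths.
        pool = {}
        for p, s in list(expanded.items()) + list(rescored.items()):
            pool[p] = max(pool.get(p, float("-inf")), s)
        hyps = dict(heapq.nlargest(beam, pool.items(), key=lambda kv: kv[1]))
    return max(hyps.items(), key=lambda kv: kv[1])[0]      # best prefix after the last block
```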

Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN

  • paper_url: http://arxiv.org/abs/2307.12759
  • repo_url: https://github.com/sage-khan/code-switched-noisy-urdu-asr
  • paper_authors: Muhammad Danyal Khan, Raheem Ali, Arshad Aziz
  • for: To develop a resource-efficient Automatic Speech Recognition (ASR) system for code-switched Urdu in a noisy call-center environment.
  • methods: A chain hybrid HMM and CNN-TDNN approach, which combines the advantages of HMM and DNN models with less labelled data; the CNN front-end adds a frequency dimension that improves accuracy in noisy conditions (a rough layer-stack sketch follows this entry).
  • results: A Word Error Rate (WER) of 5.2% in both noisy and clean environments, outperforming other ASR systems for code-switched Urdu, on isolated words, numbers, and continuous spontaneous speech.
    Abstract Call centers have huge amounts of audio data that can be used to derive valuable business insights, yet transcribing phone calls manually is a tedious task. An effective Automatic Speech Recognition system can accurately transcribe these calls, making it easy to search the call history for specific context and content, and enabling automatic call monitoring and QoS improvement through keyword search and sentiment analysis. ASR for call centers requires extra robustness, as telephonic environments are generally noisy. Moreover, many low-resourced languages on the verge of extinction could be preserved with the help of Automatic Speech Recognition technology. Urdu is the $10^{th}$ most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it remains a resource-constrained language in ASR. Regional call-center conversations operate in the local language, with a mix of English numbers and technical terms, generally causing a "code-switching" problem. Hence, this paper describes an implementation framework for a resource-efficient Automatic Speech Recognition / Speech-to-Text system in a noisy call-center environment, using a chain hybrid HMM and CNN-TDNN for code-switched Urdu. The hybrid HMM-DNN approach allowed us to exploit the advantages of neural networks with less labelled data. Adding a CNN to the TDNN has been shown to work better in noisy environments, since the CNN's additional frequency dimension captures extra information from noisy speech and thus improves accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content, covering the Urdu language as well as commonly used words from other languages, primarily English, and achieved a WER of 5.2% in both noisy and clean environments, on isolated words and numbers as well as on continuous spontaneous speech.
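
A rough PyTorch sketch of a CNN-TDNN acoustic-model stack of the kind described above; the actual system is a Kaldi-style chain model, so the layer sizes and output layer here are illustrative placeholders:

```python
# CNN-TDNN acoustic model stack (assumed approximation, not the Kaldi recipe itself).
import torch
import torch.nn as nn

class CnnTdnn(nn.Module):
    """2-D CNN front-end over time/frequency followed by dilated 1-D (TDNN) layers."""
    def __init__(self, n_mels=40, n_pdfs=2000):
        super().__init__()
        self.cnn = nn.Sequential(                       # learns time-frequency patterns,
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),  # helping with noisy telephony audio
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.tdnn = nn.Sequential(                      # temporal context via dilation
            nn.Conv1d(32 * n_mels, 512, 3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, 3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, 3, dilation=3, padding=3), nn.ReLU())
        self.out = nn.Conv1d(512, n_pdfs, 1)            # per-frame senone/pdf logits for the HMM

    def forward(self, feats):                           # feats: (batch, 1, n_mels, frames)
        x = self.cnn(feats)                             # (batch, 32, n_mels, frames)
        x = x.flatten(1, 2)                             # (batch, 32 * n_mels, frames)
        return self.out(self.tdnn(x))

# logits = CnnTdnn()(torch.randn(4, 1, 40, 300))
```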

IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models

  • paper_url: http://arxiv.org/abs/2307.13005
  • repo_url: None
  • paper_authors: Hiromu Yakura, Masataka Goto
  • for: To help novice users freely generate music audio with text-to-audio models, even without musical knowledge such as chord progressions and instrumentation.
  • methods: Builds on text-to-audio generation and provides a dedicated interface, IteraTTA, that helps users iteratively refine text prompts and select favorable audio priors from previously generated audios (a hypothetical interaction-loop sketch follows this entry).
  • results: This dual-sided exploration lets users discern the impact of different text prompts and audio priors on the generation results and progressively reach their loosely specified goals.
    Abstract Recent text-to-audio generation techniques have the potential to allow novice users to freely generate music audio. Even if they do not have musical knowledge, such as about chord progressions and instruments, users can try various text prompts to generate audio. However, compared to the image domain, gaining a clear understanding of the space of possible music audios is difficult because users cannot listen to the variations of the generated audios simultaneously. We therefore facilitate users in exploring not only text prompts but also audio priors that constrain the text-to-audio music generation process. This dual-sided exploration enables users to discern the impact of different text prompts and audio priors on the generation results through iterative comparison of them. Our developed interface, IteraTTA, is specifically designed to aid users in refining text prompts and selecting favorable audio priors from the generated audios. With this, users can progressively reach their loosely-specified goals while understanding and exploring the space of possible results. Our implementation and discussions highlight design considerations that are specifically required for text-to-audio models and how interaction techniques can contribute to their effectiveness.
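
A hypothetical sketch of the dual-sided exploration loop an IteraTTA-style interface supports; `generate_audio`, `play`, and `user_picks` stand in for the text-to-audio model and UI callbacks and are not part of the paper's implementation:

```python
# Dual-sided exploration of text prompts and audio priors (hypothetical sketch).
from collections import namedtuple

# The user's decision after listening: either a chosen clip to use as the next
# audio prior, or an edited text prompt (unused field left as None).
Choice = namedtuple("Choice", ["audio_prior", "new_prompt"])

def explore(initial_prompt, generate_audio, play, user_picks, rounds=3, n_variants=4):
    """Iterate over text prompts and audio priors together."""
    prompt, audio_prior = initial_prompt, None
    for _ in range(rounds):
        # Generate several variants so their differences can be compared side by side.
        candidates = [generate_audio(prompt, audio_prior) for _ in range(n_variants)]
        for clip in candidates:
            play(clip)
        choice = user_picks(prompt, candidates)
        if choice.audio_prior is not None:      # constrain the next round with this audio
            audio_prior = choice.audio_prior
        if choice.new_prompt is not None:       # or refine the text side instead
            prompt = choice.new_prompt
    return audio_prior, prompt
```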

A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization

  • paper_url: http://arxiv.org/abs/2307.12659
  • repo_url: None
  • paper_authors: Edward Fish, Umberto Michieli, Mete Ozay
  • for: To enable deployment of large automatic speech recognition (ASR) models on mobile devices through personalized, label-free model quantization.
  • methods: Proposes myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users and target domains under any memory budget, with no fine-tuning; myQASR estimates the quantization sensitivity of network layers by analysing full-precision activation values on a small set of unlabelled target-domain samples, then derives a personalized mixed-precision scheme (see the sketch after this entry).
  • results: On large-scale ASR models, myQASR improves performance for specific genders, languages, and speakers.
    Abstract Recent advancement in Automatic Speech Recognition (ASR) has produced large AI models, which are impractical for deployment on mobile devices. Model quantization is effective for producing compressed general-purpose models; however, such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalised mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
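
A minimal sketch of the personalization idea, assuming layer sensitivity is approximated by an activation statistic collected on a few unlabelled target-domain batches and bit-widths are then assigned greedily under a memory budget; the paper's exact sensitivity measure and allocation rule may differ:

```python
# Label-free, personalized mixed-precision quantization (illustrative heuristic).
import torch

@torch.no_grad()
def layer_sensitivities(model, calib_batches, layers):
    """Record an activation statistic per layer as a proxy for quantization sensitivity."""
    stats = {name: [] for name in layers}
    modules = dict(model.named_modules())
    hooks = [modules[name].register_forward_hook(
                 lambda m, i, o, n=name: stats[n].append(o.detach().abs().median()))
             for name in layers]
    for batch in calib_batches:                 # unlabelled audio from the target user/domain
        model(batch)
    for h in hooks:
        h.remove()
    return {n: torch.stack(v).mean().item() for n, v in stats.items()}

def assign_bitwidths(sensitivities, params_per_layer, budget_bits, choices=(2, 4, 6, 8)):
    """Start everything at the lowest precision, then upgrade the most sensitive
    layers while the total stays within the memory budget."""
    plan = {n: min(choices) for n in sensitivities}
    used = sum(params_per_layer[n] * plan[n] for n in plan)
    for name in sorted(sensitivities, key=sensitivities.get, reverse=True):
        for bits in sorted(choices, reverse=True):          # try highest precision first
            extra = params_per_layer[name] * (bits - plan[name])
            if bits > plan[name] and used + extra <= budget_bits:
                plan[name], used = bits, used + extra
                break
    return plan
```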

Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.12498
  • repo_url: https://github.com/WAPATASR/WAPAT
  • paper_authors: Gege Qi, Yuefeng Chen, Xiaofeng Mao, Xiaojun Jia, Ranjie Duan, Rong Zhang, Hui Xue
  • for: To improve the robustness of automatic speech recognition (ASR) models under small-volume perturbations and large domain shifts.
  • methods: WavAugment Guided Phoneme Adversarial Training (wapat) uses adversarial examples in phoneme space as augmentation, making the model invariant to minor fluctuations in the phoneme representation, and uses the phoneme representation of augmented samples to guide adversary generation toward more stable and diverse gradient directions, improving generalization (a simplified sketch follows this entry).
  • results: On the End-to-end Speech Challenge Benchmark (ESB), SpeechLM-wapat reduces WER by 6.28% relative to the original model, setting a new state of the art.
    Abstract Developing a practically-robust automatic speech recognition (ASR) system is challenging, since the model should not only maintain its original performance on clean samples but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in the phoneme representation and to preserve the performance on clean samples. In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of wapat on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat outperforms the original model by a 6.28% WER reduction on ESB, achieving the new state-of-the-art.
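
A simplified sketch of adversarial training in phoneme-embedding space; `model.phoneme_embed` and `model.loss_from_embed` are assumed hooks, and the WavAugment guidance is reduced here to an L2 pull toward the embedding of an augmented copy of the utterance, which only approximates the paper's procedure:

```python
# PGD-style adversarial step in phoneme space with augmentation guidance (sketch).
import torch

def phoneme_adversarial_step(model, audio, augmented_audio, targets,
                             eps=0.1, alpha=0.02, steps=3, guide_weight=1.0):
    with torch.no_grad():
        emb = model.phoneme_embed(audio)                # clean phoneme representation
        guide = model.phoneme_embed(augmented_audio)    # WavAugment-ed view used as guidance
    delta = torch.zeros_like(emb, requires_grad=True)
    for _ in range(steps):                              # inner maximization over the perturbation
        loss = model.loss_from_embed(emb + delta, targets) \
               - guide_weight * (emb + delta - guide).norm()   # stay close to the augmented view
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta.requires_grad_(True)
    # Outer minimization: train on both the clean and the adversarial phoneme representations.
    clean_loss = model.loss_from_embed(model.phoneme_embed(audio), targets)
    adv_loss = model.loss_from_embed(model.phoneme_embed(audio) + delta.detach(), targets)
    return clean_loss + adv_loss
```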

SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

  • paper_url: http://arxiv.org/abs/2307.12445
  • repo_url: None
  • paper_authors: Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote
  • for: This paper aims to learn shared representations of phonetic and acoustic spaces in the speech domain using a CLIP-based model.
  • methods: The proposed model is trained with a CLIP-style contrastive objective, originally used to learn shared latent spaces between images and text descriptions, here applied to the phonetic and acoustic spaces of speech (see the sketch after this entry).
  • results: The model is sensitive to phonetic changes and robust against different types of noise, with a 91% score drop when 20% of the phonemes are replaced at random and only a 10% performance drop when the audio is mixed with 75% Gaussian noise. The resulting embeddings also prove useful for downstream applications such as intelligibility evaluation and speech generation.
    Abstract Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim of learning shared representations of the phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when replacing 20% of the phonemes at random, while providing substantial robustness against different kinds of noise, with only a 10% performance drop when mixing the audio with 75% Gaussian noise. We also provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and the ability to leverage rich pre-trained phonetic embeddings in speech generation tasks. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.
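
A minimal sketch of a CLIP-style contrastive objective between a phoneme encoder and an acoustic encoder; the GRU encoders and dimensions are placeholders for whatever architectures the paper actually uses:

```python
# CLIP-style contrastive learning of shared phonetic/acoustic embeddings (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneticAcousticClip(nn.Module):
    def __init__(self, n_phonemes=70, n_mels=80, dim=256):
        super().__init__()
        self.phone_enc = nn.Sequential(nn.Embedding(n_phonemes, dim),
                                       nn.GRU(dim, dim, batch_first=True))
        self.audio_enc = nn.GRU(n_mels, dim, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))   # learned temperature

    def forward(self, phonemes, mels):                        # (B, Tp), (B, Ta, n_mels)
        _, hp = self.phone_enc(phonemes)                      # final hidden state as summary
        _, ha = self.audio_enc(mels)
        zp = F.normalize(hp[-1], dim=-1)                      # (B, dim) unit-norm embeddings
        za = F.normalize(ha[-1], dim=-1)
        return self.logit_scale.exp() * zp @ za.t()           # (B, B) similarity matrix

def clip_loss(logits):
    """Symmetric cross-entropy: matched phoneme/audio pairs sit on the diagonal."""
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# loss = clip_loss(PhoneticAcousticClip()(torch.randint(0, 70, (8, 50)), torch.randn(8, 200, 80)))
```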