cs.SD - 2023-08-16

Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

  • paper_url: http://arxiv.org/abs/2308.08442
  • repo_url: None
  • paper_authors: Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo
  • for: This work aims to improve the performance of sentence-level and paragraph-level G2P transduction.
  • methods: A tokenizer-free byte-level model based on T5 (ByT5) is used, and a loss-based sampling method is proposed to mitigate exposure bias (a minimal sketch follows the abstract below).
  • results: Experiments show that the proposed loss-based sampling method improves both sentence-level and paragraph-level G2P performance.
    Abstract Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications as it is better suited to perform on heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Since ByT5 operates on the character level, it requires longer decoding steps, which deteriorates the performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
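The loss-based sampling idea can be pictured as a scheduled-sampling variant in which ground-truth decoder inputs are swapped for the model's own predictions at the positions where the per-token loss is highest. Below is a minimal PyTorch sketch of that idea; the top-k selection rule, the `sample_rate` parameter, and the toy byte-level vocabulary are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mix_decoder_inputs(logits, targets, sample_rate=0.25):
    """Replace the highest-loss ground-truth tokens with model predictions."""
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    per_token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")   # (batch, seq_len)
    predictions = logits.argmax(dim=-1)                       # (batch, seq_len)
    k = max(1, int(sample_rate * targets.size(1)))
    _, worst = per_token_loss.topk(k, dim=1)                  # highest-loss positions
    mixed = targets.clone()
    mixed.scatter_(1, worst, predictions.gather(1, worst))
    return mixed  # fed back as the decoder input during training

# Toy usage with a byte-level vocabulary of 256 symbols, as in ByT5.
logits = torch.randn(2, 10, 256)
targets = torch.randint(0, 256, (2, 10))
decoder_inputs = mix_decoder_inputs(logits, targets)
```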

Classifying Dementia in the Presence of Depression: A Cross-Corpus Study

  • paper_url: http://arxiv.org/abs/2308.08306
  • repo_url: None
  • paper_authors: Franziska Braun, Sebastian P. Bayerl, Paula A. Pérez-Toro, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer
  • for: This work targets automated screening for cognitive impairments such as dementia and mild cognitive impairment, with the goal of reducing costs to healthcare systems and improving patients' quality of life.
  • methods: Established baseline systems are applied to a three-class classification problem (HC vs. MCI vs. DEM) using text, audio, and emotion embeddings extracted from recordings of the semantic Verbal Fluency Test and the Boston Naming Test (see the sketch after the abstract).
  • results: Cross-corpus and mixed-corpus experiments on two independently recorded German datasets investigate generalization to larger populations and different recording conditions; a detailed error analysis treating depression as a secondary diagnosis examines what the classifiers actually learn.
    Abstract Automated dementia screening enables early detection and intervention, reducing costs to healthcare systems and increasing quality of life for those affected. Depression has shared symptoms with dementia, adding complexity to diagnoses. The research focus so far has been on binary classification of dementia (DEM) and healthy controls (HC) using speech from picture description tests from a single dataset. In this work, we apply established baseline systems to discriminate cognitive impairment in speech from the semantic Verbal Fluency Test and the Boston Naming Test using text, audio and emotion embeddings in a 3-class classification problem (HC vs. MCI vs. DEM). We perform cross-corpus and mixed-corpus experiments on two independently recorded German datasets to investigate generalization to larger populations and different recording conditions. In a detailed error analysis, we look at depression as a secondary diagnosis to understand what our classifiers actually learn.
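As a rough illustration of the cross-corpus protocol, the sketch below concatenates text, audio, and emotion embeddings per recording, trains a simple classifier on one corpus, and evaluates it on the other. The embedding dimensions, the synthetic data, and the SVM baseline are assumptions; the paper uses its own established baseline systems rather than this particular classifier.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fake_corpus(n):
    """Stand-in for per-recording features: text + audio + emotion embeddings."""
    X = np.hstack([rng.normal(size=(n, 768)),    # text embedding (assumed dim)
                   rng.normal(size=(n, 512)),    # audio embedding (assumed dim)
                   rng.normal(size=(n, 128))])   # emotion embedding (assumed dim)
    y = rng.integers(0, 3, size=n)               # 0 = HC, 1 = MCI, 2 = DEM
    return X, y

X_a, y_a = fake_corpus(120)   # corpus A: training
X_b, y_b = fake_corpus(80)    # corpus B: cross-corpus test

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
clf.fit(X_a, y_a)
print(classification_report(y_b, clf.predict(X_b), target_names=["HC", "MCI", "DEM"]))
```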

ChinaTelecom System Description to VoxCeleb Speaker Recognition Challenge 2023

  • paper_url: http://arxiv.org/abs/2308.08181
  • repo_url: None
  • paper_authors: Mengjie Du, Xiang Fang, Jie Li
  • for: This technical report describes the ChinaTelecom system for Track 1 (closed) of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023).
  • methods: The system consists of several ResNet variants trained only on VoxCeleb2, which are then fused for better performance; score calibration is applied to each variant and to the fused system (a fusion-and-calibration sketch follows the abstract).
  • results: The final submission achieved a minDCF of 0.1066 and an EER of 1.980%.
    Abstract This technical report describes ChinaTelecom system for Track 1 (closed) of the VoxCeleb2023 Speaker Recognition Challenge (VoxSRC 2023). Our system consists of several ResNet variants trained only on VoxCeleb2, which were fused for better performance later. Score calibration was also applied for each variant and the fused system. The final submission achieved minDCF of 0.1066 and EER of 1.980%.
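Score fusion and calibration of the kind described above can be sketched as follows. Equal-weight averaging of subsystem scores and logistic-regression calibration on a held-out split are assumptions, since the report does not spell out the exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials = 1000
labels = rng.integers(0, 2, size=n_trials)     # 1 = same-speaker trial
# Scores from three hypothetical ResNet variants on the same trial list.
scores = np.stack([rng.normal(loc=labels * 2.0, scale=1.0) for _ in range(3)], axis=1)

fused = scores.mean(axis=1, keepdims=True)     # equal-weight score fusion

# Calibrate the fused scores on a held-out development split.
dev, test = slice(0, 500), slice(500, None)
calibrator = LogisticRegression().fit(fused[dev], labels[dev])
calibrated = calibrator.decision_function(fused[test])   # calibrated trial scores
```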

AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis

  • paper_url: http://arxiv.org/abs/2308.08577
  • repo_url: None
  • paper_authors: Hrishikesh Viswanath, Aneesh Bhattacharya, Pascal Jutras-Dubé, Prerit Gupta, Mridu Prashanth, Yashvardhan Khaitan, Aniket Bera
  • for: This work develops a language-agnostic emotion translation model that controls the emotion expressed in generated speech while preserving each speaker's identity, style, and emotional cadence.
  • methods: The model uses a Vector Quantized codebook to represent emotions in a quantized space with five levels of affect intensity, capturing complex nuances and subtle differences within the same emotion. The quantized embeddings are derived implicitly from spoken samples, removing the need for one-hot vectors or explicit strength embeddings (a codebook sketch follows the abstract).
  • results: Experiments show that the model controls the emotion of generated speech while preserving each speaker's identity, style, and emotional cadence. Language-independent emotion modeling is demonstrated through an emotion transfer task on a bilingual (English and Chinese) corpus, achieving state-of-the-art results on both qualitative and quantitative metrics.
    Abstract Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. While existing text-to-speech (TTS) and speech-to-speech systems rely on strength embedding vectors and global style tokens to capture emotions, these models represent emotions as a component of style or represent them in discrete categories. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space featuring five levels of affect intensity to capture complex nuances and subtle differences in the same emotion. The quantized emotional embeddings are implicitly derived from spoken speech samples, eliminating the need for one-hot vectors or explicit strength embeddings. Experimental results demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker. We showcase the language-independent emotion modeling capability of the quantized emotional embeddings learned from a bilingual (English and Chinese) speech corpus with an emotion transfer task from a reference speech to a target speech. We achieve state-of-art results on both qualitative and quantitative metrics.
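The quantized emotion space can be illustrated with a small vector-quantization layer whose codebook holds one code per (emotion, intensity-level) pair. The codebook sizes, the nearest-neighbour lookup, and the straight-through gradient trick below are illustrative assumptions, not AffectEcho's exact architecture.

```python
import torch
import torch.nn as nn

class EmotionVQ(nn.Module):
    def __init__(self, n_emotions=8, n_levels=5, dim=256):
        super().__init__()
        # One code per (emotion, intensity level) pair.
        self.codebook = nn.Embedding(n_emotions * n_levels, dim)

    def forward(self, z):                            # z: (batch, dim) continuous feature
        dist = torch.cdist(z, self.codebook.weight)  # distance to every code
        idx = dist.argmin(dim=-1)                    # nearest code per example
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx

vq = EmotionVQ()
z = torch.randn(4, 256)
z_q, idx = vq(z)   # quantized emotion embedding and its codebook index
```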

SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation

  • paper_url: http://arxiv.org/abs/2308.08143
  • repo_url: None
  • paper_authors: Kai Li, Runxuan Yang, Xiaolin Hu
  • for: This paper studies audio-visual speech separation and proposes an attention-based audio-visual feature fusion model (SCANet) to improve fusion throughout the network.
  • methods: SCANet contains two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, with CA blocks placed at the top (TCA), middle (MCA), and bottom (BCA) of the network. These blocks preserve modality-specific features while extracting different semantics from the audio-visual features (a cross-attention sketch follows the abstract).
  • results: Extensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) show that SCANet outperforms existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
    Abstract The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion architectures situated either at the top or bottom positions, rather than comprehensively considering multi-modal fusion at various hierarchical positions within the network. In this paper, we propose a novel model called self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, outperforming existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
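A single fusion block of the kind SCANet distributes across the network might look like the sketch below, where audio features first self-attend and then cross-attend to visual features. The dimensions, the one-directional fusion, and the layer layout are assumptions rather than the published design.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # Self-attention keeps modality-specific structure in the audio stream.
        a, _ = self.self_attn(audio, audio, audio)
        audio = self.norm1(audio + a)
        # Cross-attention injects visual context into the audio features.
        a, _ = self.cross_attn(audio, visual, visual)
        return self.norm2(audio + a)

block = CrossAttentionBlock()
audio = torch.randn(2, 100, 256)    # (batch, audio frames, dim)
visual = torch.randn(2, 25, 256)    # (batch, video frames, dim)
fused = block(audio, visual)
```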

Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals

  • paper_url: http://arxiv.org/abs/2308.08125
  • repo_url: None
  • paper_authors: Running Zhao, Jiangtao Yu, Hang Zhao, Edith C. H. Ngai
  • for: This paper proposes a mmWave-based system for streaming automatic speech recognition (ASR) with a large vocabulary size.
  • methods: The proposed system, called Radio2Text, uses a tailored streaming Transformer that learns speech-related features effectively, and a cross-modal structure based on knowledge distillation to mitigate the negative effect of low-quality mmWave signals (a KD-loss sketch follows the abstract).
  • results: The experimental results show that Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
    Abstract Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose the Guidance Initialization that facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.
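The cross-modal knowledge-distillation objective can be sketched as a weighted sum of a supervised loss on the radio branch, a feature-matching term, and a softened-logit (response) term from the audio teacher. The loss weights, the MSE/KL choices, and the plain cross-entropy stand-in for the ASR loss are assumptions, not the paper's exact training objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_feats, teacher_feats,
                        student_logits, teacher_logits,
                        targets, temperature=2.0, alpha=0.5, beta=0.5):
    # Supervised loss on the radio (student) branch; plain cross-entropy here
    # as a stand-in for the actual ASR objective.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    # Feature guidance: match intermediate representations of the audio teacher.
    feat = F.mse_loss(student_feats, teacher_feats.detach())
    # Response guidance: match softened output distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits.detach() / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return ce + alpha * feat + beta * kd

# Toy shapes: (batch, frames, dim) features and (batch, frames, vocab) logits.
B, T, D, V = 2, 50, 256, 1000
loss = cross_modal_kd_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                           torch.randn(B, T, V), torch.randn(B, T, V),
                           torch.randint(0, V, (B, T)))
```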

End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

  • paper_url: http://arxiv.org/abs/2308.08027
  • repo_url: None
  • paper_authors: Bolaji Yusuf, Jan Cernocky, Murat Saraclar
  • for: To improve the efficiency of keyword search systems and simplify the search pipeline without relying on automatic speech recognition (ASR).
  • methods: Queries and documents are encoded with a pair of recurrent neural network encoders, and the encodings are combined with a dot product for search; this work extends the model with multilingual pretraining (a minimal encoder sketch follows the abstract).
  • results: Multilingual training significantly improves performance, and the model outperforms an ASR-based system on long queries and on queries that do not appear in the training data.
    Abstract Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.
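The encoder-pair-plus-dot-product search can be sketched as follows. The GRU encoders, the mean pooling, and the detection threshold are assumptions; only the overall query-document dot-product scoring follows the description above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim, hid=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, in_dim)
        out, _ = self.rnn(x)
        return out.mean(dim=1)                  # pool to a fixed-size vector

query_enc = Encoder(in_dim=64)                  # e.g. grapheme/phone embeddings
doc_enc = Encoder(in_dim=80)                    # e.g. filterbank frames

queries = torch.randn(5, 12, 64)                # 5 written queries
docs = torch.randn(3, 400, 80)                  # 3 spoken documents
scores = query_enc(queries) @ doc_enc(docs).T   # (5 queries x 3 documents)
detections = torch.sigmoid(scores) > 0.5        # hypothetical decision threshold
```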