cs.SD - 2023-07-05

Why can big.bi be changed to bi.gbi? A mathematical model of syllabification and articulatory synthesis

  • paper_url: http://arxiv.org/abs/2307.02299
  • repo_url: None
  • paper_authors: Frédéric Berthommier
  • for: This paper presents a simplified articulatory synthesis model comprising four stages, used to plan articulatory gestures and compute articulation dynamics.
  • methods: The model plans articulatory gestures with syllable graphs and coordination/selection operators, and performs synthesis with the VLAM model (a toy graph-to-trajectory sketch follows the abstract below).
  • results: The model describes consonant-vowel coarticulation and the articulation of consonant clusters, and treats verbal transformations as transitions of the syllable graph structure.
    Abstract A simplified model of articulatory synthesis involving four stages is presented. The planning of articulatory gestures is based on syllable graphs with arcs and nodes that are implemented in a complex representation. This was first motivated by a reduction in the many-to-one relationship between articulatory parameters and formant space. This allows for consistent trajectory planning and computation of articulation dynamics with coordination and selection operators. The flow of articulatory parameters is derived from these graphs with four equations. Many assertions of Articulatory Phonology have been abandoned. This framework is adapted to synthesis using VLAM (Maeda's model) and simulations are performed with syllables including main vowels and the plosives /b,d,g/ only. The model is able to describe consonant-vowel coarticulation and the articulation of consonant clusters, and verbal transformations are seen as transitions of the syllable graph structure.
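Purely as an illustration (the targets, graph encoding, and interpolation below are invented assumptions, not the paper's four equations or its complex-valued representation), a syllable graph can be read as a path through articulatory target nodes whose arcs are traversed to produce a smooth flow of articulatory parameters:

```python
import numpy as np

# Toy articulatory targets in a 2-D parameter space (invented values; the VLAM
# model actually uses seven Maeda parameters).
TARGETS = {
    "b": np.array([1.0, 0.0]),    # labial closure gesture
    "g": np.array([0.0, 1.0]),    # velar closure gesture
    "i": np.array([-0.5, -0.5]),  # vowel target
}

def syllable_graph(syllables):
    """Flatten a list of syllables (each a list of segments) into a node path."""
    return [seg for syl in syllables for seg in syl]

def trajectory(nodes, frames_per_arc=20):
    """Cosine interpolation between successive targets along the graph arcs."""
    pieces = []
    for a, b in zip(nodes[:-1], nodes[1:]):
        t = (1 - np.cos(np.linspace(0, np.pi, frames_per_arc))) / 2  # ramps 0 -> 1
        pieces.append((1 - t)[:, None] * TARGETS[a] + t[:, None] * TARGETS[b])
    return np.concatenate(pieces)

# /big.bi/ and its resyllabified form /bi.gbi/ flatten to the same segment path,
# which is why a graph-level description can relate the two forms.
print(trajectory(syllable_graph([["b", "i", "g"], ["b", "i"]])).shape)  # (80, 2)
print(trajectory(syllable_graph([["b", "i"], ["g", "b", "i"]])).shape)  # (80, 2)
```

In the paper itself, syllable boundaries and gesture coordination are encoded in the graph structure rather than discarded as in this flattening.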

Exploring Multimodal Approaches for Alzheimer’s Disease Detection Using Patient Speech Transcript and Audio Data

  • paper_url: http://arxiv.org/abs/2307.02514
  • repo_url: https://github.com/shui-dun/multimodal_ad
  • paper_authors: Hongmin Cai, Xiaoke Huang, Zhengliang Liu, Wenxiong Liao, Haixing Dai, Zihao Wu, Dajiang Zhu, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li
  • for: This study aims to detect Alzheimer's disease (AD) from patients' speech and transcript data, to support earlier diagnosis of the disease.
  • methods: Pre-trained language models and a Graph Neural Network (GNN) build a graph from the speech transcript and extract features from it for AD detection; data augmentation techniques such as synonym replacement and a GPT-based augmenter address the small dataset size (a text-audio fusion sketch follows the abstract below).
  • results: Experiments indicate that augmenting the speech and transcript data into a larger dataset improves AD-detection accuracy, and that converting transcripts back to audio for contrastive learning against the original audio can further help.
    Abstract Alzheimer's disease (AD) is a common form of dementia that severely impacts patient health. As AD impairs the patient's language understanding and expression ability, the speech of AD patients can serve as an indicator of this disease. This study investigates various methods for detecting AD using patients' speech and transcript data from the DementiaBank Pitt database. The proposed approach involves pre-trained language models and a Graph Neural Network (GNN) that constructs a graph from the speech transcript and extracts features using the GNN for AD detection. Data augmentation techniques, including synonym replacement and a GPT-based augmenter, were used to address the small dataset size. Audio data was also introduced, and the WavLM model was used to extract audio features. These features were then fused with text features using various methods. Finally, a contrastive learning approach was attempted by converting speech transcripts back to audio and using it for contrastive learning with the original audio. We conducted intensive experiments and analysis on the above methods. Our findings shed light on the challenges and potential solutions in AD detection using speech and audio data.
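As a hedged sketch of one possible fusion strategy (the encoder checkpoints, mean pooling, and concatenation below are assumptions, not necessarily the paper's exact configuration), transcript features from a pre-trained language model can be concatenated with WavLM audio features and passed to a small classifier:

```python
import torch
from transformers import AutoModel, AutoTokenizer, WavLMModel

# Hypothetical checkpoints used for illustration only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
audio_encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

def fuse(transcript: str, waveform: torch.Tensor) -> torch.Tensor:
    """Return a fused utterance-level feature vector (simple concatenation)."""
    tokens = tokenizer(transcript, return_tensors="pt", truncation=True)
    text_feat = text_encoder(**tokens).last_hidden_state.mean(dim=1)    # (1, 768)
    audio_feat = audio_encoder(waveform).last_hidden_state.mean(dim=1)  # (1, 768)
    return torch.cat([text_feat, audio_feat], dim=-1)                   # (1, 1536)

classifier = torch.nn.Linear(1536, 2)  # AD vs. healthy control

with torch.no_grad():
    feats = fuse("the boy is stealing a cookie", torch.randn(1, 16000))
    logits = classifier(feats)
```

The GNN over the transcript graph, the augmentation pipeline, and the contrastive branch described in the abstract would replace or extend the plain text encoder shown here.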

Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions

  • paper_url: http://arxiv.org/abs/2307.02244
  • repo_url: None
  • paper_authors: Sandipana Dowerah, Ajinkya Kulkarni, Romain Serizel, Denis Jouvet
  • for: Improving speaker verification performance under noisy and reverberant conditions.
  • methods: A multichannel diffusion probabilistic model (Diff-Filter) performs speech enhancement and is then jointly optimized with a pre-trained ECAPA-TDNN speaker verification model under a self-supervised learning framework (an EER computation sketch follows the abstract below).
  • results: Evaluated on the MultiSV multichannel speaker verification dataset, the approach shows significant improvements under noisy multichannel conditions.
    Abstract The paper introduces Diff-Filter, a multichannel speech enhancement approach based on the diffusion probabilistic model, for improving speaker verification performance under noisy and reverberant conditions. It also presents a new two-step training procedure that takes advantage of self-supervised learning. In the first stage, the Diff-Filter is trained by conducting time-domain speech filtering using a scoring-based diffusion model. In the second stage, the Diff-Filter is jointly optimized with a pre-trained ECAPA-TDNN speaker verification model under a self-supervised learning framework. We present a novel loss based on equal error rate. This loss is used to conduct self-supervised learning on a dataset that is not labelled in terms of speakers. The proposed approach is evaluated on MultiSV, a multichannel speaker verification dataset, and shows significant improvements in performance under noisy multichannel conditions.
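The EER-based loss is not spelled out in the abstract; as a hedged reference point, the sketch below computes a standard equal error rate from speaker-verification trial scores (the training objective would presumably rely on a differentiable surrogate of this quantity):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for same-speaker trials, 0 for different-speaker trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage with cosine-similarity scores between ECAPA-TDNN-style embeddings.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.1, 500), rng.normal(0.3, 0.1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER = {equal_error_rate(scores, labels):.3f}")
```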

LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

  • paper_url: http://arxiv.org/abs/2307.02146
  • repo_url: None
  • paper_authors: Longshen Ou, Xichu Ma, Ye Wang
  • for: bridges the singability gap between generated lyrics and melodies, improving the compatibility of the outputs with the melody.
  • methods: jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L), with a new objective informed by musicological research on the relationship between melody and lyrics (a toy format-check sketch follows the abstract below).
  • results: achieves 3.75% and 21.44% absolute accuracy gains in the outputs’ number-of-line and syllable-per-line requirements, and demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model.
    Abstract Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L). After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.
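As a toy illustration of the formatting side only (the syllable heuristic and metric below are assumptions, not LOAF-M2L's objective or evaluation code), generated lyrics can be checked against melody-derived line-count and syllable-per-line requirements:

```python
import re

def count_syllables(word: str) -> int:
    """Crude English syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def format_check(lyrics: str, syllables_per_line: list[int]) -> dict:
    lines = [l for l in lyrics.strip().splitlines() if l.strip()]
    per_line = [
        sum(count_syllables(w) for w in line.split()) == target
        for line, target in zip(lines, syllables_per_line)
    ]
    return {
        "line_count_ok": len(lines) == len(syllables_per_line),
        "syllable_line_accuracy": sum(per_line) / max(1, len(per_line)),
    }

# The vowel-group heuristic is rough (it over-counts words such as "are").
print(format_check("twinkle twinkle little star\nhow I wonder what you are", [7, 7]))
```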

Going Retro: Astonishingly Simple Yet Effective Rule-based Prosody Modelling for Speech Synthesis Simulating Emotion Dimensions

  • paper_url: http://arxiv.org/abs/2307.02132
  • repo_url: https://github.com/felixbur/syntact
  • paper_authors: Felix Burkhardt, Uwe Reichel, Florian Eyben, Björn Schuller
  • for: Studying how rule-based modification of synthesis prosody can simulate the expression of emotion dimensions.
  • methods: Two rule-based models control the prosody of speech synthesis through the Speech Synthesis Markup Language (SSML), so the approach works with any commercial synthesizer (an SSML-building sketch follows the abstract below).
  • results: With a very simple method, both the arousal (.76 UAR) and valence (.43 UAR) dimensions can be simulated.
    Abstract We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotations. Results indicate that with a very simple method both dimensions arousal (.76 UAR) and valence (.43 UAR) can be simulated.
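A hedged sketch of the general idea (the rate/pitch/volume mappings below are invented for illustration, not the paper's two rule sets): emotion dimensions are turned into SSML <prosody> attributes that any SSML-capable commercial synthesizer can render:

```python
def emotion_to_ssml(text: str, arousal: float, valence: float) -> str:
    """arousal and valence in [-1, 1]; returns an SSML snippet."""
    rate = f"{arousal * 30:+.0f}%"                           # faster speech when aroused
    pitch = f"{(0.7 * arousal + 0.3 * valence) * 20:+.0f}%"  # higher pitch when aroused/positive
    volume = f"{arousal * 6:+.1f}dB"                         # louder speech when aroused
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}" '
            f'volume="{volume}">{text}</prosody></speak>')

print(emotion_to_ssml("How are you today?", arousal=0.8, valence=-0.2))
```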

A Database with Directivities of Musical Instruments

  • paper_url: http://arxiv.org/abs/2307.02110
  • repo_url: None
  • paper_authors: David Ackermann, Fabian Brinkmann, Stefan Weinzierl
  • for: This paper provides recordings and radiation patterns of single notes for 41 modern and historical musical instruments, together with one-third-octave-band averaged directivities for each instrument, suitable for acoustic simulation and auralisation.
  • methods: The instruments were measured with a 32-channel spherical microphone array under anechoic conditions; band-averaged directivities were computed per instrument, spatially upsampled with spherical spline interpolation, and converted to the OpenDAFF and GLL formats for use in room-acoustic and electro-acoustic simulation software (a band-averaging sketch follows the abstract below).
  • results: The database contains the recordings and radiation patterns of all 41 instruments, including their band-averaged directivities across frequency, and is suitable for acoustic simulation and realistic auralisation.
    Abstract We present a database of recordings and radiation patterns of individual notes for 41 modern and historical musical instruments, measured with a 32-channel spherical microphone array in anechoic conditions. In addition, directivities averaged in one-third octave bands have been calculated for each instrument, which are suitable for use in acoustic simulation and auralisation. The data are provided in SOFA format. Spatial upsampling of the directivities was performed based on spherical spline interpolation and converted to OpenDAFF and GLL format for use in room acoustic and electro-acoustic simulation software. For this purpose, a method is presented for referencing these directivities to a specific microphone position in order to achieve a physically correct auralisation without colouration. The data is available under the CC BY-SA 4.0 licence.
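As a hedged sketch of the kind of post-processing involved (band definitions, averaging rule, and array sizes are illustrative assumptions, not the database's actual pipeline), magnitude spectra measured at the 32 array channels can be reduced to one-third-octave-band directivity values:

```python
import numpy as np

def third_octave_bands(f_min=100.0, f_max=16000.0):
    """One-third octave centre frequencies (base-2 spacing) and band edges."""
    centres = []
    f = f_min
    while f <= f_max:
        centres.append(f)
        f *= 2 ** (1 / 3)
    centres = np.array(centres)
    return centres, centres * 2 ** (-1 / 6), centres * 2 ** (1 / 6)

def band_average(freqs, magnitudes, lower, upper):
    """magnitudes: (n_channels, n_freqs) linear magnitude spectra."""
    out = np.zeros((magnitudes.shape[0], len(lower)))
    for b, (lo, hi) in enumerate(zip(lower, upper)):
        mask = (freqs >= lo) & (freqs < hi)
        out[:, b] = np.sqrt(np.mean(magnitudes[:, mask] ** 2, axis=1))  # power average
    return out

freqs = np.linspace(0, 24000, 2049)                # toy frequency axis in Hz
spectra = np.abs(np.random.randn(32, freqs.size))  # toy 32-channel magnitudes
centres, lower, upper = third_octave_bands()
directivity = band_average(freqs, spectra, lower, upper)
print(directivity.shape)  # (32 channels, number of one-third octave bands)
```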

Flowchase: a Mobile Application for Pronunciation Training

  • paper_url: http://arxiv.org/abs/2307.02051
  • repo_url: None
  • paper_authors: Noé Tits, Zoé Broisson
  • for: Providing personalized, instant feedback to English learners.
  • methods: A mobile application, Flowchase, is connected to speech technology that segments and analyses segmental and supra-segmental features; the speech-processing pipeline receives linguistic information together with a speech sample and performs joint forced alignment and phonetic recognition with machine learning models based on speech representation learning (a toy phone-comparison sketch follows the abstract below).
  • results: The combination of models provides the information needed to design feedback on a series of segmental and supra-segmental pronunciation aspects.
    Abstract In this paper, we present a solution for providing personalized and instant feedback to English learners through a mobile application, called Flowchase, that is connected to a speech technology able to segment and analyze speech segmental and supra-segmental features. The speech processing pipeline receives linguistic information corresponding to an utterance to analyze along with a speech sample. After validation of the speech sample, a joint forced-alignment and phonetic recognition is performed thanks to a combination of machine learning models based on speech representation learning that provides the necessary information for designing feedback on a series of segmental and supra-segmental pronunciation aspects.
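Purely as a hypothetical illustration of the feedback-design step (the abstract does not describe Flowchase's models in reproducible detail), a recognized phone sequence can be aligned against the expected phones to flag segmental errors:

```python
from difflib import SequenceMatcher

def segmental_feedback(expected: list[str], recognized: list[str]) -> list[str]:
    """Align expected vs. recognized phones and describe the mismatches."""
    feedback = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=expected, b=recognized).get_opcodes():
        if tag == "replace":
            feedback.append(f"said {' '.join(recognized[j1:j2])} "
                            f"instead of {' '.join(expected[i1:i2])}")
        elif tag == "delete":
            feedback.append(f"missing {' '.join(expected[i1:i2])}")
        elif tag == "insert":
            feedback.append(f"extra {' '.join(recognized[j1:j2])}")
    return feedback

# 'three' /θ r iː/ pronounced as 'tree' /t r iː/ flags the θ -> t substitution.
print(segmental_feedback(["θ", "r", "iː"], ["t", "r", "iː"]))
```

Supra-segmental feedback (stress, rhythm, intonation) would additionally need the timing information produced by the forced-alignment step.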