cs.SD - 2023-10-09

JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

  • paper_url: http://arxiv.org/abs/2310.06072
  • repo_url: None
  • paper_authors: Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari
  • for: The paper is written for researchers and developers working on emotional speech synthesis and related areas, as well as those interested in exploring the use of large language models for script generation.
  • methods: The paper proposes an automatic script generation method that uses a large language model (ChatGPT) and prompt engineering to produce emotional scripts with nonverbal vocalizations (NVs).
  • results: The paper demonstrates the effectiveness of the proposed method by showing that the generated scripts have better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora, and also highlights the challenges of synthesizing emotional speech with NVs.
    Abstract We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to produce emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that JVNV has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that there still exists a gap between the performance of synthesizing read-aloud speech and emotional speech, and adding NVs in the speech makes the task even harder, which brings new challenges for this task and makes JVNV a valuable resource for relevant works in the future. To our best knowledge, JVNV is the first speech corpus that generates scripts automatically using large language models.
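The script selection step mentioned in the abstract (choosing scripts with balanced phoneme coverage) can be illustrated with a greedy sketch. This is not the authors' procedure: the function name `select_scripts`, the use of characters as a stand-in for phonemes, and the rarity-weighted gain are all assumptions made for illustration, and the emotion confidence and fluency scores from the paper are left out.

```python
from collections import Counter

def select_scripts(candidates, k, to_units=list):
    """Greedily pick k scripts, each time favoring rarely seen units.

    `to_units` maps a script to its unit sequence. Characters stand in
    for phonemes here (an assumption; a real pipeline would use a
    grapheme-to-phoneme tool).
    """
    counts = Counter()          # how often each unit has been selected so far
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def gain(script):
            # Units seen less often so far contribute more to the gain.
            return sum(1.0 / (1 + counts[u]) for u in set(to_units(script)))
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
        counts.update(to_units(best))
    return chosen

picked = select_scripts(["aa", "ab", "cd"], k=2)
```

With these toy inputs the second pick favors the script that introduces unseen units, which is the balancing behavior the abstract describes at a high level.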

Audio compression-assisted feature extraction for voice replay attack detection

  • paper_url: http://arxiv.org/abs/2310.05813
  • repo_url: None
  • paper_authors: Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu
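  • for: The study proposes a feature extraction approach for detecting voice replay attacks.
  • methods: The approach uses audio compression for assistance: the information lost after decompression is expected to carry content- and speaker-independent cues, such as channel noise added during replay.
  • results: In experiments with several data augmentation techniques and three classifiers, the approach reaches an EER of 22.71% on the ASVspoof 2021 physical access (PA) evaluation set, which the authors report as the lowest to their knowledge.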
    Abstract Replay attack is one of the most effective and simplest voice spoofing attacks. Detecting replay attacks is challenging, according to the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021), because they involve a loudspeaker, a microphone, and acoustic conditions (e.g., background noise). One obstacle to detecting replay attacks is finding robust feature representations that reflect the channel noise information added to the replayed speech. This study proposes a feature extraction approach that uses audio compression for assistance. Audio compression compresses audio to preserve content and speaker information for transmission. The missed information after decompression is expected to contain content- and speaker-independent information (e.g., channel noise added during the replay process). We conducted a comprehensive experiment with a few data augmentation techniques and 3 classifiers on the ASVspoof 2021 physical access (PA) set and confirmed the effectiveness of the proposed feature extraction approach. To the best of our knowledge, the proposed approach achieves the lowest EER at 22.71% on the ASVspoof 2021 PA evaluation set.
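The idea of using the information lost after decompression as a feature can be sketched as follows. The abstract does not say which codec is used, so this example substitutes a simple mu-law quantizer as a stand-in lossy codec; `mu_law_roundtrip` and `compression_residual` are hypothetical names, and the input is assumed to be a waveform scaled to [-1, 1].

```python
import numpy as np

def mu_law_roundtrip(x, mu=255, levels=256):
    """Stand-in lossy codec (assumption): mu-law compand, quantize, expand."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)       # compress
    q = np.round((y + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1  # quantize
    return np.sign(q) * np.expm1(np.abs(q) * np.log1p(mu)) / mu     # expand

def compression_residual(x):
    """Information lost by the codec: original minus decoded signal."""
    return x - mu_law_roundtrip(x)

x = np.linspace(-1.0, 1.0, 101)
residual = compression_residual(x)
```

In a detection pipeline, a residual of this kind (or a spectral representation of it) would be passed to the classifier in place of, or alongside, features of the original waveform.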

Technocratic model of the human auditory system

  • paper_url: http://arxiv.org/abs/2310.05639
  • repo_url: None
  • paper_authors: M. V. Semotiuk, A. V. Palagin
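  • for: The study investigates transverse resonance and transverse standing waves within the cochlea of living organisms.
  • methods: The authors take a technocratic approach, relating the cochlear shape (a conical acoustic tube coiled into a spiral) and the non-uniformities on its internal surface to the physical processes of the auditory system.
  • results: The cochlear shape and its surface non-uniformities are identified as the predisposing factors for transverse resonance and standing waves, and the scala media is concluded to function primarily as an information collection and amplification system along the cochlear spiral.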
    Abstract In this work, we investigate the phenomenon of transverse resonance and transverse standing waves that occur within the cochlea of living organisms. It is demonstrated that the predisposing factor for their occurrence is the cochlear shape, which resembles a conical acoustic tube coiled into a spiral and exhibits non-uniformities on its internal surface. This cochlear structure facilitates the analysis of constituent sound signals akin to a spectrum analyzer, with a corresponding interpretation of the physical processes occurring in the auditory system. Additionally, we conclude that the cochlear duct's scala media, composed of a system of membranes and the organ of Corti, functions primarily as an information collection and amplification system along the cochlear spiral. Collectively, these findings enable the development of a novel, highly realistic wave model of the auditory system in living organisms based on a technocratic approach within the scientific context.

Super Denoise Net: Speech Super Resolution with Noise Cancellation in Low Sampling Rate Noisy Environments

  • paper_url: http://arxiv.org/abs/2310.05629
  • repo_url: None
  • paper_authors: Junkang Yang, Hongqing Liu, Lu Gan, Yi Zhou
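  • for: The work targets joint speech super-resolution and noise reduction, so that the model remains effective in real-world conditions where noise is unavoidable.
  • methods: The paper proposes Super Denoise Net (SDNet), a neural network that uses gated convolution blocks to enhance repair capability and lattice convolution blocks to capture information along the time-frequency axis.
  • results: On the DNS 2020 no-reverb test set, SDNet outperforms baseline speech denoising and SSR models, with higher objective and subjective scores.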
    Abstract Speech super-resolution (SSR) aims to predict a high resolution (HR) speech signal from its low resolution (LR) corresponding part. Most neural SSR models focus on producing the final result in a noise-free environment by recovering the spectrogram of high-frequency part of the signal and concatenating it with the original low-frequency part. Although these methods achieve high accuracy, they become less effective when facing the real-world scenario, where unavoidable noise is present. To address this problem, we propose a Super Denoise Net (SDNet), a neural network for a joint task of super-resolution and noise reduction from a low sampling rate signal. To that end, we design gated convolution and lattice convolution blocks to enhance the repair capability and capture information in the time-frequency axis, respectively. The experiments show our method outperforms baseline speech denoising and SSR models on DNS 2020 no-reverb test set with higher objective and subjective scores.
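The gated convolution mentioned in the abstract can be illustrated with a generic one-dimensional formulation: a feature branch multiplied element-wise by a sigmoid gate computed from the same input. This is a minimal sketch of the general technique, not SDNet's actual block; the function name `gated_conv1d` and the kernels are assumptions, and the lattice convolution block is not covered because the abstract gives no detail on it.

```python
import numpy as np

def gated_conv1d(x, w_feat, w_gate):
    """Generic 1-D gated convolution: features modulated by a sigmoid gate.

    Note: np.convolve flips the kernel (true convolution), whereas most
    deep learning frameworks compute cross-correlation.
    """
    feat = np.convolve(x, w_feat, mode="same")                          # feature branch
    gate = 1.0 / (1.0 + np.exp(-np.convolve(x, w_gate, mode="same")))   # values in (0, 1)
    return feat * gate

x = np.ones(8)
out = gated_conv1d(x, w_feat=np.array([1.0]), w_gate=np.array([0.0]))
```

With a zero gate kernel the gate is 0.5 everywhere, so the output is half of the feature branch; learned gate kernels would instead let the network suppress or pass individual positions.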

Tech. Report: Genuinization of Speech waveform PMF for speaker detection spoofing and countermeasures

  • paper_url: http://arxiv.org/abs/2310.05534
  • repo_url: None
  • paper_authors: Itshak Lapidot, Jean-Francois Bonastre
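  • for: The report addresses spoofing attacks on speaker recognition systems and their countermeasures.
  • methods: The authors propose an algorithm, called genuinization, that reduces the gap between the waveform distributions of genuine speech and spoofed speech.
  • results: On the ASVspoof 2019 datasets, applying genuinization to the spoofing attacks degrades spoofing detection performance by up to a factor of 10, while integrating genuinization into the countermeasures yields a large improvement in detection across different cases.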
    Abstract In the context of spoofing attacks in speaker recognition systems, we observed that the waveform probability mass function (PMF) of genuine speech differs significantly from the PMF of speech resulting from the attacks. This is true for synthesized or converted speech as well as replayed speech. We also noticed that this observation seems to have a significant impact on spoofing detection performance. In this article, we propose an algorithm, denoted genuinization, capable of reducing the waveform distribution gap between authentic speech and spoofing speech. Our genuinization algorithm is evaluated on ASVspoof 2019 challenge datasets, using the baseline system provided by the challenge organization. We first assess the influence of genuinization on spoofing performance. Using genuinization for the spoofing attacks degrades spoofing detection performance by up to a factor of 10. Next, we integrate the genuinization algorithm in the spoofing countermeasures and we observe a huge spoofing detection improvement in different cases. The results of our experiments show clearly that waveform distribution plays an important role and must be taken into account by anti-spoofing systems.
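The waveform probability mass function (PMF) discussed in the abstract can be estimated by counting how often each integer amplitude level occurs in a PCM signal. The sketch below only shows this estimation step and one possible way to quantify the gap between two PMFs; it does not implement genuinization itself. The names `waveform_pmf` and `pmf_gap`, and the choice of total variation distance, are assumptions for illustration.

```python
import numpy as np

def waveform_pmf(samples, bits=16):
    """Empirical PMF over all integer amplitude levels of a PCM signal."""
    lowest = -(1 << (bits - 1))          # e.g. -32768 for 16-bit audio
    n_levels = 1 << bits
    shifted = np.asarray(samples, dtype=np.int64) - lowest
    counts = np.bincount(shifted, minlength=n_levels)
    return counts / counts.sum()

def pmf_gap(p, q):
    """Total variation distance between two PMFs (0 means identical)."""
    return 0.5 * np.abs(p - q).sum()

# Toy 4-bit example: levels range from -8 to 7.
p = waveform_pmf([-8, 0, 0, 7], bits=4)
q = waveform_pmf([0, 0, 0, 0], bits=4)
gap = pmf_gap(p, q)
```

Comparing the PMF of genuine recordings with that of synthesized, converted, or replayed speech in this way gives a simple numerical view of the distribution gap the abstract refers to.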

AdvSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification

  • paper_url: http://arxiv.org/abs/2310.05369
  • repo_url: None
  • paper_authors: Li Wang, Jiaqi Li, Yuhao Luo, Jiahao Zheng, Lei Wang, Hao Li, Ke Xu, Chengfang Fang, Jie Shi, Zhizheng Wu
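  • for: The study provides an open-source adversarial attack dataset for speaker verification, addressing the lack of a standard dataset for reproducible research.
  • methods: The AdvSV dataset is built on the Voxceleb1 verification test set: representative ASV models are subjected to adversarial attacks, and the adversarial samples are recorded to simulate over-the-air attack settings.
  • results: The dataset covers over-the-air attacks as an initial step, can be extended to more types of adversarial attacks, and will be released under the CC-BY license together with a detection baseline.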
    Abstract It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, we developed an open-source adversarial attack dataset for speaker verification research. As an initial step, we focused on the over-the-air attack. An over-the-air adversarial attack involves a perturbation generation algorithm, a loudspeaker, a microphone, and an acoustic environment. The variations in the recording configurations make it very challenging to reproduce previous research. The AdvSV dataset is constructed using the Voxceleb1 Verification test set as its foundation. This dataset employs representative ASV models subjected to adversarial attacks and records adversarial samples to simulate over-the-air attack settings. The scope of the dataset can be easily extended to include more types of adversarial attacks. The dataset will be released to the public under the CC-BY license. In addition, we also provide a detection baseline for reproducible research.

An Initial Investigation of Neural Replay Simulator for Over-the-Air Adversarial Perturbations to Automatic Speaker Verification

  • paper_url: http://arxiv.org/abs/2310.05354
  • repo_url: None
  • paper_authors: Jiaqi Li, Li Wang, Liumeng Xue, Lei Wang, Zhizheng Wu
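  • for: The study examines over-the-air adversarial attacks on automatic speaker verification (ASV) in physical access scenarios, where a replay process is involved.
  • methods: A neural waveform synthesizer is used as a neural replay simulator to model the replay process while the adversarial perturbations are being estimated.
  • results: Experiments on the ASVspoof2019 dataset show that the neural replay simulator considerably increases the success rate of over-the-air adversarial attacks, which raises concerns for speaker verification in physical access applications.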
    Abstract Deep Learning has advanced Automatic Speaker Verification (ASV) in the past few years. Although it is known that deep learning-based ASV systems are vulnerable to adversarial examples in digital access, there are few studies on adversarial attacks in the context of physical access, where a replay process (i.e., over the air) is involved. An over-the-air attack involves a loudspeaker, a microphone, and a replaying environment that impacts the movement of the sound wave. Our initial experiment confirms that the replay process impacts the effectiveness of the over-the-air attack performance. This study performs an initial investigation towards utilizing a neural replay simulator to improve over-the-air adversarial attack robustness. This is achieved by using a neural waveform synthesizer to simulate the replay process when estimating the adversarial perturbations. Experiments conducted on the ASVspoof2019 dataset confirm that the neural replay simulator can considerably increase the success rates of over-the-air adversarial attacks. This raises the concern for adversarial attacks on speaker verification in physical access applications.
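The idea of simulating the replay process while estimating the perturbation can be sketched with one signed-gradient (FGSM-style) step taken through a simulated channel. Everything here is a stand-in: the replay channel is a fixed impulse response `h` rather than a neural waveform synthesizer, the ASV model is a linear scorer `w`, and the names `replay_matrix` and `over_the_air_step` are hypothetical.

```python
import numpy as np

def replay_matrix(h, n):
    """Lower-triangular Toeplitz matrix H so that H @ x is x filtered by h."""
    H = np.zeros((n, n))
    for k, hk in enumerate(h):
        H += hk * np.eye(n, k=-k)   # place tap k on the k-th sub-diagonal
    return H

def over_the_air_step(x, w, h, eps):
    """One signed-gradient step on score = w . (H @ (x + delta)).

    The gradient is taken through the simulated replay channel, so the
    perturbation is shaped for what the microphone would receive.
    """
    H = replay_matrix(h, len(x))
    grad = H.T @ w                  # d(score) / d(delta)
    return eps * np.sign(grad)

x = np.zeros(4)
w = np.array([1.0, 2.0, 3.0, 4.0])   # toy linear scorer (assumption)
h = np.array([0.5, 0.5])             # toy replay impulse response (assumption)
delta = over_the_air_step(x, w, h, eps=0.1)
```

Because the gradient passes through `H`, the step raises the score of the replayed signal rather than of the digital input, which is the motivation the abstract gives for putting a replay simulator in the loop.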