eess.AS - 2023-10-13

Protecting Voice-Controlled Devices against LASER Injection Attacks

  • paper_url: http://arxiv.org/abs/2310.09404
  • repo_url: https://github.com/hashim19/Laser_Injection_Attack_Identification
  • paper_authors: Hashim Ali, Dhimant Khuttan, Rafi Ud Daula Refat, Hafiz Malik
  • for: Defending MEMS microphones against laser injection attacks
  • methods: Uses time-frequency decomposition and higher-order statistical features to distinguish between acoustic and laser-induced responses
  • results: Experimental results show that the proposed framework correctly classifies $98\%$ of the acoustic and laser-induced responses in a random data partition setting, and achieves $100\%$ accuracy in speaker-independent and text-independent data partition settings.
    Abstract Voice-Controllable Devices (VCDs) have seen an increasing trend towards their adoption due to the small form factor of MEMS microphones and their easy integration into modern gadgets. Recent studies have revealed that MEMS microphones are vulnerable to audio-modulated laser injection attacks. This paper aims to develop countermeasures to detect and prevent laser injection attacks on MEMS microphones. A time-frequency decomposition based on the discrete wavelet transform (DWT) is employed to decompose the microphone output audio signal into n + 1 frequency subbands to capture photo-acoustic artifacts. Higher-order statistical features consisting of the first four moments of the subband audio signals, e.g., variance, skew, and kurtosis, are used to distinguish between acoustic and photo-acoustic responses. An SVM classifier is used to learn the underlying model that differentiates between an acoustic- and laser-induced (photo-acoustic) response in the MEMS microphone. The proposed framework is evaluated on a data set of 190 audio recordings from 19 speakers. The experimental results indicate that the proposed framework is able to correctly classify $98\%$ of the acoustic- and laser-induced audio in a random data partition setting and $100\%$ of the audio in speaker-independent and text-independent data partition settings.
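The detection pipeline described in the abstract (DWT subband decomposition, then the first four moments per subband, then an SVM) can be sketched in pure Python. This is a minimal illustration, not the authors' code: it assumes a Haar wavelet (the paper does not specify the mother wavelet), a signal length divisible by 2^n, and it stops at feature extraction; the resulting vectors would be fed to an off-the-shelf SVM such as `sklearn.svm.SVC`.

```python
import math

def haar_dwt_level(x):
    """One level of the Haar DWT: returns (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return approx, detail

def dwt_subbands(signal, n_levels):
    """Decompose `signal` into n_levels + 1 subbands
    (n detail bands plus the final approximation band)."""
    subbands = []
    approx = list(signal)
    for _ in range(n_levels):
        approx, detail = haar_dwt_level(approx)
        subbands.append(detail)
    subbands.append(approx)
    return subbands

def first_four_moments(x):
    """Mean, variance, skewness, kurtosis of one subband."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return [mean, 0.0, 0.0, 0.0]
    std = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in x) / (n * var ** 2)
    return [mean, var, skew, kurt]

def extract_features(signal, n_levels=4):
    """Concatenate the four moments over all n_levels + 1 subbands;
    this 4 * (n_levels + 1) vector would be the SVM input."""
    feats = []
    for band in dwt_subbands(signal, n_levels):
        feats.extend(first_four_moments(band))
    return feats
```

With `n_levels=4`, a microphone frame yields a 20-dimensional feature vector; the classifier then learns to separate acoustic from photo-acoustic (laser-induced) responses in that space.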

Speaking rate attention-based duration prediction for speed control TTS

  • paper_url: http://arxiv.org/abs/2310.08846
  • repo_url: None
  • paper_authors: Jesuraj Bandekar, Sathvik Udupa, Abhayjeet Singh, Anjali Jayakumar, Deekshitha G, Sandhya Badiger, Saurabh Kumar, Pooja VH, Prasanta Kumar Ghosh
  • for: Controlling the speaking rate of non-autoregressive text-to-speech (TTS) systems
  • methods: Conditions the speaking rate inside the duration predictor, achieving implicit speaking rate control
  • results: Evaluated with objective and subjective metrics, the proposed method achieves higher subjective scores and lower speaking rate errors across many speaking rate factors compared to a baseline model
    Abstract With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, allowing implicit speaking rate control. We show the benefits of this approach by synthesising audio at various speaking rate factors and measuring the quality of speaking rate-controlled synthesised speech. Further, we study the effect of the speaking rate distribution of the training data towards effective rate control. Finally, we fine-tune a baseline pretrained TTS model to obtain speaking rate control TTS. We provide various analyses to showcase the benefits of using this proposed approach, along with objective as well as subjective metrics. We find that the proposed methods have higher subjective scores and lower speaker rate errors across many speaking rate factors over the baseline.
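To make the mechanism concrete: in non-autoregressive TTS, a duration predictor assigns each phoneme a frame count, and a length regulator expands the phoneme sequence accordingly. The paper conditions a learned predictor on the speaking rate; the toy sketch below (not the authors' model) replaces the learned predictor with simple scaling of hypothetical base durations to show how a rate factor changes the expanded frame sequence.

```python
def predict_durations(base_durations, rate_factor):
    """Toy stand-in for a rate-conditioned duration predictor.
    `base_durations` are hypothetical per-phoneme frame counts at the
    normal rate (rate_factor == 1.0); a higher rate_factor yields
    shorter durations, i.e. faster speech. In the actual model the
    rate is an input to a learned predictor, not a fixed divisor."""
    return [max(1, round(d / rate_factor)) for d in base_durations]

def length_regulate(phonemes, durations):
    """Expand each phoneme by its predicted duration, as the
    length regulator in non-autoregressive TTS does."""
    frames = []
    for p, d in zip(phonemes, durations):
        frames.extend([p] * d)
    return frames
```

For example, base durations `[4, 6, 2]` at `rate_factor=2.0` become `[2, 3, 1]`, halving the total frame count and hence the utterance length; conditioning inside the predictor lets the model learn this mapping instead of applying it as a post-hoc stretch.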