cs.SD - 2023-08-02

Music De-limiter Networks via Sample-wise Gain Inversion

  • paper_url: http://arxiv.org/abs/2308.01187
  • repo_url: https://github.com/jeonchangbin49/de-limiter
  • paper_authors: Chang-Bin Jeon, Kyogu Lee
  • for: This paper tackles the fallout of the loudness war by recovering uncompressed music from heavily limiter-compressed signals, restoring the dynamic range lost in mastering.
  • methods: The paper proposes music de-limiter networks built on a sample-wise gain inversion (SGI) framework: since a limiter performs sample-wise gain reduction of a signal, the network estimates and inverts that per-sample gain.
  • results: The proposed de-limiter network reconstructs musdb-HQ from musdb-XL (a limiter-applied version of musdb-HQ) with an SI-SDR of 23.8 dB. The authors also release musdb-XL-train, a dataset of 300k segments created with a commercial limiter plug-in, for training real-world-friendly de-limiter networks.
    Abstract The loudness war, an ongoing phenomenon in the music industry characterized by the increasing final loudness of music while reducing its dynamic range, has been a controversial topic for decades. Music mastering engineers have used limiters to heavily compress and make music louder, which can induce ear fatigue and hearing loss in listeners. In this paper, we introduce music de-limiter networks that estimate uncompressed music from heavily compressed signals. Inspired by the principle of a limiter, which performs sample-wise gain reduction of a given signal, we propose the framework of sample-wise gain inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k segments created by applying a commercial limiter plug-in for training real-world friendly de-limiter networks. Our proposed de-limiter network achieves excellent performance with a scale-invariant source-to-distortion ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ from musdb-XL data, a limiter-applied version of musdb-HQ. The training data, code, and model weights are available in our repository (https://github.com/jeonchangbin49/De-limiter).
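To make sample-wise gain inversion concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: a placeholder convolutional backbone predicts one gain value per sample of the limited input, and dividing by that gain approximates the un-limited signal. The backbone, layer sizes, and the sigmoid gain parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeLimiter(nn.Module):
    """Minimal sketch of sample-wise gain inversion (SGI).

    A backbone (placeholder here) predicts one gain value per sample
    of the limited input; dividing by that gain inverts the limiter's
    sample-wise gain reduction.
    """

    def __init__(self, channels=2, hidden=64):
        super().__init__()
        # Hypothetical backbone: any sequence model mapping the
        # waveform to per-sample gains would fit the SGI framing.
        self.backbone = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=15, padding=7),
        )

    def forward(self, limited):  # limited: (batch, channels, samples)
        # A limiter only attenuates, so the applied gain lies in (0, 1];
        # a sigmoid keeps the prediction in that range (an assumption).
        gain = torch.sigmoid(self.backbone(limited))
        # Invert the sample-wise gain reduction; clamping avoids blow-up.
        return limited / gain.clamp(min=1e-4)
```

The 23.8 dB figure is the standard scale-invariant source-to-distortion ratio; for an estimate \(\hat{s}\) of a reference \(s\),

\[
\text{SI-SDR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2},
\qquad
\alpha = \frac{\hat{s}^{\top} s}{\lVert s \rVert^2}.
\]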

Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time

  • paper_url: http://arxiv.org/abs/2308.01040
  • repo_url: None
  • paper_authors: Xinfeng Li, Chen Yan, Xuancun Lu, Zihan Zeng, Xiaoyu Ji, Wenyuan Xu
  • for: This paper extends adversarial attacks on automatic speech recognition (ASR) to practical user-present scenarios, where user awareness would otherwise nullify existing attacks.
  • methods: It proposes VRIFLE, an inaudible adversarial perturbation (IAP) attack that manipulates ASR output by delivering ultrasound while the user speaks.
  • results: Experiments show that VRIFLE achieves effective real-time manipulation under various configurations and against six kinds of defenses, and remains effective even when the user disrupts the attack.
    Abstract Automatic speech recognition (ASR) systems have been shown to be vulnerable to adversarial examples (AEs). Recent successes all assume that users will not notice or disrupt the attack process despite the existence of music/noise-like sounds and spontaneous responses from voice assistants. Nonetheless, in practical user-present scenarios, user awareness may nullify existing attack attempts that launch unexpected sounds or ASR usage. In this paper, we seek to bridge the gap in existing research and extend the attack to user-present scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP) attack via ultrasound delivery that can manipulate ASRs as a user speaks. The inherent differences between audible sounds and ultrasounds make IAP delivery face unprecedented challenges such as distortion, noise, and instability. In this regard, we design a novel ultrasonic transformation model to enhance the crafted perturbation to be physically effective and even survive long-distance delivery. We further enable VRIFLE's robustness by adopting a series of augmentations over user and real-world variations during the generation process. In this way, VRIFLE features effective real-time manipulation of the ASR output from different distances and under any user speech, with an alter-and-mute strategy that suppresses the impact of user disruption. Our extensive experiments in both digital and physical worlds verify VRIFLE's effectiveness under various configurations, robustness against six kinds of defenses, and universality in a targeted manner. We also show that VRIFLE can be delivered with a portable attack device and even everyday-life loudspeakers.
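The abstract describes the perturbation-crafting pipeline only at a high level, so the following is a hedged sketch of how such an IAP might be optimized, not VRIFLE's actual code. Here `ultrasonic_tf` stands in for the paper's ultrasonic transformation model (assumed differentiable), `asr_loss` for a loss scoring how strongly the mixture decodes to the target transcript, and the loop over `speech_batch` mirrors the augmentation over user and real-world variations; every name and hyperparameter is hypothetical.

```python
import torch

def craft_iap(asr_loss, ultrasonic_tf, speech_batch, target_text,
              steps=1000, lr=1e-3, eps=0.05):
    """Hedged sketch of inaudible adversarial perturbation crafting."""
    # One shared perturbation, optimized to work across all variations.
    delta = torch.zeros_like(speech_batch[0], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        # Expectation over variations: each element of speech_batch is a
        # different utterance or augmentation (distance, gain, reverb).
        for speech in speech_batch:
            delivered = ultrasonic_tf(delta)  # simulate ultrasound channel
            loss = loss + asr_loss(speech + delivered, target_text)
        loss.backward()
        opt.step()
        # Keep the perturbation within the transducer's drive range.
        with torch.no_grad():
            delta.clamp_(-eps, eps)
    return delta.detach()
```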

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2308.01018
  • repo_url: None
  • paper_authors: Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh
  • for: To improve the quality of speech synthesized by FastSpeech2
  • methods: Enrich the FastSpeech2 encoder outputs with representations from Self-Supervised Learning (SSL) models, reconstructed via a second encoder stack
  • results: Outperforms the baseline FastSpeech2 on both objective and subjective evaluation measures
    Abstract While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. As a part of this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features. The SALTTS-cascade implementation, however, passes these representations through the decoder in addition to having the reconstruction loss. The richness of speech characteristics from the SSL features reflects in the output speech quality, with the objective and subjective evaluation measures of the proposed approach outperforming the baseline FastSpeech2.
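As a concrete reading of the architecture, here is a sketch of the SALTTS-parallel auxiliary branch; layer counts, dimensions, and the L1 reconstruction loss are assumptions, since the exact configuration is not given in the abstract.

```python
import torch.nn as nn

class SALTTSParallelHead(nn.Module):
    """Sketch of the SALTTS-parallel auxiliary branch (assumed shapes).

    The FastSpeech2 encoder's length-regulated outputs pass through a
    second stack of encoder layers, and an auxiliary loss pulls the
    result toward frozen SSL features (e.g. wav2vec 2.0 or HuBERT).
    """

    def __init__(self, d_model=256, ssl_dim=768, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.second_encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, ssl_dim)

    def forward(self, length_regulated, ssl_target):
        # length_regulated: (batch, frames, d_model)
        # ssl_target:       (batch, frames, ssl_dim), precomputed offline
        pred = self.proj(self.second_encoder(length_regulated))
        aux_loss = nn.functional.l1_loss(pred, ssl_target)
        return pred, aux_loss
```

In the SALTTS-cascade variant, `pred` would additionally be routed through the decoder rather than serving only the reconstruction loss.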