cs.SD - 2023-11-13

Distributed pressure matching strategy using diffusion adaptation

  • paper_url: http://arxiv.org/abs/2311.07729
  • repo_url: None
  • paper_authors: Mengfei Zhang, Junqing Zhang, Jie Chen, Cédric Richard
  • for: Addresses the time-varying acoustics underlying personal sound zone (PSZ) tasks.
  • methods: Proposes a distributed pressure-matching (PM) method based on diffusion adaptation (DPM-D) that spreads the computational load across nodes, avoiding the high computational complexity and costly accuracy requirements of centralized approaches.
  • results: Simulations over multi-frequency bins and a computational complexity analysis show that, compared with centralized methods, the distributed PM method achieves higher computational efficiency and accuracy in multi-frequency distributed settings.
    Abstract Personal sound zone (PSZ) systems, which aim to create listening (bright) and silent (dark) zones in neighboring regions of space, are often based on time-varying acoustics. Conventional adaptive-based methods for handling PSZ tasks suffer from the collection and processing of acoustic transfer functions (ATFs) between all the matching microphones and all the loudspeakers in a centralized manner, resulting in high calculation complexity and costly accuracy requirements. This paper presents a distributed pressure-matching (PM) method relying on diffusion adaptation (DPM-D) to spread the computational load amongst nodes in order to overcome these issues. The global PM problem is defined as a sum of local costs, and the diffusion adaptation approach is then used to create a distributed solution that just needs local information exchanges. Simulations over multi-frequency bins and a computational complexity analysis are conducted to evaluate the properties of the algorithm and to compare it with centralized counterparts.
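To make the diffusion-adaptation idea concrete, below is a minimal numpy sketch of an adapt-then-combine update for a pressure-matching cost split across nodes. The node count, network topology, step size, and combination weights are illustrative assumptions, not the paper's configuration; each node only exchanges its intermediate estimate with its neighbors.

```python
import numpy as np

# Illustrative sizes (not from the paper): 4 nodes, 8 loudspeakers,
# 3 matching microphones per node, one frequency bin.
rng = np.random.default_rng(0)
n_nodes, n_ls, n_mics = 4, 8, 3

G = [rng.standard_normal((n_mics, n_ls)) + 1j * rng.standard_normal((n_mics, n_ls))
     for _ in range(n_nodes)]                          # local ATF matrices
p = [rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
     for _ in range(n_nodes)]                          # local target pressures

# Ring topology with self-loops; A[l, k] is the weight node k gives to neighbor l.
A = np.zeros((n_nodes, n_nodes))
for k in range(n_nodes):
    for l in (k - 1, k, k + 1):
        A[l % n_nodes, k] = 1.0
A /= A.sum(axis=0, keepdims=True)                      # columns sum to 1

w = [np.zeros(n_ls, complex) for _ in range(n_nodes)]  # local filter estimates
mu = 0.01                                              # step size (assumed)

for _ in range(2000):
    # Adapt: each node takes a gradient step on its own pressure-matching cost.
    psi = [w[k] - mu * G[k].conj().T @ (G[k] @ w[k] - p[k]) for k in range(n_nodes)]
    # Combine: each node averages the intermediate estimates of its neighbors.
    w = [sum(A[l, k] * psi[l] for l in range(n_nodes)) for k in range(n_nodes)]

global_cost = sum(np.linalg.norm(G[k] @ w[k] - p[k]) ** 2 for k in range(n_nodes))
print(f"sum of local PM costs after diffusion adaptation: {global_cost:.4f}")
```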

Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model

  • paper_url: http://arxiv.org/abs/2311.07363
  • repo_url: https://github.com/mathieulagrange/ddspmusicbandwidthextension
  • paper_authors: Pierre-Amaury Grumiaux, Mathieu Lagrange
  • for: Audio signal bandwidth extension, specifically for monophonic and polyphonic musical signals.
  • methods: Uses a differentiable digital signal processing (DDSP) approach: a neural network with relatively few parameters is trained to infer the parameters of a differentiable signal-processing model that efficiently generates the full-band audio signal.
  • results: The proposed models surpass a higher-complexity deep learning model on an objective metric computed in the frequency domain, and a MUSHRA listening test confirms their superior perceptual quality.
    Abstract The task of bandwidth extension addresses the generation of missing high frequencies of audio signals based on knowledge of the low-frequency part of the sound. This task applies to various problems, such as audio coding or audio restoration. In this article, we focus on efficient bandwidth extension of monophonic and polyphonic musical signals using a differentiable digital signal processing (DDSP) model. Such a model is composed of a neural network part with relatively few parameters trained to infer the parameters of a differentiable digital signal processing model, which efficiently generates the output full-band audio signal. We first address bandwidth extension of monophonic signals, and then propose two methods to explicitly handle polyphonic signals. The benefits of the proposed models are first demonstrated on monophonic and polyphonic synthetic data against a baseline and a deep-learning-based resnet model. The models are next evaluated on recorded monophonic and polyphonic data, for a wide variety of instruments and musical genres. We show that all proposed models surpass a higher complexity deep learning model for an objective metric computed in the frequency domain. A MUSHRA listening test confirms the superiority of the proposed approach in terms of perceptual quality.
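The differentiable synthesizer at the heart of this approach is a harmonic-plus-noise signal model. The sketch below is a plain numpy version of that signal model; in the paper the fundamental frequency, harmonic amplitudes, and noise filter would be predicted by the neural network, whereas here they are made-up inputs for illustration.

```python
import numpy as np

def harmonic_plus_noise(f0, harm_amps, noise_mag, sr=16000):
    """Synthesize audio as a sum of harmonics plus filtered noise.

    f0        : (T,) fundamental-frequency trajectory in Hz, sampled at audio rate
    harm_amps : (T, K) per-sample amplitude of each of K harmonics
    noise_mag : (F,) magnitude response applied to white noise
    """
    T, K = harm_amps.shape
    phase = 2 * np.pi * np.cumsum(f0) / sr                   # instantaneous phase of the fundamental
    k = np.arange(1, K + 1)
    harmonics = np.sin(phase[:, None] * k) * harm_amps        # each harmonic at k * f0
    harmonics *= (f0[:, None] * k < sr / 2)                   # drop harmonics above Nyquist
    harmonic_part = harmonics.sum(axis=1)

    noise = np.random.randn(T)
    spec = np.fft.rfft(noise)
    spec *= np.interp(np.linspace(0, 1, spec.size),
                      np.linspace(0, 1, noise_mag.size), noise_mag)
    noise_part = np.fft.irfft(spec, n=T)
    return harmonic_part + noise_part

# Toy example: a 440 Hz tone with decaying harmonics and faint broadband noise.
sr, dur = 16000, 1.0
T = int(sr * dur)
f0 = np.full(T, 440.0)
harm_amps = np.outer(np.ones(T), 0.5 ** np.arange(8))
noise_mag = np.linspace(0.0, 0.05, 64)
audio = harmonic_plus_noise(f0, harm_amps, noise_mag, sr)
```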

Zero-Shot Duet Singing Voices Separation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.07345
  • repo_url: https://github.com/yoyololicon/duet-svs-diffusion
  • paper_authors: Chin-Yun Yu, Emilian Postolache, Emanuele Rodolà, György Fazekas
  • for: Addresses source separation as an audio inverse problem, specifically separating duet singing voices while keeping each singer's identity consistent in the separated audio.
  • methods: Uses a diffusion model as a prior and samples the target signals from the posterior distribution by manipulating the diffusion process; posterior sampling is performed auto-regressively over overlapping segments, with each segment conditioned on the previous one to enforce singer-identity coherence.
  • results: On the MedleyVox dataset, the proposed method outperforms the naive posterior-sampling baseline and better preserves singer identity.
    Abstract In recent studies, diffusion models have shown promise as priors for solving audio inverse problems. These models allow us to sample from the posterior distribution of a target signal given an observed signal by manipulating the diffusion process. However, when separating audio sources of the same type, such as duet singing voices, the prior learned by the diffusion process may not be sufficient to maintain the consistency of the source identity in the separated audio. For example, the singer may change from one to another occasionally. Tackling this problem will be useful for separating sources in a choir, or a mixture of multiple instruments with similar timbre, without acquiring large amounts of paired data. In this paper, we examine this problem in the context of duet singing voices separation, and propose a method to enforce the coherency of singer identity by splitting the mixture into overlapping segments and performing posterior sampling in an auto-regressive manner, conditioning on the previous segment. We evaluate the proposed method on the MedleyVox dataset and show that the proposed method outperforms the naive posterior sampling baseline. Our source code and the pre-trained model are publicly available at https://github.com/yoyololicon/duet-svs-diffusion.
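A schematic sketch of the segment-wise, auto-regressive posterior sampling loop described in the abstract. The sampler itself is left as a placeholder, and the segment length, overlap, and cross-fade are assumptions for illustration; the released repository implements the actual method.

```python
import numpy as np

def separate_duet(mixture, posterior_sample, seg_len=65536, overlap=16384):
    """Separate a duet mixture segment by segment, conditioning each posterior
    sample on the tail of the previously separated segment so the two singer
    identities stay consistent across segments.

    posterior_sample(mix_seg, prev_tail) -> (src_a, src_b) is assumed to run
    diffusion posterior sampling for one segment (placeholder here).
    """
    hop = seg_len - overlap
    sources = [np.zeros_like(mixture), np.zeros_like(mixture)]
    prev_tail = None
    for start in range(0, len(mixture) - overlap, hop):
        mix_seg = mixture[start:start + seg_len]
        src_a, src_b = posterior_sample(mix_seg, prev_tail)
        for out, est in zip(sources, (src_a, src_b)):
            if prev_tail is None:
                out[start:start + len(est)] = est
            else:
                # Cross-fade the overlapping region, keep the new tail verbatim.
                fade = np.linspace(0.0, 1.0, overlap)
                out[start:start + overlap] = ((1 - fade) * out[start:start + overlap]
                                              + fade * est[:overlap])
                out[start + overlap:start + len(est)] = est[overlap:]
        prev_tail = (src_a[-overlap:], src_b[-overlap:])
    return sources

# Toy usage with a dummy sampler that just splits the mixture evenly.
dummy = lambda seg, prev: (0.5 * seg, 0.5 * seg)
mix = np.random.randn(4 * 65536).astype(np.float32)
voice_a, voice_b = separate_duet(mix, dummy, seg_len=65536, overlap=16384)
```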

Research and experimental verification on low-frequency long-range underwater sound propagation dispersion characteristics under dual-channel sound speed profiles in the Chukchi Plateau

  • paper_url: http://arxiv.org/abs/2311.08425
  • repo_url: None
  • paper_authors: Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Ruichao Xue
  • for: Researched the low-frequency wide-band sound signal propagation characteristics under dual-channel sound speed profiles in the Chukchi Plateau and the Canadian Basin
  • methods: Used the theory of normal modes to study the fine structure of low-frequency wide-band sound propagation dispersion under dual-channel sound speed profiles, and used a modified warping operator to separate the normal modes
  • results: Explained the intersection of normal mode dispersion curves caused by the dual-channel sound speed profile, analyzed the blocking effect of seabed terrain changes on dispersion structures, and verified the results through a long-range seismic exploration experiment at the Chukchi Plateau. Additionally, proposed two methods for estimating the distance of sound sources based on acoustic signal characteristics in this environment, and verified these methods through experiment data at sea.
    Abstract The dual-channel sound speed profiles of the Chukchi Plateau and the Canadian Basin have become current research hotspots due to their excellent low-frequency sound signal propagation ability. Previous research has mainly focused on using sound propagation theory to explain the changes in sound signal energy. This article is mainly based on the theory of normal modes to study the fine structure of low-frequency wide-band sound propagation dispersion under dual-channel sound speed profiles. In this paper, the intersection of normal mode dispersion curves caused by the dual-channel sound speed profile (SSP) is explained, the blocking effect of seabed terrain changes on dispersion structures is analyzed, and the normal modes are separated using a modified warping operator. These results have been verified through a long-range seismic exploration experiment at the Chukchi Plateau. In addition, based on the acoustic signal characteristics in this environment, two methods for estimating the distance of sound sources are proposed and verified with experimental data collected at sea.
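For orientation, a dispersion curve plots each mode's arrival time against frequency. The snippet below computes t_m(f) = r / v_g,m(f) for an ideal isovelocity waveguide with a pressure-release surface and rigid bottom; the dual-channel environment studied here produces the more complicated, intersecting curves that motivate the modified warping operator, which is not reproduced in this sketch.

```python
import numpy as np

def ideal_waveguide_dispersion(freqs, depth, c, r, n_modes=4):
    """Arrival time of each mode versus frequency in an ideal isovelocity
    waveguide (pressure-release surface, rigid bottom).

    Mode m has vertical wavenumber k_z = (m - 1/2) * pi / depth, horizontal
    wavenumber k_r = sqrt(k^2 - k_z^2) and group velocity v_g = c * k_r / k;
    a mode only propagates above its cut-off, where k_r is real.
    """
    omega = 2 * np.pi * freqs
    k = omega / c
    times = np.full((n_modes, len(freqs)), np.nan)
    for m in range(1, n_modes + 1):
        k_z = (m - 0.5) * np.pi / depth
        prop = k > k_z                                   # above cut-off
        k_r = np.sqrt(k[prop] ** 2 - k_z ** 2)
        v_g = c * k_r / k[prop]
        times[m - 1, prop] = r / v_g                     # arrival time t_m(f)
    return times

# Toy numbers (not from the paper): 200 m deep channel, 100 km range.
freqs = np.linspace(5, 100, 200)                         # Hz
t_mf = ideal_waveguide_dispersion(freqs, depth=200.0, c=1440.0, r=100_000.0)
```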

SponTTS: modeling and transferring spontaneous style for TTS

  • paper_url: http://arxiv.org/abs/2311.07179
  • repo_url: https://github.com/kkksuper/SponTTS
  • paper_authors: Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
  • for: Models and transfers spontaneous speaking style for TTS, improving the naturalness, expressiveness, and speaker similarity of generated spontaneous speech.
  • methods: Proposes a two-stage approach: in the first stage, a Conditional Variational Autoencoder (CVAE) captures spontaneous prosody from bottleneck (BN) features, constrained by a spontaneous-phenomena embedding prediction loss, and a flow-based predictor predicts the latent spontaneous style representation from text to enrich prosody and context-specific spontaneous phenomena at inference; in the second stage, a VITS-like module transfers the learned spontaneous style to target speakers.
  • results: Experiments show that SponTTS effectively models spontaneous style and transfers it to target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity; a zero-shot spontaneous-style TTS test further verifies its generalization and robustness for unseen speakers.
    Abstract Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.
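A compact PyTorch-style sketch of the first-stage idea: a CVAE posterior encoder over bottleneck (BN) features whose latent is additionally constrained to predict spontaneous-phenomena labels. Dimensions, layer choices, and the label set are assumptions, and the flow-based predictor, TTS decoder, and second-stage VITS-like module are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpontaneousCVAE(nn.Module):
    """Toy first-stage module: encode spontaneous prosody from BN features into a
    latent z, regularize z with a KL term, and constrain it to predict
    spontaneous-phenomena labels (e.g. filled pause / prolongation / none)."""

    def __init__(self, bn_dim=256, latent_dim=64, n_phenomena=3):
        super().__init__()
        self.encoder = nn.GRU(bn_dim, 128, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.phenomena_head = nn.Linear(latent_dim, n_phenomena)

    def forward(self, bn_feats):                     # bn_feats: (B, T, bn_dim)
        h, _ = self.encoder(bn_feats)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        phenomena_logits = self.phenomena_head(z)                  # (B, T, n_phenomena)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, phenomena_logits, kl

# Toy loss assembly for one batch (the TTS reconstruction term is left abstract).
model = SpontaneousCVAE()
bn = torch.randn(2, 100, 256)                        # fake BN features
labels = torch.randint(0, 3, (2, 100))               # fake per-frame phenomena labels
z, logits, kl = model(bn)
phenomena_loss = F.cross_entropy(logits.reshape(-1, 3), labels.reshape(-1))
loss = kl + phenomena_loss                           # + reconstruction loss in practice
```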

Research and experimental verification on low-frequency long-range sound propagation characteristics under ice-covered and range-dependent marine environment in the Arctic

  • paper_url: http://arxiv.org/abs/2311.07175
  • repo_url: None
  • paper_authors: Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Ruichao Xue
  • for: Studies the propagation of low-frequency broadband acoustic signals under the Arctic ice, focusing on the time-domain waveform and fine dispersion structure of the signals rather than transmission loss alone.
  • methods: Based on normal-mode theory, derives the horizontal wavenumber expression and warping transformation operator for refractive normal modes in the Arctic deep-sea environment; uses measured ocean environmental parameters and sound-field simulations to study the general laws of low-frequency long-range propagation and to explain how seabed terrain changes, horizontal variations of the sound speed profile, and sea-ice cover affect low-frequency long-range sound propagation in the Arctic.
  • results: Validates these findings with an Arctic sound propagation experiment over distances exceeding 1000 km, and is the first to apply the warping transformation of refractive normal modes to achieve single-hydrophone separation of normal modes and extraction of dispersion structures.
    Abstract At present, research on sound propagation under the Arctic ice mainly focuses on modeling and experimental verification of sound propagation under sea ice cover and unique sound velocity profiles. The main quantity of concern in that work is sound transmission loss; this article instead delves into the time-domain waveform and fine dispersion structure of low-frequency broadband acoustic signals. Firstly, based on the theory of normal modes, this article derives the horizontal wavenumber expression and warping transformation operator for refractive normal modes in the Arctic deep-sea environment. Subsequently, based on measured ocean environmental parameters and sound field simulation calculations, this article studies the general laws of low-frequency long-range sound propagation signals in the Arctic deep-sea environment, and elucidates the impact mechanism of environmental factors such as seabed terrain changes, horizontal changes in sound velocity profiles (SSPs), and sea ice cover on low-frequency long-range sound propagation in the Arctic. This article validates the above research viewpoint through a sound propagation experiment conducted in the Arctic with a propagation distance exceeding 1000 km. The marine environment of this experiment has obvious horizontal variation characteristics. At the same time, this article is the first to utilize the warping transformation of refractive normal modes in Arctic waters to achieve single-hydrophone separation of normal modes and extraction of dispersion structures, which is conducive to future research on underwater sound source localization and environmental parameter inversion based on dispersion structures.
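Warping-based mode separation amounts to resampling the received time series so that each mode's dispersive arrival becomes quasi-tonal and can be isolated by time-frequency masking. The sketch below uses the classical ideal-waveguide operator h(t) = sqrt(t^2 + t_r^2); the paper derives a dedicated operator for refractive normal modes in the Arctic deep-sea environment, which is not reproduced here.

```python
import numpy as np

def warp_signal(x, sr, t_start, t_r):
    """Warp a received signal with the ideal-waveguide operator h(t) = sqrt(t^2 + t_r^2).

    x       : received time series whose first sample sits at absolute time t_start
              (time origin at the source emission), so arrivals lie near t_r = r / c_w
    sr      : sampling rate in Hz
    Returns y(t) = sqrt(h'(t)) * x(h(t)); after warping, each mode becomes quasi-tonal
    and can be isolated by masking in the warped time-frequency plane.
    """
    n = len(x)
    t_axis = t_start + np.arange(n) / sr          # absolute time of each sample of x
    t_warp = np.arange(n) / sr                    # warped-domain time axis (starts at 0)
    h = np.sqrt(t_warp ** 2 + t_r ** 2)           # warping function
    dh = t_warp / h                               # derivative h'(t), for energy preservation
    y = np.interp(h, t_axis, x, left=0.0, right=0.0) * np.sqrt(dh)
    return y

# Toy usage (made-up numbers): 10 km range in 1440 m/s water,
# 2 s of data starting at the earliest possible arrival.
sr, r, c_w = 1000, 10_000.0, 1440.0
t_r = r / c_w
x = np.random.randn(2 * sr)                       # stand-in for the received chunk
y = warp_signal(x, sr, t_start=t_r, t_r=t_r)
```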

Music ControlNet: Multiple Time-varying Controls for Music Generation

  • paper_url: http://arxiv.org/abs/2311.07069
  • repo_url: None
  • paper_authors: Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan
  • for: Proposes a diffusion-based music generation model that gives creators multiple precise, time-varying controls over the generated audio.
  • methods: Fine-tunes a diffusion-based conditional generative model over audio spectrograms, conditioned on melody, dynamics, and rhythm controls extracted from training audio, in a manner analogous to the pixel-wise conditioning of the image-domain ControlNet; a new strategy additionally allows controls that are only partially specified in time.
  • results: Generates high-quality music that follows the time-varying control inputs; compared with MusicGen, the model produces music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and supporting two additional forms of time-varying control.
    Abstract Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.
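To illustrate what such time-varying controls can look like, here is a sketch of extracting rough melody, dynamics, and rhythm curves from a training clip with librosa. These are generic stand-ins; the paper's exact control definitions, frame rates, and representations may differ.

```python
import numpy as np
import librosa

def extract_controls(path, sr=22050, hop=512):
    """Derive rough melody / dynamics / rhythm control signals from audio.
    Generic stand-ins for the paper's controls, shown for illustration only."""
    y, _ = librosa.load(path, sr=sr)

    # Melody: frame-wise fundamental frequency (NaN where unvoiced).
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr, hop_length=hop)

    # Dynamics: frame-wise RMS energy in dB.
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    dynamics = librosa.amplitude_to_db(rms, ref=np.max)

    # Rhythm: onset strength envelope plus an impulse train at detected beats.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    _, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr,
                                             hop_length=hop)
    rhythm = np.zeros_like(onset_env)
    rhythm[beat_frames] = 1.0

    return f0, dynamics, rhythm
```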

Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

  • paper_url: http://arxiv.org/abs/2311.07062
  • repo_url: None
  • paper_authors: Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu, Lei Xie
  • for: Proposes a multi-task speech and accent recognition model (DIMNet) to improve speech recognition accuracy in multi-accent scenarios.
  • methods: Decouples the tasks into a connectionist temporal classification (CTC) branch, an automatic speech recognition (ASR) branch, and an accent recognition (AR) branch, using modeling units of two granularities to learn task-specific representations; the tasks then interact, with aligned text and accent embeddings exchanged between branches to improve both ASR and AR.
  • results: On English and Chinese datasets, the proposed model achieves 21.45%/28.53% relative improvement in AR accuracy and 32.33%/14.55% relative reduction in ASR error rate over a published standard baseline.
    Abstract Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related accent characteristics, while coarse-grained units are better for learning linguistic information. Moreover, an explicit interaction of two tasks can also provide complementary information and improve the performance of each other, but it is rarely used by existing approaches. In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR are first decoupled by separated branches and two-granular modeling units to learn task-specific representations. The AR branch is from our previously proposed linguistic-acoustic bimodal AR model and the ASR branch is an encoder-decoder based Conformer model. Then, for the task interaction, the CTC branch provides aligned text for the AR task, while accent embeddings extracted from our AR model are incorporated into the ASR branch's encoder and decoder. Finally, during ASR inference, a cross-granular rescoring method is introduced to fuse the complementary information from the CTC and attention decoder after the decoupling. Our experiments on English and Chinese datasets demonstrate the effectiveness of the proposed model, which achieves 21.45%/28.53% AR accuracy relative improvement and 32.33%/14.55% ASR error rate relative reduction over a published standard baseline, respectively.
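The cross-granular rescoring step can be pictured as a log-linear fusion of the CTC branch's score and the attention decoder's score for each first-pass hypothesis. The sketch below is a generic shallow-fusion-style rescorer with an assumed interpolation weight and toy scoring functions, not the paper's exact formulation.

```python
def rescore_nbest(hypotheses, ctc_score_fn, attn_score_fn, lam=0.3):
    """Re-rank n-best hypotheses by fusing two decoders' log-probabilities.

    hypotheses     : list of candidate token sequences from first-pass decoding
    ctc_score_fn   : hyp -> CTC log-probability (fine-grained units)
    attn_score_fn  : hyp -> attention-decoder log-probability (coarse-grained units)
    lam            : interpolation weight (assumed value; tuned on dev data in practice)
    """
    scored = []
    for hyp in hypotheses:
        fused = lam * ctc_score_fn(hyp) + (1.0 - lam) * attn_score_fn(hyp)
        scored.append((fused, hyp))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[0][1], scored                   # best hypothesis plus full ranking

# Toy usage with fake scoring functions over token lists.
fake_ctc = lambda hyp: -0.5 * len(hyp)
fake_attn = lambda hyp: -0.4 * len(hyp) + (0.2 if "accent" in hyp else 0.0)
best, ranking = rescore_nbest([["set", "the", "accent"], ["set", "the", "ascent"]],
                              fake_ctc, fake_attn)
```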