results: We present a CRNN model that estimates the distance of moving sound sources accurately across multiple datasets, outperforming a recently published method. We also analyze the model's performance as a function of the source's true distance and of different training losses, finding that performance is best when training uses a loss that weighs errors inversely with the true source distance. This study is the first to achieve deep-learning-based sound source distance estimation across diverse listening conditions.
Abstract
Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized on large open-source datasets of microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings from non-coincident microphones and use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals that training is optimal with a loss that weighs model errors as an inverse function of the sound source's true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.
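To make that weighting concrete, below is a minimal sketch of an inverse-distance-weighted regression loss in PyTorch. The L1 base error and the `eps` stabilizer are illustrative assumptions; the paper's exact loss formulation may differ.

```python
import torch

def inverse_distance_weighted_loss(pred_dist: torch.Tensor,
                                   true_dist: torch.Tensor,
                                   eps: float = 1.0) -> torch.Tensor:
    """L1 error weighted by the inverse of the true source distance,
    so errors on nearby sources are penalized more than errors on
    distant ones. Illustrative sketch; not the paper's exact loss."""
    weights = 1.0 / (true_dist + eps)  # eps keeps the weight finite near 0 m
    return (weights * torch.abs(pred_dist - true_dist)).mean()

# A 0.5 m error at 1 m costs more than the same absolute error at 10 m.
pred = torch.tensor([1.5, 10.5])
true = torch.tensor([1.0, 10.0])
print(inverse_distance_weighted_loss(pred, true))
```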
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
for: This paper aims to improve style voice conversion by using natural language prompts to generate a style vector and by adapting the duration of discrete tokens.
methods: The proposed approach, called PromptVC, employs a latent diffusion model to sample the style vector from noise, with the process being conditioned on natural language prompts. The system also uses HuBERT to extract discrete tokens and replace them with the K-Means center embedding to minimize residual style information.
results: The subjective and objective evaluation results demonstrate the effectiveness of the proposed system, with improved style expressiveness and adaptability to different styles.
Abstract
Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which limits style diversity or falls short in the intuitiveness and interpretability of the style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and the latent diffusion model is then trained independently to sample the style vector from noise, with this process conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embeddings to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate repeated discrete tokens and employ a differentiable duration predictor to re-predict the duration of each token, which adapts the duration of the same linguistic content to different styles. Subjective and objective evaluation results demonstrate the effectiveness of our proposed system.
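As a concrete illustration of the deduplication step, the sketch below run-length encodes consecutive identical discrete tokens (e.g., HuBERT K-Means indices); the resulting run lengths are the kind of per-token durations a duration predictor would then re-predict for the target style. The function name and plain-Python implementation are illustrative assumptions, not the paper's code.

```python
from itertools import groupby

def deduplicate_tokens(tokens):
    """Collapse runs of identical discrete tokens into one unit each,
    keeping each run's length. The lengths can serve as ground-truth
    targets when training a duration predictor. Illustrative sketch."""
    units, durations = [], []
    for token, run in groupby(tokens):
        units.append(token)
        durations.append(sum(1 for _ in run))
    return units, durations

tokens = [5, 5, 5, 12, 12, 7, 7, 7, 7]
units, durations = deduplicate_tokens(tokens)
# units -> [5, 12, 7]; durations -> [3, 2, 4]
```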
Zero- and Few-shot Sound Event Localization and Detection
results: The results show that the embed-ACCDOA model improves zero- and few-shot sound event localization and detection, performing comparably to a baseline system trained on the complete training data.
Abstract
Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well on various sets of target classes, but they only output the DOA and temporal activation of preset classes that are trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable with embeddings from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embeddings and the corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and the CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with the complete training data on an evaluation dataset.
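For readers unfamiliar with the ACCDOA format, the sketch below encodes and decodes a single event as an activity-coupled Cartesian DOA vector: the vector's direction gives the DOA, and its norm encodes activity. The 0.5 activity threshold is an assumed value, and the paper's track-wise coupling with CLAP embeddings is not shown.

```python
import numpy as np

def accdoa_encode(azimuth_deg: float, elevation_deg: float,
                  active: bool) -> np.ndarray:
    """Encode one event as an ACCDOA vector: a unit vector toward the
    source when active, the zero vector otherwise, so one output jointly
    carries detection and localization. Illustrative sketch."""
    if not active:
        return np.zeros(3)
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def accdoa_decode(vec: np.ndarray, threshold: float = 0.5):
    """Decode: a norm above the (assumed) threshold means the event is
    active; the normalized vector gives the estimated DOA."""
    norm = np.linalg.norm(vec)
    active = norm > threshold
    doa = vec / norm if norm > 0 else vec
    return active, doa
```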