results: The paper demonstrates the effectiveness of the proposed model through listening tests and objective measures.
Abstract
Timbre transfer techniques aim to convert the sound of a musical piece played on one instrument into the sound of the same piece as if it were played on another instrument, while preserving as much as possible of its musical content, such as melody and dynamics. Following their recent breakthroughs in deep learning-based generation, we apply Denoising Diffusion Models (DDMs) to perform timbre transfer. Specifically, we apply the recently proposed Denoising Diffusion Implicit Models (DDIMs), which accelerate the sampling procedure. Inspired by the recent application of DDMs to image translation problems, we formulate the timbre transfer task similarly: we first convert the audio tracks into log mel spectrograms and then condition the generation of the desired timbre spectrogram on the input timbre spectrogram. We perform both one-to-one and many-to-many timbre transfer, by converting audio waveforms containing only single instruments and multiple instruments, respectively. We compare the proposed technique with existing state-of-the-art methods through both listening tests and objective measures to demonstrate the effectiveness of the proposed model.
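Summary
Timbre transfer aims to convert the sound of a musical piece played on one instrument into the sound of the same piece played on another, while preserving musical characteristics such as melody and dynamics. Building on deep learning-based generation, the authors apply Denoising Diffusion Models (DDMs) to timbre transfer, using the recently proposed Denoising Diffusion Implicit Models (DDIMs) to accelerate sampling. Inspired by applications to image translation, they frame the task similarly: audio tracks are first converted into log mel spectrograms, and generation of the desired timbre spectrogram is conditioned on the input timbre spectrogram. Both one-to-one and many-to-many transfer are performed, covering waveforms with a single instrument and with multiple instruments. The method is compared with existing state-of-the-art methods through listening tests and objective measures.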
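The abstract above says that DDIMs accelerate sampling but gives no equations. As a rough illustration only, the sketch below shows the standard deterministic DDIM update (eta = 0) applied to a flat list of values standing in for spectrogram bins. It is not the authors' implementation: the function name `ddim_step`, the schedule values, and the "perfect" noise predictor are assumptions made for this example. In a real system the predicted noise would come from a trained network that is also conditioned on the input timbre spectrogram.

```python
import math
import random


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    x_t            : noisy sample at the current timestep
    eps_pred       : noise predicted by a (hypothetical) denoising network
    alpha_bar_t    : cumulative noise-schedule product at the current step
    alpha_bar_prev : cumulative product at the earlier, less noisy step
    """
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_ab_prev = math.sqrt(1.0 - alpha_bar_prev)

    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = (x - sqrt_one_minus_ab_t * eps) / sqrt_ab_t
        # Move that estimate to the noise level of the earlier timestep.
        x_prev.append(sqrt_ab_prev * x0_pred + sqrt_one_minus_ab_prev * eps)
    return x_prev


# Toy setup (all values are illustrative assumptions).
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for spectrogram bins
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # the noise actually added
alpha_bar_t, alpha_bar_prev = 0.5, 0.9

# Forward diffusion: mix the clean sample with noise at level alpha_bar_t.
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x0, eps)]

# With a perfect noise prediction, the step lands exactly on the less noisy level.
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
```

Because the update is deterministic, the sampler can skip timesteps (jumping from one `alpha_bar` value to a much smaller noise level) instead of visiting every step of the training schedule, which is where the speed-up comes from.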
Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility
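results: The study finds that the proposed deep learning model accurately predicts speech quality and intelligibility while requiring less training data. It also examines how including subjective speech quality ratings affects speech intelligibility prediction.
Abstract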
Subjective tests are the gold standard for evaluating speech quality and intelligibility, but they are time-consuming and expensive. Thus, objective measures that align with human perceptions are crucial. This study evaluates the correlation between commonly used objective measures and subjective speech quality and intelligibility using a Chinese speech dataset. Moreover, new objective measures are proposed combining current objective measures using deep learning techniques to predict subjective quality and intelligibility. The proposed deep learning model reduces the amount of training data without significantly impacting prediction performance. We interpret the deep learning model to understand how objective measures reflect subjective quality and intelligibility. We also explore the impact of including subjective speech quality ratings on speech intelligibility prediction. Our findings offer valuable insights into the relationship between objective measures and human perceptions.
Translation notes:
* "Subjective tests" is translated as "主观测试" (zhǔguān cèshì), which refers to tests that are evaluated subjectively by human raters.
* "Objective measures" is translated as "客观评价指标" (kèguān píngjià zhǐbiāo), which refers to measures that are computed objectively using quantitative methods.
* "Speech quality" is translated as "语音质量" (yǔyīn zhìliàng), which refers to the overall quality of speech.
* "Intelligibility" is translated as "语音可识别度" (yǔyīn kě shíbié dù), which refers to how well speech can be understood.
* "Deep learning model" is translated as "深度学习模型" (shēndù xuéxí móxíng), which refers to a type of machine learning model that uses artificial neural networks to analyze data.
* "Subjective speech quality ratings" is translated as "主观语音质量评分" (zhǔguān yǔyīn zhìliàng píngfēn), which refers to ratings given by human raters to evaluate the subjective quality of speech.
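The abstract above evaluates how well objective measures correlate with subjective ratings, without stating which correlation coefficients are used. The sketch below assumes two common choices, Pearson (linear agreement) and Spearman (rank agreement), and computes them with the standard library only. The score lists are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-utterance scores (illustrative values only).
objective_scores = [1.2, 2.1, 2.8, 3.5, 4.1]  # e.g. output of an objective metric
subjective_mos = [1.5, 2.0, 3.2, 3.1, 4.6]    # e.g. mean listener ratings

linear_corr = pearson(objective_scores, subjective_mos)
rank_corr = spearman(objective_scores, subjective_mos)
print(f"Pearson: {linear_corr:.3f}, Spearman: {rank_corr:.3f}")
```

Spearman is often reported alongside Pearson because an objective measure can rank utterances correctly even when its relationship to listener ratings is not linear.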
A Demand-Driven Perspective on Generative Audio AI
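results: The survey indicates that dataset availability is currently the biggest bottleneck, alongside existing challenges in audio quality and controllability. The study also proposes possible solutions to some of these issues, supported by empirical evidence.
Abstract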
To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.
Summary
To deploy AI research successfully, we need to understand the demands of the industry. In this paper, the authors survey professional audio engineers to determine research priorities and define various research tasks. They also conclude that dataset availability is the main bottleneck for achieving high-quality audio generation. Finally, they propose possible solutions to some of the issues, supported by empirical evidence.