results: The findings show that the PNP loss function accelerates perceptual sound matching while preserving perceptual fidelity. The study also evaluates the impact of other design choices, including parameter rescaling, pretraining, auditory representation, and gradient clipping, and finds that PNP-accelerated JTFS has a greater influence on PSM performance than any other design choice.

Abstract
Perceptual sound matching (PSM) aims to find the input parameters to a synthesizer so as to best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is perceptual-neural-physical loss (PNP), which aims at addressing a tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with the joint time-frequency scattering transform (JTFS) as its auditory feature, while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has greater influence on PSM performance than any other design choice.
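To make the linearization idea concrete, here is a minimal sketch, not the authors' implementation: assuming a differentiable synthesizer `synth` and an auditory feature extractor `features` (both hypothetical stand-ins), it precomputes, per training sample, the metric M(theta) = J^T J induced by the Jacobian J of the feature map at the target parameters, so that training only evaluates a cheap quadratic form instead of backpropagating through the synthesizer and the feature extractor.

```python
import torch
from torch.func import jacrev

def phi(theta, synth, features):
    # Composition Phi o g: synthesis parameters -> waveform -> auditory features.
    return features(synth(theta))

def precompute_metric(theta, synth, features):
    # Jacobian of the feature map at the training sample's parameters.
    J = jacrev(lambda t: phi(t, synth, features))(theta)  # (feature_dim, param_dim)
    # Metric M(theta) = J^T J: one small (P x P) matrix per sample,
    # computed offline and cached before gradient descent begins.
    return J.T @ J

def pnp_loss(theta_pred, theta_true, M):
    # Quadratic form (theta_pred - theta_true)^T M (theta_pred - theta_true),
    # a local approximation of the feature-domain reconstruction error.
    d = theta_pred - theta_true
    return d @ M @ d

# Toy stand-ins (assumptions for illustration, not the paper's AM/FM
# arpeggiator or rectangular-membrane models):
def synth(theta):
    # 2 parameters (frequency, decay) -> short "waveform".
    t = torch.linspace(0.0, 1.0, 256)
    return torch.sin(2 * torch.pi * theta[0] * t) * torch.exp(-theta[1] * t)

def features(x):
    # Placeholder for a JTFS-like auditory feature map.
    return torch.abs(torch.fft.rfft(x))

theta_true = torch.tensor([5.0, 2.0])
M = precompute_metric(theta_true, synth, features)  # offline, per sample
theta_pred = torch.tensor([5.1, 1.9], requires_grad=True)
loss = pnp_loss(theta_pred, theta_true, M)
loss.backward()  # gradient never touches synth or features at train time
```

Because each M(theta) depends only on the fixed training sample, the Jacobian pass can be run once per sample in parallel ahead of training; this precomputation is what the abstract's reported 100-fold speedup over per-step DDSP differentiation refers to.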