cs.SD - 2023-10-17

High-Fidelity Noise Reduction with Differentiable Signal Processing

  • paper_url: http://arxiv.org/abs/2310.11364
  • repo_url: None
  • paper_authors: Christian J. Steinmetz, Thomas Walther, Joshua D. Reiss
  • for: Improving the perceptual quality of recorded audio by combining deep learning with signal processing.
  • methods: A signal processing-based denoiser is paired with a neural network controller, enabling fully automatic, high-fidelity noise reduction on both speech and music (a toy sketch of this controller-plus-DSP pattern follows the entry).
  • results: Experiments show the approach matches the noise-reduction quality of deep learning models while being significantly more efficient and, in some cases, introducing fewer artifacts. Listening examples are available online at https://tape.it/research/denoiser.
    Abstract Noise reduction techniques based on deep learning have demonstrated impressive performance in enhancing the overall quality of recorded speech. While these approaches are highly performant, their application in audio engineering can be limited due to a number of factors. These include operation only on speech without support for music, lack of real-time capability, lack of interpretable control parameters, operation at lower sample rates, and a tendency to introduce artifacts. On the other hand, signal processing-based noise reduction algorithms offer fine-grained control and operation on a broad range of content; however, they often require manual operation to achieve the best results. To address the limitations of both approaches, in this work we introduce a method that leverages a signal processing-based denoiser that, when combined with a neural network controller, enables fully automatic and high-fidelity noise reduction on both speech and music signals. We evaluate our proposed method with objective metrics and a perceptual listening test. Our evaluation reveals that speech enhancement models can be extended to music; however, training the model to remove only stationary noise is critical. Furthermore, our proposed approach achieves performance on par with the deep learning models, while being significantly more efficient and introducing fewer artifacts in some cases. Listening examples are available online at https://tape.it/research/denoiser.
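
To make the hybrid design concrete, here is a minimal PyTorch sketch of the controller-plus-DSP pattern the abstract describes: a small network predicts interpretable per-band thresholds for a differentiable spectral gate, so the whole chain trains end-to-end. This is not the authors' implementation; the module names, network sizes, and the sigmoid gate are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpectralGateDenoiser(nn.Module):
    """Differentiable spectral gate: attenuates STFT bins whose magnitude
    falls below a per-band threshold supplied by the controller."""

    def __init__(self, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop

    def forward(self, audio, thresholds):
        # audio: (batch, samples); thresholds: (batch, n_fft // 2 + 1)
        window = torch.hann_window(self.n_fft, device=audio.device)
        spec = torch.stft(audio, self.n_fft, self.hop, window=window,
                          return_complex=True)
        # Sigmoid instead of a hard gate keeps the operation differentiable;
        # the factor 10.0 sets how sharply the gate closes below threshold.
        gain = torch.sigmoid((spec.abs() - thresholds.unsqueeze(-1)) * 10.0)
        return torch.istft(spec * gain, self.n_fft, self.hop, window=window,
                           length=audio.shape[-1])


class Controller(nn.Module):
    """Small network mapping the time-averaged noisy magnitude spectrum to
    per-band gate thresholds -- the interpretable control parameters."""

    def __init__(self, n_bins=513):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins), nn.Softplus())

    def forward(self, avg_spectrum):
        return self.net(avg_spectrum)  # Softplus keeps thresholds >= 0


controller, denoiser = Controller(), SpectralGateDenoiser()
noisy = torch.randn(2, 16000)  # stand-in for a batch of noisy audio
window = torch.hann_window(1024)
avg_mag = torch.stft(noisy, 1024, 256, window=window,
                     return_complex=True).abs().mean(dim=-1)
denoised = denoiser(noisy, controller(avg_mag))
# End-to-end training: gradients flow through the DSP stage into the controller.
loss = F.l1_loss(denoised, torch.randn(2, 16000))  # placeholder clean target
loss.backward()
```

Because the denoiser is plain signal processing at inference time, the predicted thresholds can also be inspected or hand-tuned, which is the interpretability argument the abstract makes.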

Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion

  • paper_url: http://arxiv.org/abs/2310.11160
  • repo_url: None
  • paper_authors: Xueyao Zhang, Yicheng Gu, Haopeng Chen, Zihao Fang, Lexiao Zou, Liumeng Xue, Zhizheng Wu
  • for: Investigating the complementary roles of multiple content features in singing voice conversion (SVC) and developing a diffusion-based SVC model that integrates these features for superior conversion performance.
  • methods: Three distinct content features (from WeNet, Whisper, and ContentVec) are integrated into a diffusion-based SVC model to improve both objective and subjective performance (see the fusion sketch after this entry).
  • results: Integrating multiple content features yields better objective and subjective evaluation results than any single source of content features. Code and a demo page are available at https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html.
    Abstract Singing voice conversion (SVC) is a technique to enable an arbitrary singer to sing an arbitrary song. To achieve that, it is important to obtain speaker-agnostic representations from source audio, which is a challenging task. A common solution is to extract content-based features (e.g., PPGs) from a pretrained acoustic model. However, the choices for acoustic models are vast and varied. It is yet to be explored what the characteristics of content features from different acoustic models are, and whether integrating multiple content features can be mutually beneficial. Motivated by that, this study investigates three distinct content features, sourced from WeNet, Whisper, and ContentVec, respectively. We explore their complementary roles in intelligibility, prosody, and conversion similarity for SVC. By integrating the multiple content features with a diffusion-based SVC model, our SVC system achieves superior conversion performance on both objective and subjective evaluation in comparison to a single source of content features. Our demo page and code are available at https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html.
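
As a rough illustration of how features from several pretrained acoustic models might be combined into a single conditioning signal, here is a hedged PyTorch sketch; the feature dimensions, the projection-and-sum fusion, and the class name are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class MultiContentFusion(nn.Module):
    """Projects content features from several pretrained acoustic models
    into a shared space and sums them into one conditioning sequence."""

    def __init__(self, dims=None, d_model=256):
        super().__init__()
        # Hypothetical feature dimensions for the three extractors.
        dims = dims or {"wenet": 512, "whisper": 1024, "contentvec": 768}
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, d_model) for name, d in dims.items()})

    def forward(self, feats):
        # feats: dict of (batch, frames, dim) tensors. In practice each
        # extractor runs at its own frame rate, so features must first be
        # resampled/aligned to a common rate.
        return sum(self.proj[name](x) for name, x in feats.items())


# Dummy frame-aligned features standing in for real extractor outputs.
cond = MultiContentFusion()({
    "wenet": torch.randn(1, 200, 512),
    "whisper": torch.randn(1, 200, 1024),
    "contentvec": torch.randn(1, 200, 768),
})
print(cond.shape)  # torch.Size([1, 200, 256]); conditions the diffusion model
```

Summation after projection is only one fusion choice; concatenation followed by a linear layer would be an equally plausible reading of "integrating the multiple content features".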

A High Fidelity and Low Complexity Neural Audio Coding

  • paper_url: http://arxiv.org/abs/2310.10992
  • repo_url: None
  • paper_authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu
  • for: Proposing an integrated framework for audio coding in real-time communication systems.
  • methods: A neural network models the wide-band components while traditional signal processing compresses the high-band components according to psychoacoustic knowledge; a perception-based loss function grounded in auditory perception theory improves harmonic modeling, and GAN-based compression is applied to a neural audio codec for the first time (a toy sketch of the high-band path follows the entry).
  • results: The method outperforms prior advanced neural audio codecs on both subjective and objective metrics and supports real-time inference on desktop and mobile devices.
    Abstract Audio coding is an essential module in real-time communication systems. Neural audio codecs can compress audio samples to a low bitrate thanks to the strong modeling and generative capabilities of deep neural networks. To address poor high-frequency reproduction as well as high computational cost and storage consumption, we propose an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychoacoustic knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. In addition, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile devices.
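
To illustrate the split-band idea, below is a toy numpy sketch of a traditional, SBR-style high-band path: the encoder keeps only a coarsely quantized sub-band energy envelope above the core band (the ear is comparatively insensitive to fine spectral structure there), and the decoder regenerates the high band by folding the core spectrum upward and reshaping it to that envelope. This is not the paper's coder; all band counts, bit allocations, and function names are illustrative.

```python
import numpy as np


def encode_highband(frame, n_fft=512, n_sub=8, bits=4):
    """Reduce the high band (bins above n_fft // 4, i.e. above 8 kHz at a
    32 kHz sampling rate) to n_sub coarsely quantized log-energies."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    hi = np.abs(spec[n_fft // 4 + 1:])                   # 128 high-band bins
    log_e = np.log10(np.mean(hi.reshape(n_sub, -1) ** 2, axis=1) + 1e-12)
    # Uniform scalar quantization over a -120..0 dB envelope range.
    q = np.round((log_e + 12.0) / 12.0 * (2 ** bits - 1))
    return np.clip(q, 0, 2 ** bits - 1).astype(np.uint8)  # 8 x 4 = 32 bits/frame


def decode_highband(q, core_spec, bits=4):
    """Regenerate high-band bins by folding the decoded core spectrum upward
    and reshaping it to the transmitted envelope (SBR-style)."""
    n_sub = len(q)
    log_e = q.astype(float) / (2 ** bits - 1) * 12.0 - 12.0
    source = core_spec[1:129].reshape(n_sub, -1)         # folded core band
    src_e = np.mean(np.abs(source) ** 2, axis=1) + 1e-12
    gain = np.sqrt(10.0 ** log_e / src_e)
    return (source * gain[:, None]).reshape(-1)          # 128 regenerated bins


frame = np.random.randn(512)                  # stand-in for one 16 ms frame
env = encode_highband(frame)                  # all that is sent for the high band
core = np.fft.rfft(frame * np.hanning(512), 512)[:129]  # neural codec's domain
hi_hat = decode_highband(env, core)
print(env, hi_hat.shape)                      # 8 envelope codes, (128,) bins
```

In the full system the core spectrum would come from the neural decoder rather than the original frame, and the envelope bits would share the bitstream with the neural codec's latent codes.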