cs.SD - 2023-10-25

Improved Panning on Non-Equidistant Loudspeakers with Direct Sound Level Compensation

  • paper_url: http://arxiv.org/abs/2310.17004
  • repo_url: None
  • paper_authors: Jan-Hendrik Hanschke, Daniel Arteaga, Giulio Cengarle, Joshua Lando, Mark R. P. Thomas, Alan Seefeldt
  • for: To achieve correct panning on non-equidistant loudspeaker layouts, where the standard delay-and-loudness calibration can shift the phantom image when one loudspeaker of a pair is much closer to the listener.
  • methods: A novel panning compensation in which the panning position is governed by the direct sound while the perceived loudness is governed by the full impulse response (a hedged code sketch follows the abstract below).
  • results: Subjective listening tests validate the approach; in a setup where standard calibration leads to an average error of 10 degrees, the proposed direct sound compensation largely returns the phantom source to its intended position.
    Abstract Loudspeaker rendering techniques that create phantom sound sources often assume an equidistant loudspeaker layout. Typical home setups might not fulfill this condition as loudspeakers deviate from canonical positions, thus requiring a corresponding calibration. The standard approach is to compensate for delays and to match the loudness of each loudspeaker at the listener's location. It was found that a shift of the phantom image occurs when this calibration procedure is applied and one of a pair of loudspeakers is significantly closer to the listener than the other. In this paper, a novel approach to panning on non-equidistant loudspeaker layouts is presented whereby the panning position is governed by the direct sound and the perceived loudness is governed by the full impulse response. Subjective listening tests are presented that validate the approach and quantify the perceived effect of the compensation. In a setup where the standard calibration leads to an average error of 10 degrees, the proposed direct sound compensation largely returns the phantom source to its intended position.
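The abstract does not give the compensation formulas, so the following is a minimal numerical sketch of the stated idea only: the relative panning gains of a stereo pair are set against the direct-sound level of each measured impulse response, while the overall level is normalized against the full impulse response. The tangent panning law, the 5 ms direct-sound window, and the RMS level proxy are assumptions introduced for this illustration, not details from the paper.

```python
"""Illustrative sketch (not the authors' code) of direct-sound-governed panning
with full-impulse-response loudness normalization."""
import numpy as np

def direct_and_full_levels(ir, fs, direct_ms=5.0):
    """Split a room impulse response into direct-sound and full-response levels."""
    onset = int(np.argmax(np.abs(ir)))                # crude direct-sound onset
    direct = ir[onset:onset + int(direct_ms * 1e-3 * fs)]
    direct_level = np.sqrt(np.mean(direct ** 2))      # RMS as a simple level proxy
    full_level = np.sqrt(np.mean(ir ** 2))
    return direct_level, full_level

def panning_gains(ir_left, ir_right, fs, pan_angle_deg, base_angle_deg=30.0):
    """Stereo-pair gains: pan ratio from the direct sound, loudness from the full IR."""
    # Tangent law: positive angle pans toward the left loudspeaker.
    t = np.tan(np.radians(pan_angle_deg)) / np.tan(np.radians(base_angle_deg))
    gl_rel, gr_rel = (1.0 + t), (1.0 - t)

    dl, fl = direct_and_full_levels(ir_left, fs)
    dr, fr = direct_and_full_levels(ir_right, fs)

    # Divide by the direct-sound level so the direct sound reaching the listener
    # carries the intended panning ratio.
    gl, gr = gl_rel / dl, gr_rel / dr

    # Normalize the overall loudness using the full impulse responses.
    loudness = np.sqrt((gl * fl) ** 2 + (gr * fr) ** 2)
    return gl / loudness, gr / loudness

# Toy usage with synthetic impulse responses (left speaker closer, stronger direct path).
fs = 48000
rng = np.random.default_rng(0)
ir_l = np.zeros(fs // 2); ir_l[0] = 1.0;  ir_l[200:] += 0.02 * rng.standard_normal(fs // 2 - 200)
ir_r = np.zeros(fs // 2); ir_r[90] = 0.6; ir_r[300:] += 0.02 * rng.standard_normal(fs // 2 - 300)
print(panning_gains(ir_l, ir_r, fs, pan_angle_deg=10.0))
```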

Dynamic Processing Neural Network Architecture For Hearing Loss Compensation

  • paper_url: http://arxiv.org/abs/2310.16550
  • repo_url: None
  • paper_authors: Szymon Drgas, Lars Bramsløw, Archontis Politis, Gaurav Naithani, Tuomas Virtanen
  • for: To improve speech intelligibility for listeners with sensorineural hearing loss.
  • methods: Neural networks trained through a hearing loss model to compensate the impairment, including an interpretable architecture called the dynamic processing network, whose structure resembles a band-wise dynamic compressor (a hedged code sketch follows the abstract below).
  • results: Evaluated with STOI and HASPI, the dynamic processing network gives a significant improvement over the Camfit compressive gain prescription rule; a sufficiently large convolutional network can outperform the interpretable model at a higher computational cost, and combining the two gives the best results.
    Abstract This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar to band-wise dynamic compressor. The network is differentiable, and therefore allows to learn its parameters to maximize speech intelligibility. More generic models based on convolutional layers were tested as well. The performance of the tested architectures was assessed using spectro-temporal objective index (STOI) with hearing-threshold noise and hearing aid speech intelligibility (HASPI) metrics. The dynamic processing network gave a significant improvement of STOI and HASPI in comparison to popular compressive gain prescription rule Camfit. A large enough convolutional network could outperform the interpretable model with the cost of larger computational load. Finally, a combination of the dynamic processing network with convolutional neural network gave the best results in terms of STOI and HASPI.
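The dynamic processing network is described as having a structure similar to a band-wise dynamic compressor, with parameters learned through a differentiable hearing loss model. The sketch below shows only the generic forward pass of such a band-wise compressor in NumPy; the band edges, time constants, and per-band threshold/ratio/gain values are placeholders, and a differentiable deep-learning-framework implementation would be needed to actually learn them as in the paper.

```python
"""Illustrative sketch (not the authors' network): forward pass of a band-wise
dynamic range compressor, the structure the dynamic processing network resembles."""
import numpy as np

def band_split(x, fs, edges_hz):
    """Naive FFT-domain split into len(edges_hz)-1 bands (illustrative only)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    bands = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(X * mask, n=len(x)))
    return bands

def compress_band(x, fs, threshold_db, ratio, makeup_db, attack_ms=5.0, release_ms=50.0):
    """One-pole envelope follower + static compression curve + makeup gain."""
    att = np.exp(-1.0 / (attack_ms * 1e-3 * fs))
    rel = np.exp(-1.0 / (release_ms * 1e-3 * fs))
    env = np.zeros_like(x)
    e = 1e-8
    for n, v in enumerate(np.abs(x)):
        coeff = att if v > e else rel        # faster tracking on rising levels
        e = coeff * e + (1.0 - coeff) * max(v, 1e-8)
        env[n] = e
    level_db = 20.0 * np.log10(env)
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio) + makeup_db
    return x * 10.0 ** (gain_db / 20.0)

def dynamic_processing(x, fs, edges_hz, thresholds_db, ratios, makeups_db):
    """Apply a compressor in each band and sum the processed bands."""
    out = np.zeros_like(x)
    for b, th, r, mk in zip(band_split(x, fs, edges_hz), thresholds_db, ratios, makeups_db):
        out += compress_band(b, fs, th, r, mk)
    return out

# Toy usage: four bands with placeholder parameters (learned in the paper).
fs = 16000
x = 0.1 * np.random.default_rng(1).standard_normal(fs)
y = dynamic_processing(x, fs,
                       edges_hz=[0, 500, 1000, 2000, 8000],
                       thresholds_db=[-40, -35, -30, -30],
                       ratios=[2.0, 2.5, 3.0, 3.0],
                       makeups_db=[5.0, 8.0, 10.0, 12.0])
```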

A Novel Approach for Object Based Audio Broadcasting

  • paper_url: http://arxiv.org/abs/2310.16481
  • repo_url: None
  • paper_authors: Mohammad Reza Hasanabadi
  • for: To deliver personalized, customizable audio experiences across platforms such as broadcasting, streaming, and cinema sound.
  • methods: A novel production-side method, Sample-by-Sample Object Based Audio (SSOBA) embedding, which places audio object samples so that listeners can individualize their chosen audio sources according to their interests and needs, while remaining compliant with legacy players and requiring no additional hardware in the broadcasting chain (a hedged code sketch follows the abstract below).
  • results: Input audio objects, number of output channels, and sampling rate are the main factors governing SSOBA performance and whether it is lossless or lossy. MUSHRA tests after encoding show good quality with up to five objects, SNR measurements after decoding and interpolation show successful recovery and separation of the objects, and a minimum sampling rate of 96 kHz is indicated to encode up to five objects in a stereo-mode channel.
    Abstract Object Based Audio (OBA) provides a new kind of audio experience, delivered to the audience to personalize and customize their experience of listening and to give them choice of what and how to hear their audio content. OBA can be applied to different platforms such as broadcasting, streaming and cinema sound. This paper presents a novel approach for creating object-based audio on the production side. The approach here presents Sample-by-Sample Object Based Audio (SSOBA) embedding. SSOBA places audio object samples in such a way that allows audiences to easily individualize their chosen audio sources according to their interests and needs. SSOBA is an extra service and not an alternative, so it is also compliant with legacy audio players. The biggest advantage of SSOBA is that it does not require any special additional hardware in the broadcasting chain and it is therefore easy to implement and equip legacy players and decoders with enhanced ability. Input audio objects, number of output channels and sampling rates are three important factors affecting SSOBA performance and specifying it to be lossless or lossy. SSOBA adopts interpolation at the decoder side to compensate for eliminated samples. Both subjective and objective experiments are carried out to evaluate the output results at each step. MUSHRA subjective experiments conducted after the encoding step shows good-quality performance of SSOBA with up to five objects. SNR measurements and objective experiments, performed after decoding and interpolation, show significant successful recovery and separation of audio objects. Experimental results show that a minimum sampling rate of 96 kHz is indicated to encode up to five objects in a Stereo-mode channel to acquire good subjective and objective results simultaneously.
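The abstract does not specify the exact embedding layout, so the following is one possible reading, offered only as a hedged sketch: with N objects sharing a container stream at the original sampling rate, each object keeps every N-th sample (the rest are eliminated) and the kept samples are interleaved round-robin; the decoder extracts an object's slots and linearly interpolates the eliminated samples. The round-robin layout and linear interpolation are assumptions of this example; they are merely consistent with the abstract's finding that a 96 kHz container is needed for up to five objects.

```python
"""Illustrative sketch of one possible reading of SSOBA (not the paper's
reference implementation): sample-by-sample multiplexing of N objects into a
single stream, with decoder-side interpolation of the eliminated samples."""
import numpy as np

def ssoba_encode(objects):
    """Keep every N-th sample of each of the N objects and interleave them so the
    multiplexed stream has the same length (and rate) as each input object."""
    objects = np.asarray(objects)                      # shape (n_objects, n_samples)
    n_objects, n_samples = objects.shape
    stream = np.empty(n_samples, dtype=objects.dtype)
    for k in range(n_objects):
        stream[k::n_objects] = objects[k, k::n_objects]  # other samples are eliminated
    return stream

def ssoba_decode(stream, n_objects, which):
    """Recover one object: take its slots and linearly interpolate the eliminated samples."""
    kept_positions = np.arange(which, len(stream), n_objects)
    kept = stream[kept_positions]
    full_positions = np.arange(len(stream))
    return np.interp(full_positions, kept_positions, kept)

# Toy usage: three sine "objects" multiplexed into one 96 kHz stream, one recovered.
fs = 96000
t = np.arange(fs) / fs
objs = [np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 880.0)]
stream = ssoba_encode(objs)
obj1_hat = ssoba_decode(stream, n_objects=3, which=1)
```

Under this reading, each object effectively retains fs/N samples per second, which is why a high container rate such as 96 kHz would be needed to keep up to five objects at acceptable quality.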

Towards Streaming Speech-to-Avatar Synthesis

  • paper_url: http://arxiv.org/abs/2310.16287
  • repo_url: None
  • paper_authors: Tejas S. Prabhune, Peter Wu, Bohan Yu, Gopala K. Anumanchipalli
  • for: To animate an avatar from speech in real time, enabling visualization of sound in linguistics, phonetics, and phonology, visual feedback for second language acquisition, and virtual embodiment for paralyzed patients.
  • methods: Deep articulatory inversion for high-quality facial and inner-mouth avatar animation, driven by streaming real-time audio rather than offline recordings (a hedged code sketch follows the abstract below).
  • results: The system achieves 130 ms average streaming latency for every 0.1 seconds of audio, with a 0.792 correlation against ground-truth articulations; generated mouth and tongue animations demonstrate the method's efficacy.
    Abstract Streaming speech-to-avatar synthesis creates real-time animations for a virtual character from audio data. Accurate avatar representations of speech are important for the visualization of sound in linguistics, phonetics, and phonology, visual feedback to assist second language acquisition, and virtual embodiment for paralyzed patients. Previous works have highlighted the capability of deep articulatory inversion to perform high-quality avatar animation using electromagnetic articulography (EMA) features. However, these models focus on offline avatar synthesis with recordings rather than real-time audio, which is necessary for live avatar visualization or embodiment. To address this issue, we propose a method using articulatory inversion for streaming high quality facial and inner-mouth avatar animation from real-time audio. Our approach achieves 130ms average streaming latency for every 0.1 seconds of audio with a 0.792 correlation with ground truth articulations. Finally, we show generated mouth and tongue animations to demonstrate the efficacy of our methodology.
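To make the streaming requirement concrete, the sketch below shows a minimal 0.1-second chunked processing loop of the kind such a system needs. `inversion_model`, `render_avatar`, the 12 articulator dimensions, and the 10 ms frame rate are placeholders standing in for the paper's acoustic-to-articulatory model and animation stage, not the authors' implementation.

```python
"""Illustrative sketch (not the authors' system): a minimal streaming loop that
pushes 0.1 s audio chunks through an articulatory-inversion model and feeds the
resulting EMA-style trajectories to an avatar renderer."""
import time
import numpy as np

FS = 16000
CHUNK_SEC = 0.1                        # the abstract reports latency per 0.1 s of audio
CHUNK = int(FS * CHUNK_SEC)
N_ARTICULATORS = 12                    # e.g. x/y positions of 6 EMA sensors (assumed)

def inversion_model(audio_chunk, state=None):
    """Placeholder acoustic-to-articulatory inversion; returns a trajectory of
    articulator positions for the chunk (zeros here, for illustration only)."""
    frames = max(1, len(audio_chunk) // 160)           # 10 ms frames (assumed)
    return np.zeros((frames, N_ARTICULATORS)), state

def render_avatar(articulator_frames):
    """Placeholder for driving the facial and inner-mouth avatar from EMA features."""
    pass

def stream_loop(audio_source):
    """Consume audio in 0.1 s chunks, invert, render, and report per-chunk latency."""
    state = None
    for chunk in audio_source:
        t0 = time.perf_counter()
        ema, state = inversion_model(chunk, state)
        render_avatar(ema)
        yield ema, 1e3 * (time.perf_counter() - t0)

# Toy usage: ten chunks of silence stand in for a microphone stream.
mic = (np.zeros(CHUNK, dtype=np.float32) for _ in range(10))
for ema, latency_ms in stream_loop(mic):
    print(f"{ema.shape[0]} frames, {latency_ms:.1f} ms processing latency")
```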