cs.SD - 2023-12-05

Leveraging Laryngograph Data for Robust Voicing Detection in Speech

paper_url: http://arxiv.org/abs/2312.03129
repo_url: https://github.com/yixuanz/rvd
paper_authors: Yixuan Zhang, Heming Wang, DeLiang Wang
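for: Reliably detecting voiced intervals in speech signals, a critical step in pitch tracking that has numerous applications.
methods: A densely-connected convolutional recurrent neural network (DC-CRN) trained on data whose reference voicing decisions are extracted from laryngograph datasets. Pretraining is also investigated to improve generalization.
results: The model produces robust voicing detection results, outperforms other strong baseline methods, and generalizes well to unseen datasets.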

Abstract
Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. While conventional signal processing methods and deep learning algorithms have been proposed for this task, their need to fine-tune threshold parameters for different datasets and limited generalization restrict their utility in real-world applications. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely-connected convolutional recurrent neural network (DC-CRN), and trained on data with reference voicing decisions extracted from laryngograph data sets. Pretraining is also investigated to improve the generalization ability of the model. The proposed model produces robust voicing detection results, outperforming other strong baseline methods, and generalizes well to unseen datasets. The source code of the proposed model with pretraining is provided along with the list of used laryngograph datasets to facilitate further research in this area.

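Summary
Accurately detecting voiced intervals in speech signals is an important step with many applications. Conventional signal processing methods and deep learning algorithms have been proposed for this task, but they require threshold parameters to be fine-tuned for each dataset and generalize poorly. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely-connected convolutional recurrent neural network (DC-CRN) and is trained with reference voicing decisions extracted from laryngograph datasets. Pretraining is also investigated to improve generalization. The proposed model produces robust voicing detection results, outperforms other strong baselines, and generalizes well to unseen datasets. The source code of the model with pretraining and the list of laryngograph datasets used are provided to support further research.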
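The abstract notes that conventional detectors depend on threshold parameters that must be tuned per dataset. The following is a minimal sketch of such a threshold-based baseline, not the proposed DC-CRN. The function names, the autocorrelation criterion, and all default values (`threshold`, frame sizes, pitch range) are assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def voicing_by_autocorr(x, sr, threshold=0.5, frame_ms=32, hop_ms=10,
                        f0_min=60.0, f0_max=400.0):
    """Label a frame voiced when its normalized autocorrelation peak,
    searched over plausible pitch lags, exceeds `threshold` (hypothetical baseline)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    lag_min = int(sr / f0_max)
    lag_max = int(sr / f0_min)
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    frames = frames - frames.mean(axis=1, keepdims=True)
    decisions = np.zeros(len(frames), dtype=bool)
    for i, f in enumerate(frames):
        energy = np.dot(f, f)
        if energy < 1e-10:
            continue  # silent frame: leave it unvoiced
        # Keep non-negative lags 0 .. frame_len - 1 of the autocorrelation.
        ac = np.correlate(f, f, mode="full")[frame_len - 1:]
        peak = ac[lag_min:lag_max + 1].max() / energy
        decisions[i] = peak > threshold
    return decisions

def voicing_decision_error(pred, ref):
    """Fraction of frames whose voiced/unvoiced label differs from the reference."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    return float(np.mean(pred != ref))
```

The single `threshold` value is exactly the kind of parameter that has to be re-tuned when recording conditions change. In the paper's setting, the reference labels passed to a metric such as `voicing_decision_error` would come from laryngograph recordings rather than from a hand-set rule.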

Integrating Plug-and-Play Data Priors with Weighted Prediction Error for Speech Dereverberation

paper_url: http://arxiv.org/abs/2312.02773
repo_url: None
paper_authors: Ziye Yang, Wenxing Yang, Kai Xie, Jie Chen
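for: Improving the performance and robustness of speech dereverberation in complex and noisy environments.
methods: Combines physics-based and data-driven methods. Speech prior information learned from data is incorporated into the WPE framework during the iterations of the optimization problem, using the plug-and-play (PnP) strategy, specifically regularization by denoising (RED).
results: Experimental results validate the effectiveness of the proposed approach.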

Abstract
Speech dereverberation aims to alleviate the detrimental effects of late-reverberant components. While the weighted prediction error (WPE) method has shown superior performance in dereverberation, there is still room for further improvement in terms of performance and robustness in complex and noisy environments. Recent research has highlighted the effectiveness of integrating physics-based and data-driven methods, enhancing the performance of various signal processing tasks while maintaining interpretability. Motivated by these advancements, this paper presents a novel dereverberation framework, which incorporates data-driven methods for capturing speech priors within the WPE framework. The plug-and-play strategy (PnP), specifically the regularization by denoising (RED) strategy, is utilized to incorporate speech prior information learnt from data during the optimization problem solving iterations. Experimental results validate the effectiveness of the proposed approach.

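Summary
Speech dereverberation aims to alleviate the detrimental effects of late reverberation. Although the weighted prediction error (WPE) method performs well, there is still room to improve its performance and robustness in complex and noisy environments. Recent research shows that combining physics-based and data-driven methods can improve a range of signal processing tasks while preserving interpretability. This paper proposes a new dereverberation framework that captures speech priors with data-driven methods inside the WPE framework. The plug-and-play (PnP) strategy, specifically regularization by denoising (RED), is used to incorporate speech prior information learned from data during the iterations of the optimization problem. Experimental results validate the effectiveness of the approach.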
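For orientation, here is a minimal sketch of the plain WPE iteration that the paper builds on, written for one STFT frequency bin of a single channel. It does not include the PnP/RED speech prior, whose details are not given in the abstract. The function name `wpe_one_bin`, the parameter defaults, and the temporal smoothing of the variance estimate (`context`) are assumptions of this sketch.

```python
import numpy as np

def wpe_one_bin(y, taps=4, delay=3, iterations=3, context=2, eps=1e-8):
    """Dereverberate one STFT frequency bin of one channel with WPE.

    y: complex array of shape (T,), the observed bin over T frames.
    Returns the estimated desired signal, shape (T,).
    """
    y = np.asarray(y, dtype=complex)
    T = len(y)
    # Column k holds y delayed by (delay + k) frames. The delay keeps the
    # direct sound and early reflections from being predicted away.
    A = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            A[shift:, k] = y[:T - shift]
    kernel = np.ones(2 * context + 1) / (2 * context + 1)
    x = y.copy()
    for _ in range(iterations):
        # Time-varying variance of the desired signal, smoothed over
        # neighbouring frames (the smoothing is an assumption of this sketch).
        lam = np.maximum(np.convolve(np.abs(x) ** 2, kernel, mode="same"), eps)
        Aw = A.conj().T / lam          # A^H W with W = diag(1 / lam)
        R = Aw @ A                     # weighted correlation matrix
        p = Aw @ y                     # weighted correlation vector
        R += 1e-10 * np.trace(R).real / taps * np.eye(taps)  # diagonal loading
        c = np.linalg.solve(R, p)      # prediction filter (weighted least squares)
        x = y - A @ c                  # subtract the predicted late reverberation
    return x
```

Each pass alternates between estimating the variance of the desired signal and solving a weighted least-squares problem for the prediction filter. According to the abstract, the learned speech prior enters during these optimization iterations.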

Distributed Speech Dereverberation Using Weighted Prediction Error

paper_url: http://arxiv.org/abs/2312.03034
repo_url: None
paper_authors: Ziye Yang, Mengfei Zhang, Jie Chen
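for: Reducing the negative impact of late reverberant reflections when microphone nodes are dispersed, while keeping the computational complexity at each node low.
methods: Uses the distributed adaptive node-specific signal estimation (DANSE) algorithm within the multichannel linear prediction (MCLP) process. Each node performs only local operations and reaches the global performance through inter-node cooperation.
results: Experimental results show that the proposed method achieves efficient speech dereverberation in dispersed microphone node scenarios.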

Abstract
Speech dereverberation aims to alleviate the negative impact of late reverberant reflections. The weighted prediction error (WPE) method is a well-established technique known for its superior performance in dereverberation. However, in scenarios where microphone nodes are dispersed, the centralized approach of the WPE method requires aggregating all observations for inverse filtering, resulting in a significant computational burden. This paper introduces a distributed speech dereverberation method that emphasizes low computational complexity at each node. Specifically, we leverage the distributed adaptive node-specific signal estimation (DANSE) algorithm within the multichannel linear prediction (MCLP) process. This approach empowers each node to perform local operations with reduced complexity while achieving the global performance through inter-node cooperation. Experimental results validate the effectiveness of our proposed method, showcasing its ability to achieve efficient speech dereverberation in dispersed microphone node scenarios.

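To make the computational burden of the centralized approach concrete, the standard MCLP/WPE estimate can be written as follows. The notation is assumed here and not taken from the paper: $y_m(t,f)$ is the STFT of microphone $m$, $\Delta$ the prediction delay, $K$ the number of taps per microphone, $M$ the total number of microphones across all nodes, and $\lambda(t,f)$ the time-varying variance of the desired signal.

```latex
% Assumed notation; microphone 1 is taken as the reference channel.
\begin{align}
\tilde{\mathbf{y}}(t,f) &= \big[\, y_1(t-\Delta,f), \dots, y_1(t-\Delta-K+1,f), \dots,
    y_M(t-\Delta,f), \dots, y_M(t-\Delta-K+1,f) \,\big]^{\mathsf T} \in \mathbb{C}^{MK}, \\
\hat{d}(t,f) &= y_1(t,f) - \mathbf{g}(f)^{\mathsf H}\, \tilde{\mathbf{y}}(t,f), \\
\mathbf{g}(f) &= \mathbf{R}(f)^{-1} \mathbf{p}(f), \qquad
\mathbf{R}(f) = \sum_t \frac{\tilde{\mathbf{y}}(t,f)\, \tilde{\mathbf{y}}(t,f)^{\mathsf H}}{\lambda(t,f)}, \qquad
\mathbf{p}(f) = \sum_t \frac{\tilde{\mathbf{y}}(t,f)\, y_1^{*}(t,f)}{\lambda(t,f)}.
\end{align}
```

Because $\mathbf{R}(f)$ is an $MK \times MK$ matrix, solving for $\mathbf{g}(f)$ costs on the order of $(MK)^3$ operations per frequency bin and iteration, and every node's observations must first be gathered in one place. The distributed method instead has each node work with a smaller local problem. How DANSE is embedded in this process is described in the paper and is not reproduced here.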

Auralization based on multi-perspective ambisonic room impulse responses

paper_url: http://arxiv.org/abs/2312.02581
repo_url: None
paper_authors: Kaspar Müller, Franz Zotter
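for: Auralization for a variable listener perspective, as an alternative to real-time updated room acoustic simulations.
methods: Interpolates Ambisonic room impulse responses (ARIRs) available at a grid of receiver perspectives. A triplet of neighboring ARIRs is extrapolated to the listener perspective and then linearly interpolated. The extrapolation localizes sound events using time differences of arrival and, for the residuals, direction-of-arrival detection.
results: Suitable parameter settings for the interpolated rendering are found in a listening experiment using measured and simulated ARIR data sets, under static and time-varying conditions.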

Abstract
Most often, virtual acoustic rendering employs real-time updated room acoustic simulations to accomplish auralization for a variable listener perspective. As an alternative, we propose and test a technique to interpolate room impulse responses, specifically Ambisonic room impulse responses (ARIRs) available at a grid of spatially distributed receiver perspectives, measured or simulated in a desired acoustic environment. In particular, we extrapolate a triplet of neighboring ARIRs to the variable listener perspective, preceding their linear interpolation. The extrapolation is achieved by decomposing each ARIR into localized sound events and re-assigning their direction, time, and level to what could be observed at the listener perspective, with as much temporal, directional, and perspective context as possible. We propose to undertake this decomposition in two levels: Peaks in the early ARIRs are decomposed into jointly localized sound events, based on time differences of arrival observed in either an ARIR triplet, or all ARIRs observing the direct sound. Sound events that could not be jointly localized are treated as residuals whose less precise localization utilizes direction-of-arrival detection and the estimated time of arrival. For the interpolated rendering, suitable parameter settings are found by evaluating the proposed method in a listening experiment, using both measured and simulated ARIR data sets, under static and time-varying conditions.

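Summary
Virtual acoustic rendering usually relies on room acoustic simulations updated in real time to auralize a variable listener perspective. As an alternative, the authors propose and test a technique that interpolates room impulse responses, specifically Ambisonic room impulse responses (ARIRs) measured or simulated at a grid of spatially distributed receiver perspectives in the desired acoustic environment. A triplet of neighboring ARIRs is extrapolated to the variable listener perspective before being linearly interpolated. The extrapolation decomposes each ARIR into localized sound events and re-assigns their direction, time, and level to what could be observed at the listener perspective, preserving as much temporal, directional, and perspective context as possible. The decomposition works on two levels. Peaks in the early ARIRs are decomposed into jointly localized sound events, based on time differences of arrival observed either in an ARIR triplet or in all ARIRs that observe the direct sound. Sound events that cannot be jointly localized are treated as residuals, whose less precise localization uses direction-of-arrival detection and the estimated time of arrival. Suitable parameter settings for the interpolated rendering are found by evaluating the method in a listening experiment with measured and simulated ARIR data sets, under static and time-varying conditions.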
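The final linear interpolation step over an ARIR triplet can be sketched as follows. This is only an illustration: it omits the extrapolation of each ARIR to the listener perspective that the paper performs first, and the use of barycentric weights in a 2-D plane, as well as the function names, are assumptions not stated in the abstract.

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Weights (wa, wb, wc) with wa + wb + wc = 1 and p = wa*a + wb*b + wc*c (2-D points)."""
    a, b, c, p = (np.asarray(v, dtype=float) for v in (a, b, c, p))
    T = np.column_stack((a - c, b - c))
    wa, wb = np.linalg.solve(T, p - c)
    return np.array([wa, wb, 1.0 - wa - wb])

def interpolate_triplet(p, positions, arirs):
    """Linearly interpolate three ARIRs at listener position p.

    positions: three 2-D receiver positions forming a triangle.
    arirs: three arrays of shape (channels, samples), one per receiver.
    """
    w = barycentric_weights(p, *positions)
    # Weighted sum over the receiver axis, giving shape (channels, samples).
    return np.tensordot(w, np.asarray(arirs), axes=1)
```

When the listener stands exactly on one receiver position, the weights reduce to 1 for that receiver and 0 for the other two, so its ARIR is returned unchanged. Interpolating raw ARIRs in this way would smear arrival times between perspectives, which is the motivation the abstract gives for extrapolating each ARIR to the listener perspective beforehand.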