cs.SD - 2023-08-07

Active Noise Control based on the Momentum Multichannel Normalized Filtered-x Least Mean Square Algorithm

paper_url: http://arxiv.org/abs/2308.03684
repo_url: None
paper_authors: Dongyuan Shi, Woon-Seng Gan, Bhan Lam, Shulin Wen, Xiaoyi Shen
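for: Achieving a large noise-cancellation area with multichannel active noise control (MCANC) in complicated acoustic fields.
methods: The filter-x least mean square (FxLMS) algorithm is the usual baseline, but its slow convergence hurts performance on quickly varying noise, and noise power variation undermines its robustness when the step size is fixed. The paper integrates the normalized multichannel FxLMS with the momentum method.
results: Combining normalization with momentum accelerates convergence and avoids interference from the primary noise power. The algorithm was deployed in a multichannel noise control window to control real machine noise.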

Abstract
Multichannel active noise control (MCANC) is widely utilized to achieve significant noise cancellation area in the complicated acoustic field. Meanwhile, the filter-x least mean square (FxLMS) algorithm gradually becomes the benchmark solution for the implementation of MCANC due to its low computational complexity. However, its slow convergence speed more or less undermines the performance of dealing with quickly varying disturbances, such as piling noise. Furthermore, the noise power variation also deteriorates the robustness of the algorithm when it adopts the fixed step size. To solve these issues, we integrated the normalized multichannel FxLMS with the momentum method, which hence, effectively avoids the interference of the primary noise power and accelerates the convergence of the algorithm. To validate its effectiveness, we deployed this algorithm in a multichannel noise control window to control the real machine noise.

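Summary
Multichannel active noise control (MCANC) is widely used to achieve a sizeable noise-cancellation area in complicated acoustic fields. The filter-x least mean square (FxLMS) algorithm has become the standard solution for implementing MCANC because of its low computational complexity. However, its slow convergence degrades performance against quickly varying disturbances, and noise power variation reduces the algorithm's robustness, particularly with a fixed step size. To address these problems, the authors combine the normalized multichannel FxLMS with the momentum method, which avoids interference from the primary noise power and accelerates convergence. To validate the approach, they applied the algorithm in a multichannel noise control window to control real machine noise.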
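
The idea of combining a normalized step size with a momentum term can be illustrated with a minimal single-channel sketch. This is not the paper's multichannel algorithm: the update form, the function name `momentum_nfxlms`, and the parameters `mu`, `beta`, `eps`, and `s_hat` (an estimate of the secondary path) are all assumptions chosen for illustration.

```python
import numpy as np

def momentum_nfxlms(x, d, s, s_hat, n_taps=16, mu=0.1, beta=0.5, eps=1e-6):
    """Single-channel momentum normalized FxLMS (illustrative sketch only).

    x: reference signal, d: disturbance at the error microphone,
    s: true secondary path (simulation only), s_hat: its estimate.
    """
    w = np.zeros(n_taps)            # control filter coefficients
    v = np.zeros(n_taps)            # momentum term (smoothed update)
    x_buf = np.zeros(n_taps)        # reference history for the control filter
    xs_buf = np.zeros(len(s_hat))   # reference history for the s_hat filter
    fx_buf = np.zeros(n_taps)       # filtered-reference history
    y_buf = np.zeros(len(s))        # control-signal history for the true path
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1); x_buf[0] = x[n]
        xs_buf = np.roll(xs_buf, 1); xs_buf[0] = x[n]
        y = w @ x_buf                            # anti-noise sample
        y_buf = np.roll(y_buf, 1); y_buf[0] = y
        e[n] = d[n] - s @ y_buf                  # residual at the error mic
        fx_buf = np.roll(fx_buf, 1); fx_buf[0] = s_hat @ xs_buf
        # Normalizing by the filtered-reference power makes the step
        # insensitive to changes in the noise level.
        grad = e[n] * fx_buf / (fx_buf @ fx_buf + eps)
        v = beta * v + mu * grad                 # momentum accumulation
        w = w + v
    return w, e
```

With `beta = 0` the sketch reduces to an ordinary normalized FxLMS update, which makes it easy to compare convergence with and without momentum on the same signals.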

AudioVMAF: Audio Quality Prediction with VMAF

paper_url: http://arxiv.org/abs/2308.03437
repo_url: None
paper_authors: Arijit Biswas, Harald Mundt
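for: Improving the accuracy of coded audio quality prediction.
methods: Adds an auditory-inspired frontend to existing VMAF that creates videos of reference and coded spectrograms, and extends VMAF to measure coded audio quality.
results: The proposed AudioVMAF system predicts quality more accurately, especially when band-limited anchors are present. It outperforms all existing visual quality features repurposed for audio, and improves on the dedicated audio quality metric ViSQOL-v3 by 7.8% and 2.0% in Pearson and Spearman rank correlation coefficient, respectively.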

Abstract
Video Multimethod Assessment Fusion (VMAF) [1], [2], [3] is a popular tool in the industry for measuring coded video quality. In this study, we propose an auditory-inspired frontend in existing VMAF for creating videos of reference and coded spectrograms, and extended VMAF for measuring coded audio quality. We name our system AudioVMAF. We demonstrate that image replication is capable of further enhancing prediction accuracy, especially when band-limited anchors are present. The proposed method significantly outperforms all existing visual quality features repurposed for audio, and even demonstrates a significant overall improvement of 7.8% and 2.0% of Pearson and Spearman rank correlation coefficient, respectively, over a dedicated audio quality metric (ViSQOL-v3 [4]) also inspired from the image domain.

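Summary
Video Multimethod Assessment Fusion (VMAF) is a widely used industry tool for measuring coded video quality. This study proposes an auditory-inspired frontend for existing VMAF that creates videos of reference and coded spectrograms, and extends VMAF to measure coded audio quality. The system is named AudioVMAF. Image replication is shown to further improve prediction accuracy, especially when band-limited anchors are present. The method outperforms all existing visual quality features repurposed for audio, and shows a significant overall improvement of 7.8% and 2.0% in Pearson and Spearman rank correlation coefficient, respectively, over ViSQOL-v3 [4].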
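
The gains above are reported as Pearson and Spearman rank correlation coefficients between predicted and subjective quality scores. As a reminder of what those two numbers measure, here is a small standard-library sketch; the function names are illustrative and nothing here is taken from the AudioVMAF implementation.

```python
import math

def pearson(a, b):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))
```

Pearson rewards a linear relationship between predictions and listener scores, while Spearman only checks that the ordering agrees, so a metric can score well on one and less well on the other.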

Improving Deep Attractor Network by BGRU and GMM for Speech Separation

paper_url: http://arxiv.org/abs/2308.03332
repo_url: None
paper_authors: Rawad Melhem, Assef Jafar, Riad Hamadeh
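for: Proposing a simplified yet powerful DANet model for speech separation.
methods: The model uses a Bidirectional Gated neural network (BGRU) in place of Bidirectional Long Short-Term Memory (BLSTM), and a Gaussian Mixture Model (GMM) rather than k-means as the clustering algorithm, to reduce complexity and increase learning speed and accuracy.
results: On two-speaker mixtures prepared from the TIMIT corpus, the proposed model achieves an SDR of 12.3 dB and a PESQ score of 2.94, both better than the original DANet model. The number of parameters and the training time improve by 20.7% and 17.9%, respectively. On mixed Arabic speech signals the results were better than on English.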

Abstract
Deep Attractor Network (DANet) is the state-of-the-art technique in speech separation field, which uses Bidirectional Long Short-Term Memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified and powerful DANet model is proposed using Bidirectional Gated neural network (BGRU) instead of BLSTM. The Gaussian Mixture Model (GMM) other than the k-means was applied in DANet as a clustering algorithm to reduce the complexity and increase the learning speed and accuracy. The metrics used in this paper are Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR), and Perceptual Evaluation Speech Quality (PESQ) score. Two speaker mixture datasets from TIMIT corpus were prepared to evaluate the proposed model, and the system achieved 12.3 dB and 2.94 for SDR and PESQ scores respectively, which were better than the original DANet model. Other improvements were 20.7% and 17.9% in the number of parameters and time training, respectively. The model was applied on mixed Arabic speech signals and the results were better than that in English.

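Summary
Deep Attractor Network (DANet) is a state-of-the-art technique in speech separation that uses Bidirectional Long Short-Term Memory (BLSTM), but the model's complexity is very high. This paper proposes a simplified DANet model that uses a Bidirectional Gated neural network (BGRU) instead of BLSTM, with a Gaussian Mixture Model (GMM) as the clustering algorithm to reduce complexity and increase learning speed and accuracy. The metrics used are Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), Signal to Artifact Ratio (SAR), and the Perceptual Evaluation Speech Quality (PESQ) score. Evaluated on two-speaker mixture datasets from the TIMIT corpus, the proposed model reaches 12.3 dB SDR and a PESQ score of 2.94, better than the original DANet model. The number of parameters and the training time improve by 20.7% and 17.9%, respectively. Applied to mixed Arabic speech signals, the model gave better results than on English.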
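
For orientation on the SDR figure quoted above, the following sketch computes a plain energy-ratio SDR in decibels. This is a simplification and an assumption on my part: separation papers often use the BSS Eval decomposition, which also yields SIR and SAR, and the source does not say which variant was used. The function name `sdr_db` is hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Energy-ratio SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)
```

Under this definition, every 10 dB of SDR corresponds to a tenfold reduction in residual error energy relative to the clean source.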

SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

paper_url: http://arxiv.org/abs/2308.03266
repo_url: https://github.com/r1ckshi/seaco-paraformer
paper_authors: Xian Shi, Yexin Yang, Zerui Li, Shiliang Zhang
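for: Improving the flexibility and effectiveness of hotword customization in ASR systems.
methods: Proposes Semantic-augmented Contextual-Paraformer (SeACo-Paraformer), a NAR-based ASR system that combines the accuracy of the AED-based model, the efficiency of the NAR model, and strong contextualization performance.
results: In experiments on 50,000 hours of industrial big data, the proposed model outperforms strong baselines in both customization and general ASR tasks. The paper also explores an efficient way to filter large-scale incoming hotwords. The source code, the industrial models, and two hotword test sets have been released.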

Abstract
Hotword customization is one of the important issues remained in ASR field - it is of value to enable users of ASR systems to customize names of entities, persons and other phrases. The past few years have seen both implicit and explicit modeling strategies for ASR contextualization developed. While these approaches have performed adequately, they still exhibit certain shortcomings such as instability in effectiveness. In this paper we propose Semantic-augmented Contextual-Paraformer (SeACo-Paraformer) a novel NAR based ASR system with flexible and effective hotword customization ability. It combines the accuracy of the AED-based model, the efficiency of the NAR model, and the excellent performance in contextualization. In 50,000 hours industrial big data experiments, our proposed model outperforms strong baselines in customization and general ASR tasks. Besides, we explore an efficient way to filter large scale incoming hotwords for further improvement. The source codes and industrial models proposed and compared are all opened as well as two hotword test sets.

Summary
Hotword customization is an important open issue in the ASR field: it is valuable to let users of ASR systems customize names of entities, persons, and other phrases. Both implicit and explicit modeling strategies for ASR contextualization have been developed recently, but they still have shortcomings such as instability in effectiveness. This paper proposes a novel NAR-based ASR system with flexible and effective hotword customization ability, called Semantic-augmented Contextual-Paraformer (SeACo-Paraformer). It combines the accuracy of the AED-based model, the efficiency of the NAR model, and excellent performance in contextualization. In experiments on 50,000 hours of industrial big data, the proposed model outperforms strong baselines in customization and general ASR tasks. The authors also explore an efficient way to filter large-scale incoming hotwords for further improvement. The source code and the industrial models proposed and compared are all open, as are two hotword test sets.

Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

paper_url: http://arxiv.org/abs/2308.03226
repo_url: None
paper_authors: Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku
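for: Studying the automatic classification of voice quality (breathy, modal, and pressed) from simultaneously recorded speech and neck surface accelerometer (NSA) signals.
methods: Features are derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT), with an SVM and CNNs as classifiers. The study also compares feature extraction from glottal source waveforms and from raw signal waveforms, where the glottal source waveforms are estimated with two signal processing methods: quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF).
results: The NSA input gives better classification performance than the speech signal. Features from pre-trained models give better accuracy than conventional features for both speech and NSA inputs. HuBERT features perform better than wav2vec2-BASE and wav2vec2-LARGE features.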

Abstract
Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using a SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the use of the NSA input showed better classification performance compared to the speech signal. Between the features, the pre-trained model-based features showed better classification accuracies, both for speech and NSA inputs compared to the conventional features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.

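Summary
Prior work on automatic voice quality classification has mainly used the acoustic speech signal as input. A few recent studies have jointly used speech and neck surface accelerometer (NSA) signals, extracting MFCCs and glottal source features. This study examines simultaneously recorded speech and NSA signals for voice quality classification, using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) with an SVM and CNNs as classifiers. It also compares how well the pre-trained models extract features from glottal source waveforms versus raw signal waveforms, with the glottal source waveforms estimated by quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF). The study has three main goals: (1) to determine whether features derived from pre-trained models improve classification accuracy over conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to determine which of the two modalities (speech vs. NSA) is more effective with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier improves accuracy over the SVM classifier. The results show that the NSA input gives better classification performance than speech, and that pre-trained model-based features give higher accuracy than conventional features for both speech and NSA inputs. HuBERT features also performed better than the wav2vec2-BASE and wav2vec2-LARGE features.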