results: Experimental results show that the proposed method outperforms data-driven approaches in terms of noise robustness and temporal speech coherence.
Abstract
Existing deep learning based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data covering a variety of noise types to remove noise from the noisy signal. However, the high dependence on data limits generalization to the unseen complex noises of real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech with the acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens, improving robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
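To make the auto-regressive, single-stage generation described above more concrete, a minimal PyTorch sketch follows: a causal Transformer decoder predicts the next clean-speech codec token conditioned on previously generated tokens and on aligned noisy-frame features. The module names, feature dimensions, and conditioning scheme are illustrative assumptions, not the paper's TF-Codec implementation.

import torch
import torch.nn as nn

class ConditionalCodeGenerator(nn.Module):
    """Sketch: predict clean-speech codec tokens frame by frame,
    conditioned on (aligned) past noisy-frame embeddings."""
    def __init__(self, codebook_size=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.code_emb = nn.Embedding(codebook_size, d_model)   # embeddings of past clean codes
        self.noisy_proj = nn.Linear(80, d_model)               # e.g. 80-dim noisy features
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)          # next-token logits

    def forward(self, past_codes, noisy_feats):
        # past_codes: (B, T) int tokens; noisy_feats: (B, T, 80) aligned noisy frames.
        # For strict low latency, attention over noisy_feats would also be restricted
        # to past/current frames; that memory mask is omitted here for brevity.
        tgt = self.code_emb(past_codes)
        mem = self.noisy_proj(noisy_feats)
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1)), 1).bool()
        h = self.decoder(tgt, mem, tgt_mask=causal)
        return self.head(h)                                    # (B, T, codebook_size)

model = ConditionalCodeGenerator()
codes = torch.randint(0, 1024, (2, 50))
noisy = torch.randn(2, 50, 80)
logits = model(codes, noisy)                                   # teacher-forced training step
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1024),
                                   codes[:, 1:].reshape(-1))   # next-token prediction loss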
Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection
results: Experiments demonstrate the effectiveness and superiority of the proposed method on the DCASE 2020 Challenge Task 2 development dataset.
Abstract
Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. Autoencoder (AE) based and self-supervised learning based methods are the two mainstream approaches. However, AE-based methods can be limited because the features learned from normal sounds may also fit anomalous sounds, reducing the model's ability to detect anomalies. Self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, anomalous sounds may be short-lived, making them even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the differences in distribution within the same machine type and to enhance the model's ability to distinguish anomalous sounds. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that appear only briefly. Experiments performed on the DCASE 2020 Challenge Task 2 development dataset demonstrate the effectiveness and superiority of our proposed method.
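As a rough illustration of the ID constraint and the weighted anomaly score described above, here is a minimal sketch: a Transformer encoder reconstructs log-mel frames while an auxiliary head predicts the machine ID from the pooled latent, and per-frame reconstruction errors are re-weighted so short anomalous events dominate the clip-level score. The dimensions, pooling choice, and weighting are illustrative assumptions rather than the paper's exact IDC-TransAE configuration.

import torch
import torch.nn as nn

class IDConstrainedTransAE(nn.Module):
    """Sketch: Transformer autoencoder whose latent also feeds a machine-ID classifier."""
    def __init__(self, n_mels=128, d_model=256, n_ids=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.out_proj = nn.Linear(d_model, n_mels)       # frame reconstruction
        self.id_head = nn.Linear(d_model, n_ids)         # ID constraint on the latent

    def forward(self, x):
        z = self.encoder(self.in_proj(x))                 # (B, T, d_model)
        recon = self.out_proj(z)
        id_logits = self.id_head(z.mean(dim=1))           # pooled latent -> machine ID
        return recon, id_logits

model = IDConstrainedTransAE()
x = torch.randn(8, 64, 128)                               # batch of log-mel segments
ids = torch.randint(0, 4, (8,))
recon, id_logits = model(x)
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(id_logits, ids)

# Weighted anomaly score (sketch): emphasize frames with the largest errors so that
# short-lived anomalies are not averaged away over the clip.
frame_err = ((recon - x) ** 2).mean(dim=-1)               # (B, T) per-frame error
weights = torch.softmax(frame_err, dim=1)
score = (weights * frame_err).sum(dim=1)                   # one anomaly score per clip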
Differential Evolution Algorithm based Hyper-Parameters Selection of Convolutional Neural Network for Speech Command Recognition
results: Trained and tested on the Google Speech Command (GSC) dataset, the proposed approach proved effective in classifying speech commands. In addition, a comparative analysis with Genetic Algorithm (GA) based selection and other deep CNN (DCNN) models demonstrated the efficiency of the proposed DE algorithm for CNN hyper-parameter selection in SCR tasks.
Abstract
Speech Command Recognition (SCR), which deals with the identification of short uttered speech commands, is crucial for various applications, including IoT devices and assistive technology. Despite the promise shown by Convolutional Neural Networks (CNNs) in SCR tasks, their efficacy relies heavily on hyper-parameter selection, which is typically laborious and time-consuming when done manually. This paper introduces a hyper-parameter selection method for CNNs based on the Differential Evolution (DE) algorithm, aiming to enhance performance in SCR tasks. Trained and tested on the Google Speech Command (GSC) dataset, the proposed approach proved effective in classifying speech commands. Moreover, a comparative analysis with Genetic Algorithm based selection and other deep CNN (DCNN) models highlighted the efficiency of the proposed DE algorithm in hyper-parameter selection for CNNs in SCR tasks.
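To make the DE-based selection concrete, the sketch below runs a classic DE/rand/1/bin loop over a toy hyper-parameter space (log learning rate, kernel size, number of filters). The fitness function is a stand-in for validation accuracy on GSC, and the bounds and DE settings (population size, F, CR) are illustrative assumptions, not the paper's configuration.

import numpy as np

rng = np.random.default_rng(0)

# Search space: [log10(learning_rate), kernel_size, num_filters]
bounds = np.array([[-4.0, -1.0], [3.0, 9.0], [16.0, 128.0]])

def fitness(params):
    """Stand-in for validation accuracy of a CNN trained with these
    hyper-parameters (the real objective would train and evaluate on GSC)."""
    log_lr, kernel, filters = params[0], round(params[1]), round(params[2])
    # Toy surrogate that peaks at lr=1e-3, kernel=5, 64 filters.
    return -((log_lr + 3) ** 2 + (kernel - 5) ** 2 / 10 + (filters - 64) ** 2 / 1000)

pop_size, n_gen, F, CR = 10, 30, 0.8, 0.9
dim = len(bounds)
pop = rng.uniform(bounds[:, 0], bounds[:, 1], size=(pop_size, dim))
fit = np.array([fitness(p) for p in pop])

for _ in range(n_gen):
    for i in range(pop_size):
        a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)]
        mutant = np.clip(a + F * (b - c), bounds[:, 0], bounds[:, 1])   # DE/rand/1 mutation
        cross = rng.random(dim) < CR
        cross[rng.integers(dim)] = True                                 # binomial crossover
        trial = np.where(cross, mutant, pop[i])
        f_trial = fitness(trial)
        if f_trial >= fit[i]:                                           # greedy selection
            pop[i], fit[i] = trial, f_trial

best = pop[np.argmax(fit)]
print("best hyper-parameters:", 10 ** best[0], round(best[1]), round(best[2]))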
Learning to Behave Like Clean Speech: Dual-Branch Knowledge Distillation for Noise-Robust Fake Audio Detection
results: Experimental results show that the proposed method performs well on multiple datasets and maintains its performance in noisy environments.
Abstract
Most research in fake audio detection (FAD) focuses on improving performance on standard noise-free datasets. However, in real situations there is usually noise interference, which causes significant performance degradation in FAD systems. To improve noise robustness, we propose a dual-branch knowledge distillation fake audio detection (DKDFAD) method. Specifically, a parallel data flow of a clean teacher branch and a noisy student branch is designed, and interactive fusion and a response-based teacher-student paradigm are proposed to guide the training on noisy data from the data-distribution and decision-making perspectives. In the noisy branch, speech enhancement is first introduced for denoising, which reduces the interference of strong noise. The proposed interactive fusion combines denoised features and noisy features to reduce the impact of speech distortion and to seek consistency with the data distribution of the clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, making noisy speech behave like clean speech. In addition, a joint training method is used to optimize the two branches and achieve global optimality. Experimental results on multiple datasets show that the proposed method performs well in noisy environments and maintains its performance in cross-dataset experiments.
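As a minimal sketch of the response-based teacher-student part of this dual-branch idea, the code below distills a clean-branch teacher into a noisy-branch student through a temperature-scaled KL term on the logits, combined with the ordinary detection loss. The backbone, loss weights, and the frozen teacher are illustrative simplifications; the paper's interactive fusion, speech-enhancement front end, and joint optimization of both branches are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch(n_feats=60, n_classes=2):
    # Placeholder backbone standing in for the real FAD classifier.
    return nn.Sequential(nn.Linear(n_feats, 128), nn.ReLU(), nn.Linear(128, n_classes))

teacher = make_branch()            # clean branch (would be trained on clean speech)
student = make_branch()            # noisy branch (takes enhanced/noisy features)
for p in teacher.parameters():
    p.requires_grad = False        # frozen in this sketch; the paper trains both branches jointly

clean_feats = torch.randn(16, 60)
noisy_feats = clean_feats + 0.3 * torch.randn(16, 60)    # simulated noisy counterpart
labels = torch.randint(0, 2, (16,))                       # 0 = bona fide, 1 = fake

t_logits = teacher(clean_feats)
s_logits = student(noisy_feats)

# Hard-label detection loss plus soft-label distillation loss that maps the
# student's decision space onto the teacher's ("behave like clean speech").
T, alpha = 2.0, 0.5
ce = F.cross_entropy(s_logits, labels)
kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
              F.softmax(t_logits / T, dim=1),
              reduction="batchmean") * T * T
loss = alpha * ce + (1 - alpha) * kd
loss.backward()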