eess.AS - 2023-07-28

Efficient Acoustic Echo Suppression with Condition-Aware Training

  • paper_url: http://arxiv.org/abs/2307.15630
  • repo_url: None
  • paper_authors: Ernst Seidel, Pejman Mowlaee, Tim Fingscheidt
  • for: This paper focuses on improving deep acoustic echo control (DAEC) methods, aiming at better performance in double-talk conditions.
  • methods: The paper uses a convolutional recurrent network (CRN) consisting of a convolutional encoder and decoder with a recurrent bottleneck, which preserves near-end speech during double-talk.
  • results: The proposed network outperforms both baseline architectures, FCRN and CRUSE, in double-talk while using fewer parameters and less computational complexity.
    Abstract The topic of deep acoustic echo control (DAEC) has seen many approaches with various model topologies in recent years. Convolutional recurrent networks (CRNs), consisting of a convolutional encoder and decoder encompassing a recurrent bottleneck, are repeatedly employed due to their ability to preserve nearend speech even in double-talk (DT) condition. However, past architectures are either computationally complex or trade off smaller model sizes with a decrease in performance. We propose an improved CRN topology which, compared to other realizations of this class of architectures, not only saves parameters and computational complexity, but also shows improved performance in DT, outperforming both baseline architectures FCRN and CRUSE. Striving for a condition-aware training, we also demonstrate the importance of a high proportion of double-talk and the missing value of nearend-only speech in DAEC training data. Finally, we show how to control the trade-off between aggressive echo suppression and near-end speech preservation by fine-tuning with condition-aware component loss functions.
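The condition-aware component loss used for fine-tuning can be sketched roughly as follows. This is a minimal illustration, not the authors' exact formulation: the per-condition weights, the mask-based decomposition into speech and echo components, and all names are assumptions chosen to show how the echo-suppression/near-end-preservation trade-off can be steered per training condition.

```python
import numpy as np

# Hypothetical per-condition weights (alpha: near-end preservation,
# beta: echo suppression); the paper's actual terms and values may differ.
CONDITION_WEIGHTS = {
    "double_talk":  (1.0, 0.5),
    "farend_only":  (0.0, 1.0),
    "nearend_only": (1.0, 0.0),
}

def component_loss(mask, S_near, D_echo, condition):
    """Condition-aware component loss on magnitude spectrograms.

    The estimated suppression mask is applied separately to the near-end
    speech component and the echo component, so near-end distortion and
    residual echo can be penalized with condition-dependent weights.
    """
    alpha, beta = CONDITION_WEIGHTS[condition]
    speech_distortion = np.mean((mask * S_near - S_near) ** 2)  # hurts near-end
    residual_echo = np.mean((mask * D_echo) ** 2)               # leftover echo
    return alpha * speech_distortion + beta * residual_echo
```

Raising beta relative to alpha during fine-tuning yields more aggressive echo suppression at the cost of near-end distortion, mirroring the trade-off the paper controls.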

A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment

  • paper_url: http://arxiv.org/abs/2307.15611
  • repo_url: https://github.com/aircarlo/bin2bin-gan-plc
  • paper_authors: Carlo Aironi, Samuele Cornell, Luca Serafini, Stefano Squartini
  • for: This paper targets the problem of voice quality degradation in VoIP transmissions caused by packet loss, and proposes a generative adversarial approach to repair lost fragments during audio stream transmission.
  • methods: The proposed method, called bin2bin, is based on an improved pix2pix framework and uses a combination of two STFT-based loss functions and a modified PatchGAN structure as discriminator to translate magnitude spectrograms of audio frames with lost packets to noncorrupted speech spectrograms.
  • results: Experimental results show that the proposed method has obvious advantages compared to current state-of-the-art methods, particularly in handling high packet loss rates and large gaps.
    Abstract Packet loss is a major cause of voice quality degradation in VoIP transmissions with serious impact on intelligibility and user experience. This paper describes a system based on a generative adversarial approach, which aims to repair the lost fragments during the transmission of audio streams. Inspired by the powerful image-to-image translation capability of Generative Adversarial Networks (GANs), we propose bin2bin, an improved pix2pix framework to achieve the translation task from magnitude spectrograms of audio frames with lost packets, to noncorrupted speech spectrograms. In order to better maintain the structural information after spectrogram translation, this paper introduces the combination of two STFT-based loss functions, mixed with the traditional GAN objective. Furthermore, we employ a modified PatchGAN structure as discriminator and we lower the concealment time by a proper initialization of the phase reconstruction algorithm. Experimental results show that the proposed method has obvious advantages when compared with the current state-of-the-art methods, as it can better handle both high packet loss rates and large gaps.
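The generator objective, mixing two STFT-based loss terms with the adversarial term, can be sketched roughly as below. The summary does not spell out the exact terms, so this assumes a common pairing (spectral convergence plus log-magnitude L1) and hypothetical weights `lam_sc`/`lam_mag`; the paper's actual losses may differ.

```python
import numpy as np

def spectral_convergence(S_ref, S_est):
    # Frobenius-norm distance between magnitude spectrograms,
    # normalized by the reference energy.
    return np.linalg.norm(S_ref - S_est) / np.linalg.norm(S_ref)

def log_magnitude_l1(S_ref, S_est, eps=1e-7):
    # L1 distance in the log-magnitude domain, emphasizing low-energy bins.
    return np.mean(np.abs(np.log(S_ref + eps) - np.log(S_est + eps)))

def generator_loss(d_fake, S_ref, S_est, lam_sc=1.0, lam_mag=1.0):
    # Non-saturating adversarial term over the PatchGAN discriminator's
    # per-patch outputs, mixed with the two STFT-based terms.
    adv = -np.mean(np.log(d_fake + 1e-7))
    return (adv
            + lam_sc * spectral_convergence(S_ref, S_est)
            + lam_mag * log_magnitude_l1(S_ref, S_est))
```

A PatchGAN discriminator outputs a grid of real/fake scores, one per spectrogram patch, which is why `d_fake` above is averaged rather than a single scalar.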