eess.AS - 2023-11-03

SE Territory: Monaural Speech Enhancement Meets the Fixed Virtual Perceptual Space Mapping

  • paper_url: http://arxiv.org/abs/2311.01679
  • repo_url: None
  • paper_authors: Xinmeng Xu, Jibin Wu, Xiaoyong Wei, Yan Liu, Richard So, Yuhong Yang, Weiping Tu, Kay Chen Tan
  • for: Improving the performance of monaural (single-microphone) speech enhancement
  • methods: Proposes mapping monaural speech into a fixed simulation space to better distinguish target speech from noise. The approach is a two-stage multi-task learning framework: supervised speech mapping blocks first project the monaural input into a virtual binaural space, and cross-attention then captures the virtual directional information in that space to improve extraction of the target speech (see the sketches after the abstract).
  • results: The proposed SE-TerrNet significantly outperforms recent monaural speech enhancement methods in both speech quality and intelligibility.
    Abstract Monaural speech enhancement has achieved remarkable progress recently. However, its performance has been constrained by the limited spatial cues available at a single microphone. To overcome this limitation, we introduce a strategy to map monaural speech into a fixed simulation space for better differentiation between target speech and noise. Concretely, we propose SE-TerrNet, a novel monaural speech enhancement model featuring a virtual binaural speech mapping network via a two-stage multi-task learning framework. In the first stage, monaural noisy input is projected into a virtual space using supervised speech mapping blocks, creating binaural representations. These blocks synthesize binaural noisy speech from monaural input via an ideal binaural room impulse response. The synthesized output assigns speech and noise sources to fixed directions within the perceptual space. In the second stage, the obtained binaural features from the first stage are aggregated. This aggregation aims to decrease pattern discrepancies between the mapped binaural and original monaural features, achieved by implementing an intermediate fusion module. Furthermore, this stage uses cross-attention to capture the injected virtual spatial information and improve extraction of the target speech. Empirical studies highlight the effectiveness of virtual spatial cues for monaural speech enhancement. As a result, the proposed SE-TerrNet significantly surpasses recent monaural speech enhancement methods in terms of both speech quality and intelligibility.
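
The first stage trains the mapping blocks against binaural targets obtained by convolving the monaural sources with an ideal binaural room impulse response (BRIR), so that speech and noise land at fixed directions in the virtual space. As a minimal sketch of how such training targets could be synthesized (not the authors' code; the function name and BRIR variables are hypothetical, and the paper lists no repository):

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_virtual_binaural(mono, brir_left, brir_right):
    """Render a monaural signal at a fixed virtual direction by convolving
    it with the left/right channels of a binaural room impulse response."""
    left = fftconvolve(mono, brir_left, mode="full")[: len(mono)]
    right = fftconvolve(mono, brir_right, mode="full")[: len(mono)]
    return np.stack([left, right])  # shape: (2, num_samples)

# Hypothetical usage: place speech and noise at distinct fixed directions,
# e.g. speech at 0 degrees and noise at 90 degrees, then mix.
# speech_bin = synthesize_virtual_binaural(speech, brir_0_l, brir_0_r)
# noise_bin  = synthesize_virtual_binaural(noise,  brir_90_l, brir_90_r)
# noisy_bin  = speech_bin + noise_bin  # stage-1 supervision target
```

Placing the two sources at well-separated virtual directions is what gives the later stage a spatial cue that a single microphone alone cannot provide.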
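The second stage fuses the mapped binaural features back with the original monaural features and applies cross-attention to exploit the injected spatial cues. The sketch below shows one generic way such a fusion could look, assuming time-aligned feature maps of equal dimension; `CrossAttentionFusion` and all sizes are illustrative assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse virtual-binaural features with the original monaural features:
    monaural queries attend over the binaural key/value stream, so the
    injected spatial cues guide target-speech extraction."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mono_feats, binaural_feats):
        # mono_feats:     (batch, time, dim) - original monaural features
        # binaural_feats: (batch, time, dim) - stage-1 virtual binaural features
        fused, _ = self.attn(query=mono_feats,
                             key=binaural_feats,
                             value=binaural_feats)
        return self.norm(mono_feats + fused)  # residual connection

# Hypothetical usage:
# fusion = CrossAttentionFusion(dim=256)
# out = fusion(mono_feats, binaural_feats)  # (batch, time, 256)
```

Using the monaural stream as the query keeps the output aligned with the original input while letting the virtual directional information steer which regions of the feature map are emphasized.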