paper_authors: Jialu Li, Junhui Li, Pu Wang, Youshan Zhang
for: To improve the performance of speech denoising
methods: Proposes a deep complex hybrid transformer that integrates both spectrogram- and waveform-domain approaches to improve speech denoising performance
results: Experiments show that the method outperforms state-of-the-art methods in speech denoising on the BirdSoundsDenoising dataset and the VCTK+DEMAND dataset.
Abstract
Most current deep learning-based approaches for speech enhancement operate only in the spectrogram or waveform domain. Although a cross-domain transformer combining waveform- and spectrogram-domain inputs has been proposed, its performance can be further improved. In this paper, we present a novel deep complex hybrid transformer that integrates spectrogram- and waveform-domain approaches to improve the performance of speech enhancement. The proposed model consists of two parts: a complex Swin-Unet in the spectrogram domain and a dual-path transformer network (DPTnet) in the waveform domain. We first construct a complex Swin-Unet network in the spectrogram domain and perform speech enhancement in the complex audio spectrum. We then introduce an improved DPT by adding memory-compressed attention. Our model is capable of learning multi-domain features to reduce existing noise in different domains in a complementary way. The experimental results on the BirdSoundsDenoising dataset and the VCTK+DEMAND dataset indicate that our method achieves better performance than state-of-the-art methods.
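To make the two-branch design concrete, below is a minimal PyTorch sketch of a cascaded spectrogram-then-waveform denoiser. It is an illustration only: the SpectrogramBranch, WaveformBranch, and HybridDenoiser classes are hypothetical stand-ins (plain convolutions and a generic transformer encoder), not the paper's complex Swin-Unet or its DPT with memory-compressed attention, and the STFT settings are assumed.

```python
import torch
import torch.nn as nn

class SpectrogramBranch(nn.Module):
    """Stand-in for the complex Swin-Unet branch: maps a complex
    spectrogram (real/imag stacked as channels) to a denoised spectrum."""
    def __init__(self, channels=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, spec_ri):           # (B, 2, F, T)
        return self.net(spec_ri)

class WaveformBranch(nn.Module):
    """Stand-in for the dual-path transformer branch: refines the
    time-domain signal produced by the spectrogram branch."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8, padding=4)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, wav):               # (B, 1, L)
        feats = self.encoder(wav)          # (B, dim, L')
        feats = self.transformer(feats.transpose(1, 2)).transpose(1, 2)
        return self.decoder(feats)

class HybridDenoiser(nn.Module):
    """Cascade the two branches: denoise in the complex spectrum first,
    invert the STFT, then refine in the waveform domain."""
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        self.spec_branch = SpectrogramBranch()
        self.wave_branch = WaveformBranch()

    def forward(self, noisy):             # (B, L)
        spec = torch.stft(noisy, self.n_fft, self.hop,
                          window=self.window, return_complex=True)
        spec_ri = torch.stack([spec.real, spec.imag], dim=1)    # (B, 2, F, T)
        denoised_ri = self.spec_branch(spec_ri)
        denoised = torch.complex(denoised_ri[:, 0], denoised_ri[:, 1])
        wav = torch.istft(denoised, self.n_fft, self.hop,
                          window=self.window, length=noisy.shape[-1])
        return self.wave_branch(wav.unsqueeze(1)).squeeze(1)     # (B, L)

if __name__ == "__main__":
    model = HybridDenoiser()
    noisy = torch.randn(2, 16000)          # two 1-second clips at 16 kHz
    clean_est = model(noisy)
    print(clean_est.shape)                 # torch.Size([2, 16000])
```

The cascade ordering (spectrogram branch feeding the waveform branch) is one plausible reading of "complementary" multi-domain processing; the paper's actual fusion may differ.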
Sound of Story: Multi-modal Storytelling with Audio
results: Through experiments, the authors show that the proposed dataset and tasks can help researchers better understand the multi-modal expression of stories, and they introduce strong baselines for these tasks. The dataset and code will be released at: https://github.com/Sosdatasets/SoS_Dataset.
Abstract
Storytelling is multi-modal in the real world. When one tells a story, one may use visuals and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound, even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend the story understanding and telling areas by establishing a new component called "background sound", which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called "Sound of Story (SoS)", which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound and the dataset, we propose retrieval tasks between modalities and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at: https://github.com/Sosdatasets/SoS_Dataset.
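The retrieval benchmarks are defined over paired modalities (for example, a story's image-text sequence versus its audio). As a rough illustration of how such a task is commonly scored, and not the authors' released evaluation code, the sketch below computes recall@K from cosine similarities between hypothetical precomputed embeddings.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Score a cross-modal retrieval task (e.g. story -> audio) where
    query i is paired with gallery item i, using cosine similarity."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).T
    ranks = sims.argsort(dim=-1, descending=True)          # (N, N) gallery indices
    targets = torch.arange(len(query_emb)).unsqueeze(1)    # gold index per query
    hit_rank = (ranks == targets).float().argmax(dim=-1)   # rank of the gold item
    return {f"R@{k}": (hit_rank < k).float().mean().item() for k in ks}

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for story (image+text) and audio embeddings from any encoder.
    story = torch.randn(100, 256)
    audio = story + 0.1 * torch.randn(100, 256)             # loosely aligned pairs
    print(recall_at_k(story, audio))
```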