cs.SD - 2023-08-24

Sparks of Large Audio Models: A Survey and Outlook

  • paper_url: http://arxiv.org/abs/2308.12792
  • repo_url: None
  • paper_authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Heriberto Cuayáhuitl, Björn W. Schuller
  • for: This paper provides an overview of recent advances and challenges in applying large language models to audio processing.
  • methods: The surveyed approaches are transformer-based architectures trained on massive amounts of data; the paper offers an in-depth analysis and evaluation of these models.
  • results: These Foundational Audio Models perform strongly across a wide range of audio tasks, including automatic speech recognition, text-to-speech, and music generation; notably, they can also act as universal translators, supporting multiple speech tasks across many languages.
    Abstract This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, these Foundational Audio Models, like SeamlessM4T, have recently started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will consistently update the relevant repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.

WavMark: Watermarking for Audio Generation

  • paper_url: http://arxiv.org/abs/2308.12770
  • repo_url: None
  • paper_authors: Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei
  • for: This paper proposes an audio watermarking framework as a proactive defence against voice fraud and speaker impersonation.
  • methods: The framework encodes up to 32 bits of watermark within a 1-second audio snippet; the watermark is imperceptible to humans and strongly robust against various attacks.
  • results: Using 10-20 second audio as the host, the framework achieves an average Bit Error Rate of 0.48% across ten common attacks, a reduction of over 2800% in BER compared with the state-of-the-art watermarking tool.
    Abstract Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
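The robustness numbers above are reported as an average Bit Error Rate (BER) over simulated attacks. As a rough, hypothetical illustration of how that metric is computed (this is not WavMark's embedding or decoding pipeline; the attack names and flip rate below are placeholders), a minimal sketch:

```python
import numpy as np

def bit_error_rate(embedded: np.ndarray, decoded: np.ndarray) -> float:
    """Fraction of watermark bits that were recovered incorrectly."""
    return float(np.mean(embedded != decoded))

# Hypothetical usage: a 32-bit payload embedded in a host clip, then decoded
# again after each simulated attack; here the decoder is faked by flipping
# roughly 0.5% of the bits at random.
rng = np.random.default_rng(0)
payload = rng.integers(0, 2, size=32)
attacks = ["mp3_compression", "resampling", "additive_noise"]  # placeholder names
bers = []
for attack in attacks:
    decoded = payload.copy()
    decoded[rng.random(32) < 0.005] ^= 1
    bers.append(bit_error_rate(payload, decoded))
print(f"average BER over {len(attacks)} attacks: {np.mean(bers):.4%}")
```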

Whombat: An open-source annotation tool for machine learning development in bioacoustics

  • paper_url: http://arxiv.org/abs/2308.12688
  • repo_url: None
  • paper_authors: Santiago Martinez Balvanera, Oisin Mac Aodha, Matthew J. Weldy, Holly Pringle, Ella Browning, Kate E. Jones
  • for: This paper aims to scale up biodiversity monitoring by supporting the automated, machine-learning-based analysis of bioacoustic recordings.
  • methods: High-stakes ML applications such as conservation research require a data-centric approach built on carefully annotated and curated training and evaluation data; the paper presents tooling to make such annotation practical and reliable.
  • results: The paper introduces Whombat, a user-friendly browser-based interface for managing audio recordings and annotation projects, with visualization, exploration, and annotation tools that let users quickly annotate, review, and share annotations, and visualize and evaluate machine learning predictions on a dataset.
    Abstract 1. Automated analysis of bioacoustic recordings using machine learning (ML) methods has the potential to greatly scale biodiversity monitoring efforts. The use of ML for high-stakes applications, such as conservation research, demands a data-centric approach with a focus on utilizing carefully annotated and curated evaluation and training data that is relevant and representative. Creating annotated datasets of sound recordings presents a number of challenges, such as managing large collections of recordings with associated metadata, developing flexible annotation tools that can accommodate the diverse range of vocalization profiles of different organisms, and addressing the scarcity of expert annotators. 2. We present Whombat, a user-friendly, browser-based interface for managing audio recordings and annotation projects, with several visualization, exploration, and annotation tools. It enables users to quickly annotate, review, and share annotations, as well as visualize and evaluate a set of machine learning predictions on a dataset. The tool facilitates an iterative workflow in which user annotations and machine learning predictions feed back into each other to enhance model performance and annotation quality. 3. We demonstrate the flexibility of Whombat by showcasing two distinct use cases: a project aimed at enhancing automated UK bat call identification at the Bat Conservation Trust (BCT), and a collaborative effort between USDA Forest Service and Oregon State University researchers exploring bioacoustic applications and extending automated avian classification models in the Pacific Northwest, USA. 4. Whombat is a flexible tool that can effectively address the challenges of annotation for bioacoustic research. It can be used for individual and collaborative work, hosted on a shared server or accessed remotely, or run on a personal computer without the need for coding skills.

NAaLoss: Rethinking the objective of speech enhancement

  • paper_url: http://arxiv.org/abs/2308.12615
  • repo_url: None
  • paper_authors: Kuan-Hsun Ho, En-Lun Yu, Jeih-weih Hung, Berlin Chen
  • for: This paper aims to improve the performance of automatic speech recognition (ASR) in noisy environments by reducing the impact of processing artifacts generated by single-channel speech enhancement (SE) methods.
  • methods: The paper proposes a novel Noise- and Artifacts-aware loss function (NAaLoss) that considers the loss of estimation, de-artifact, and noise ignorance to improve the quality of SE.
  • results: Experimental results show that NAaLoss significantly improves the ASR performance of most setups while preserving the quality of SE, as demonstrated through visualizations of artifacts in waveforms and spectrograms.
    Abstract Reducing noise interference is crucial for automatic speech recognition (ASR) in a real-world scenario. However, most single-channel speech enhancement (SE) generates "processing artifacts" that negatively affect ASR performance. Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. NAaLoss considers the loss of estimation, de-artifact, and noise ignorance, enabling the learned SE to individually model speech, artifacts, and noise. We examine two SE models (simple/advanced) learned with NAaLoss under various input scenarios (clean/noisy) using two configurations of the ASR system (with/without noise robustness). Experiments reveal that NAaLoss significantly improves the ASR performance of most setups while preserving the quality of SE toward perception and intelligibility. Furthermore, we visualize artifacts through waveforms and spectrograms, and explain their impact on ASR.
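The abstract describes NAaLoss as combining an estimation term, a de-artifact term, and a noise-ignorance term so the enhancement model can separately account for speech, artifacts, and noise. The exact formulation is not given in this summary, so the sketch below is only a schematic three-term objective; the L1 distances, the inputs, and the weights are assumptions rather than the paper's definition.

```python
import torch.nn.functional as F

def naa_style_loss(est_speech, clean_speech, est_artifact, est_noise, noise,
                   w_est=1.0, w_art=0.1, w_noise=0.1):
    """Schematic three-term objective: estimation, de-artifact, noise ignorance.
    Weights and distance choices are illustrative, not the published NAaLoss."""
    loss_est = F.l1_loss(est_speech, clean_speech)    # match the clean speech
    loss_art = est_artifact.abs().mean()              # push the artifact estimate toward zero
    loss_noise = F.l1_loss(est_noise, noise)          # model the noise so it can be ignored
    return w_est * loss_est + w_art * loss_art + w_noise * loss_noise
```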

Emotion-Aligned Contrastive Learning Between Images and Music

  • paper_url: http://arxiv.org/abs/2308.12610
  • repo_url: None
  • paper_authors: Shanti Stewart, Tiantian Feng, Kleanthis Avramidis, Shrikanth Narayanan
  • for: This paper addresses the task of retrieving emotionally relevant music from image queries.
  • methods: The approach learns an affective alignment between images and music audio via emotion-supervised cross-modal contrastive learning, using an adapted cross-modal version of the SupCon loss.
  • results: The method successfully aligns images and music, and the learned joint embedding space is effective for cross-modal retrieval applications.
    Abstract Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. Most approaches aim to match general music semantics to the input queries, while only a few focus on affective qualities. We address the task of retrieving emotionally-relevant music from image queries by proposing a framework for learning an affective alignment between images and music audio. Our approach focuses on learning an emotion-aligned joint embedding space between images and music. This joint embedding space is learned via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We directly evaluate the joint embeddings with cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. In addition, we investigate the generalizability of the learned music embeddings with automatic music tagging as a downstream task. Our experiments show that our approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications.
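The joint embedding space is trained with an emotion-supervised, cross-modal adaptation of the SupCon loss. A minimal sketch of what such a loss can look like is given below; it assumes batch-aligned image and music embeddings with shared emotion labels and is an illustrative adaptation, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_supcon(img_emb, music_emb, emotion_labels, temperature=0.07):
    """Emotion-supervised contrastive loss across modalities: image-music pairs
    that share an emotion label are treated as positives (illustrative only)."""
    img = F.normalize(img_emb, dim=1)        # (N, D) image embeddings
    mus = F.normalize(music_emb, dim=1)      # (N, D) music embeddings
    logits = img @ mus.t() / temperature     # (N, N) image-to-music similarities
    positives = emotion_labels.unsqueeze(1).eq(emotion_labels.unsqueeze(0)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(positives * log_prob).sum(1) / positives.sum(1).clamp(min=1)
    return loss.mean()
```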

Hybrid noise shaping for audio coding using perfectly overlapped window

  • paper_url: http://arxiv.org/abs/2308.12566
  • repo_url: None
  • paper_authors: Byeongho Jo, Seungkwon Beack
  • for: Low-bit-rate audio coding.
  • methods: A modulated complex lapped transform-based coding framework that integrates transform coded excitation (TCX) with complex LPC-based temporal noise shaping (CTNS), using a 50% overlap window and a switching scheme to improve coding efficiency.
  • results: Objective metrics and subjective listening tests show that the proposed framework delivers superior low-bit-rate audio coding performance.
    Abstract In recent years, audio coding technology has been standardized based on several frameworks that incorporate linear predictive coding (LPC). However, coding the transient signal using frequency-domain LP residual signals remains a challenge. To address this, temporal noise shaping (TNS) can be adapted, although it cannot be effectively operated since the estimated temporal envelope in the modified discrete cosine transform (MDCT) domain is accompanied by the time-domain aliasing (TDA) terms. In this study, we propose the modulated complex lapped transform-based coding framework integrated with transform coded excitation (TCX) and complex LPC-based TNS (CTNS). Our approach uses a 50\% overlap window and switching scheme for the CTNS to improve the coding efficiency. Additionally, an adaptive calculation of the target bits for the sub-bands using the frequency envelope information based on the quantized LPC coefficients is proposed. To minimize the quantization mismatch between both modes, an integrated quantization for real and complex values and a TDA augmentation method that compensates for the artificially generated TDA components during switching operations are proposed. The proposed coding framework shows a superior performance in both objective metrics and subjective listening tests, thereby demonstrating its low bit-rate audio coding.
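One component of the framework is an adaptive calculation of target bits per sub-band from frequency-envelope information. The toy sketch below only illustrates the general idea of envelope-driven bit allocation; the paper's actual rule, which derives the envelope from quantized LPC coefficients, is not reproduced, and the proportional split and example numbers are assumptions.

```python
import numpy as np

def allocate_subband_bits(envelope_db: np.ndarray, total_bits: int) -> np.ndarray:
    """Toy proportional split of a frame's bit budget across sub-bands,
    giving bands with a higher envelope level more bits (illustrative only)."""
    weights = np.maximum(envelope_db - envelope_db.min(), 1e-3)
    bits = np.floor(total_bits * weights / weights.sum()).astype(int)
    bits[int(np.argmax(weights))] += total_bits - bits.sum()  # hand rounding leftovers to the strongest band
    return bits

print(allocate_subband_bits(np.array([10.0, 25.0, 40.0, 5.0]), total_bits=256))
```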

MultiPA: a multi-task speech pronunciation assessment system for a closed and open response scenario

  • paper_url: http://arxiv.org/abs/2308.12490
  • repo_url: None
  • paper_authors: Yu-Wen Chen, Zhou Yu, Julia Hirschberg
  • for: This work proposes an automatic speech pronunciation assessment system that works in both closed and open response scenarios, catering to diverse learning needs and providing a more precise and holistic assessment of pronunciation skills.
  • methods: The system, MultiPA, uses a multi-task learning approach that covers both sentence-level and word-level assessment, with simpler format requirements and better compatibility with other neural network models than Kaldi-based systems.
  • results: Experiments show performance comparable to previous systems in the closed response scenario and more robust performance when the model is used directly for open responses.
    Abstract The design of automatic speech pronunciation assessment can be categorized into closed and open response scenarios, each with strengths and limitations. A system with the ability to function in both scenarios can cater to diverse learning needs and provide a more precise and holistic assessment of pronunciation skills. In this study, we propose a Multi-task Pronunciation Assessment model called MultiPA. MultiPA provides an alternative to Kaldi-based systems in that it has simpler format requirements and better compatibility with other neural network models. Compared with previous open response systems, MultiPA provides a wider range of evaluations, encompassing assessments at both the sentence and word-level. Our experimental results show that MultiPA achieves comparable performance when working in closed response scenarios and maintains more robust performance when directly used for open responses.
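MultiPA is described as a multi-task model producing both sentence-level and word-level assessments on top of a shared speech encoder. The sketch below shows one generic way to attach such heads; the encoder, layer sizes, and score dimensions are assumptions, not MultiPA's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskPronunciationHead(nn.Module):
    """Illustrative multi-task head: word-level and sentence-level score
    regressors on top of shared frame-level encoder features."""
    def __init__(self, feat_dim=768, num_word_scores=1, num_sent_scores=1):
        super().__init__()
        self.word_head = nn.Linear(feat_dim, num_word_scores)
        self.sent_head = nn.Linear(feat_dim, num_sent_scores)

    def forward(self, frame_feats, word_spans):
        # frame_feats: (T, D) encoder outputs; word_spans: list of (start, end) frame indices
        word_vecs = torch.stack([frame_feats[s:e].mean(0) for s, e in word_spans])
        word_scores = self.word_head(word_vecs)            # per-word pronunciation scores
        sent_scores = self.sent_head(frame_feats.mean(0))  # utterance-level scores
        return word_scores, sent_scores
```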

Attention-Based Acoustic Feature Fusion Network for Depression Detection

  • paper_url: http://arxiv.org/abs/2308.12478
  • repo_url: https://github.com/xuxiaoooo/abafnet
  • paper_authors: Xiao Xu, Yang Wang, Xinru Wei, Fei Wang, Xizhe Zhang
  • for: This paper proposes a novel attention-based acoustic feature fusion network (ABAFnet) for speech-based depression detection.
  • methods: Four different acoustic features are combined in a deep learning model to effectively integrate multi-tiered features, together with a novel weight-adjustment module for late fusion that boosts detection performance.
  • results: Experiments on two clinical speech databases show that ABAFnet outperforms previous methods in depression detection and subtype classification; further analysis confirms the key role of each feature and highlights the importance of MFCC-related features in speech-based depression detection.
    Abstract Depression, a common mental disorder, significantly influences individuals and imposes considerable societal impacts. The complexity and heterogeneity of the disorder necessitate prompt and effective detection, which nonetheless poses a difficult challenge. This situation highlights an urgent requirement for improved detection methods. Exploiting auditory data through advanced machine learning paradigms presents promising research directions. Yet, existing techniques mainly rely on single-dimensional feature models, potentially neglecting the abundance of information hidden in various speech characteristics. To rectify this, we present the novel Attention-Based Acoustic Feature Fusion Network (ABAFnet) for depression detection. ABAFnet combines four different acoustic features into a comprehensive deep learning model, thereby effectively integrating and blending multi-tiered features. We present a novel weight adjustment module for late fusion that boosts performance by efficaciously synthesizing these features. The effectiveness of our approach is confirmed via extensive validation on two clinical speech databases, CNRAC and CS-NRAC, thereby outperforming previous methods in depression detection and subtype classification. Further in-depth analysis confirms the key role of each feature and highlights the importance of MFCC-related features in speech-based depression detection.
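The abstract highlights a weight-adjustment module for late fusion of the four acoustic feature branches. The sketch below shows a generic attention-style late fusion with one learned weight per branch; it is a stand-in for the idea, not ABAFnet's actual module, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttentionLateFusion(nn.Module):
    """Illustrative late fusion: learn a softmax weight per feature branch,
    combine the branch embeddings, then classify (not ABAFnet's exact design)."""
    def __init__(self, num_branches=4, emb_dim=128, num_classes=2):
        super().__init__()
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, branch_embs):                           # (B, num_branches, emb_dim)
        weights = torch.softmax(self.branch_logits, dim=0)    # one weight per acoustic feature set
        fused = (branch_embs * weights.view(1, -1, 1)).sum(dim=1)
        return self.classifier(fused)
```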

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

  • paper_url: http://arxiv.org/abs/2308.12408
  • repo_url: None
  • paper_authors: Matthew Martel, Jackson Wagner
  • for: This paper aims to develop a deep-learning framework that generates realistic, Foley-style audio effects for silent video in movies and other media.
  • methods: Several model architectures are explored that process both video context and previously generated audio, including a deep-fusion CNN, a dilated WaveNet CNN with visual context, and transformer-based architectures.
  • results: The transformer-based architecture yields the most promising results, matching low-frequency content to visual patterns effectively, but it fails to generate more nuanced waveforms.
    Abstract Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art. Foley artists create sounds with common objects (e.g., boxing gloves, broken glass) in time with video as it is playing to generate captivating audio tracks. In this work, we aim to develop a deep-learning based framework that does much the same - observes video in its natural sequence and generates realistic audio to accompany it. Notably, we have reason to believe this is achievable due to advancements in realistic audio generation techniques conditioned on other inputs (e.g., Wavenet conditioned on text). We explore several different model architectures to accomplish this task that process both previously-generated audio and video context. These include deep-fusion CNN, dilated Wavenet CNN with visual context, and transformer-based architectures. We find that the transformer-based architecture yields the most promising results, matching low-frequencies to visual patterns effectively, but failing to generate more nuanced waveforms.
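The best-performing variant is a transformer that conditions audio generation on visual context while attending to previously generated audio. The sketch below shows a minimal decoder of that general shape; the token vocabulary, dimensions, and use of a standard transformer decoder are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoConditionedAudioDecoder(nn.Module):
    """Minimal sketch: a transformer decoder that cross-attends over video-frame
    embeddings while autoregressively predicting quantized audio tokens."""
    def __init__(self, audio_vocab=256, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.audio_embed = nn.Embedding(audio_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, audio_vocab)

    def forward(self, audio_tokens, video_feats):
        # audio_tokens: (B, T_a) previously generated tokens; video_feats: (B, T_v, d_model)
        x = self.audio_embed(audio_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, video_feats, tgt_mask=causal)
        return self.out(h)   # next-token logits over the audio codebook
```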

AdVerb: Visually Guided Audio Dereverberation

  • paper_url: http://arxiv.org/abs/2308.12370
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
  • for: This paper presents a visually guided audio dereverberation method that uses visual cues alongside the reverberant sound to improve audio quality.
  • methods: A novel geometry-aware cross-modal transformer captures scene geometry and the audio-visual cross-modal relationship to generate a complex ideal ratio mask, which is applied to the reverberant audio to estimate the clean sound.
  • results: The method outperforms traditional audio-only and audio-visual baselines on three downstream tasks (speech enhancement, speech recognition, and speaker verification), with relative improvements of 18%-82% on the LibriSpeech test-clean set, and achieves highly satisfactory RT60 error scores on the AVSpeech dataset.
    Abstract We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.
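AdVerb predicts a complex ideal ratio mask (cIRM) that, applied to the reverberant signal, yields the clean estimate. The sketch below only illustrates what such a mask is and how it is applied in the STFT domain; the visual conditioning, the geometry-aware cross-modal transformer, and the mask compression commonly used in practice are all omitted.

```python
import numpy as np

def complex_ideal_ratio_mask(clean_stft: np.ndarray, reverb_stft: np.ndarray,
                             eps: float = 1e-8) -> np.ndarray:
    """Per time-frequency-bin complex mask mapping the reverberant STFT back to
    the clean STFT (the kind of target a mask predictor would regress)."""
    return clean_stft / (reverb_stft + eps)

def apply_mask(reverb_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise complex multiplication: rescales magnitude and rotates phase
    in each bin to estimate the clean spectrogram."""
    return reverb_stft * mask
```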