results: This study identifies consistent differences between HRTFs from different databases and proposes a new method to normalize HRTFs, yielding a more unified HRTF representation.
Abstract
Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learning models is a promising approach at scale. Training such models requires a unified HRTF representation across multiple databases to utilize their respectively limited samples. However, in addition to differences in spatial sampling locations, recent studies have shown that, even for common locations, HRTFs across databases manifest consistent differences that make it trivial to tell which database they come from. This poses a significant challenge for learning a unified HRTF representation across databases. In this work, we first identify the possible causes of these cross-database differences, attributing them to variations in the measurement setup. Then, we propose a novel approach to normalize the frequency responses of HRTFs across databases. We show that HRTFs from different databases can no longer be classified by their database of origin after normalization. We further show that these normalized HRTFs can be used to learn a more unified HRTF representation across databases than the prior art. We believe that this normalization approach paves the way for many data-intensive tasks in HRTF modeling.
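The abstract does not spell out the normalization formula, so the snippet below is only a minimal sketch of one plausible per-database frequency-response normalization: each HRTF magnitude spectrum is divided by the database-wide average magnitude response, which would cancel setup-dependent spectral coloration shared by all measurements in that database. The array layout, function name, and equalization scheme are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def normalize_database_hrtfs(hrtfs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Illustrative per-database frequency-response normalization (assumed scheme).

    hrtfs: complex array of shape (subjects, directions, ears, freq_bins)
           holding the HRTFs of ONE database on a common frequency grid.
    Each magnitude spectrum is equalized by the database-wide average
    magnitude response, so spectral coloration common to the whole database
    (e.g. loudspeaker/microphone responses) is largely removed.
    """
    mag = np.abs(hrtfs)
    # Average magnitude response over subjects and directions, per ear and frequency bin.
    db_mean = mag.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, ears, freq_bins)
    # Divide out the shared response; the original phase is left untouched.
    return hrtfs / (db_mean + eps)

# Hypothetical usage: each database is normalized independently before
# pooling the data to learn a cross-database HRTF representation.
rng = np.random.default_rng(0)
fake_db = rng.normal(size=(5, 100, 2, 128)) + 1j * rng.normal(size=(5, 100, 2, 128))
normalized = normalize_database_hrtfs(fake_db)
print(normalized.shape)  # (5, 100, 2, 128)
```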
paper_authors: Cai Yu, Peng Chen, Jiahe Tian, Jin Liu, Jiao Dai, Xi Wang, Yesheng Chai, Jizhong Han
for: This study aims to develop a method that detects multimodal deepfakes while also handling missing-modality cases.
methods: The method uses a unified audio-visual detection framework that extracts correlated features across audio and video, together with a dual-label detection approach that supports independent detection of each modality.
results: Experiments show that the method not only performs strongly on all three audio-visual datasets but also achieves satisfactory performance under missing-modality conditions; moreover, it even surpasses the joint use of two unimodal methods when a modality is missing.
Abstract
As AI-generated content (AIGC) thrives, Deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification and require the co-occurrence of both modalities. However, in real-world multimodal applications, missing-modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector cannot always know the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we propose a unified framework for fake-modality-agnostic scenarios that enables the detection of multimodal deepfakes and handles missing-modality cases, no matter whether the manipulation is hidden in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we choose audio-visual speech recognition (AVSR) as a preceding task, which effectively extracts speech correlations across modalities that are difficult for deepfakes to reproduce. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments show that our scheme not only outperforms other state-of-the-art binary detection methods across all three audio-visual datasets but also achieves satisfactory performance in detecting modality-agnostic audio/video fakes. Moreover, it even surpasses the joint use of two unimodal methods in the presence of missing modalities.
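The paper's exact architecture is not given here, so the sketch below only illustrates the dual-label idea in general terms: one real/fake head per modality on top of features assumed to come from an AVSR-style front end, so each modality is judged independently and a missing modality can simply be skipped. Module names, feature dimensions, and the training loss are hypothetical.

```python
import torch
import torch.nn as nn

class DualLabelDetector(nn.Module):
    """Illustrative dual-label detector: one real/fake head per modality.

    `audio_feat` / `video_feat` stand in for features produced by an
    AVSR-style front end (an assumption, not the paper's exact design).
    Either input may be None to model a missing-modality case.
    """
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.audio_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.video_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio_feat=None, video_feat=None):
        out = {}
        if audio_feat is not None:
            out["audio_fake_logit"] = self.audio_head(audio_feat).squeeze(-1)
        if video_feat is not None:
            out["video_fake_logit"] = self.video_head(video_feat).squeeze(-1)
        return out

# Hypothetical usage with per-modality labels (1 = fake, 0 = real).
model = DualLabelDetector()
audio_feat, video_feat = torch.randn(4, 256), torch.randn(4, 256)
labels = {"audio": torch.tensor([0., 1., 0., 1.]), "video": torch.tensor([0., 0., 1., 1.])}
out = model(audio_feat, video_feat)          # both modalities present
loss = (nn.functional.binary_cross_entropy_with_logits(out["audio_fake_logit"], labels["audio"])
        + nn.functional.binary_cross_entropy_with_logits(out["video_fake_logit"], labels["video"]))
audio_only = model(audio_feat=audio_feat)    # missing-video case: only the audio head is used
```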
Single Channel Speech Enhancement Using U-Net Spiking Neural Networks
results: The proposed energy-efficient SNN model outperforms the Intel N-DNS Challenge baseline solution for neuromorphic hardware and achieves acceptable performance under different signal-to-noise ratios and real-world noise conditions.
Abstract
Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems. Although conventional artificial neural networks (ANN) have demonstrated remarkable performance in SE, they require significant computational power, along with high energy costs. In this paper, we propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture. SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware. SNNs are thus interesting candidates for real-time applications on devices with limited resources. The primary objective of the current work is to develop an SNN-based model with comparable performance to a state-of-the-art ANN model for SE. We train a deep SNN using surrogate-gradient-based optimization and evaluate its performance using perceptual objective tests under different signal-to-noise ratios and real-world noise conditions. Our results demonstrate that the proposed energy-efficient SNN model outperforms the Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge) baseline solution and achieves acceptable performance compared to an equivalent ANN model.
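Surrogate-gradient training works by keeping the hard spike threshold in the forward pass while substituting a smooth derivative in the backward pass. The sketch below shows a generic surrogate spike function of this kind (a fast-sigmoid surrogate, in the style of common SNN tutorials); it is only an illustration of the training trick, not the paper's neuron model or U-Net architecture, and the steepness constant is an assumed hyperparameter.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate gradient in the backward pass.

    Generic surrogate-gradient building block, shown only to illustrate how a
    deep SNN can be trained with backpropagation.
    """
    scale = 10.0  # steepness of the surrogate; an assumed hyperparameter

    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()          # binary spikes

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        # The non-differentiable step is replaced by the derivative of a fast sigmoid.
        surrogate = 1.0 / (SurrogateSpike.scale * u.abs() + 1.0) ** 2
        return grad_output * surrogate

spike_fn = SurrogateSpike.apply

# Hypothetical usage: spikes remain differentiable end to end.
u = torch.randn(8, 16, requires_grad=True)   # membrane potentials
spikes = spike_fn(u)
spikes.sum().backward()
print(u.grad.shape)                          # torch.Size([8, 16])
```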
WavJourney: Compositional Audio Creation with Large Language Models
paper_authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
for: This paper investigates how large language models (LLMs) can be used to create audio content with storylines encompassing speech, music, and sound effects.
methods: The paper uses LLMs to connect various audio models so that audio content can be created from text descriptions. Specifically, the LLMs first generate a structured audio script that contains the required audio elements and their spatio-temporal relationships. The script is then passed to a script compiler, which converts it into a computer program; each line of the program calls a task-specific audio generation model or a computational operation function (e.g., concatenate, mix). Finally, the program is executed to generate the audio content.
results: The paper demonstrates the practicality of WavJourney in diverse real-world scenarios, including science fiction, education, and radio play. WavJourney's explainable and interactive design improves creative control and adaptability in human-machine co-creation over multi-round dialogues. WavJourney provides a new platform for audio content creation, opening up new possibilities for multimedia content creation.
Abstract
Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
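To make the script-compiler idea concrete, here is a minimal sketch of how a structured audio script could be compiled into calls to task-specific generators and composition operations (concatenate, mix). The script format, the generator stubs, and all names are assumptions for illustration; WavJourney's actual script schema and models are not reproduced here, and the stubs return silence instead of invoking real TTS/music/sound-effect models.

```python
import numpy as np

SR = 16000  # assumed sample rate

# Placeholder generators standing in for task-specific audio models (TTS,
# music, sound effects); in a real system these would be model calls.
def speech(text, duration):  return np.zeros(int(duration * SR))
def music(desc, duration):   return np.zeros(int(duration * SR))
def sfx(desc, duration):     return np.zeros(int(duration * SR))

def mix(*clips):
    longest = max(len(c) for c in clips)
    out = np.zeros(longest)
    for c in clips:
        out[:len(c)] += c           # overlay clips that play simultaneously
    return out

def concatenate(*clips):
    return np.concatenate(clips)    # play clips one after another

GENERATORS = {"speech": speech, "music": music, "sfx": sfx}

def compile_and_run(audio_script):
    """Turn a structured audio script into generator calls (illustrative only).

    `audio_script` is a hypothetical format: a list of scenes, where each scene
    is a list of elements {"type", "text"/"desc", "duration"} that play together.
    Elements within a scene are mixed; scenes are concatenated in order.
    """
    scenes = []
    for scene in audio_script:
        clips = [GENERATORS[e["type"]](e.get("text") or e.get("desc"), e["duration"]) for e in scene]
        scenes.append(mix(*clips))
    return concatenate(*scenes)

# Hypothetical usage
script = [
    [{"type": "music", "desc": "calm ambient pad", "duration": 4.0},
     {"type": "speech", "text": "Welcome aboard the starship.", "duration": 3.0}],
    [{"type": "sfx", "desc": "airlock hiss", "duration": 1.5}],
]
audio = compile_and_run(script)
print(audio.shape)  # (88000,) = (4.0 + 1.5) seconds at 16 kHz
```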