cs.SD - 2023-09-11

Natural Language Supervision for General-Purpose Audio Representations

  • paper_url: http://arxiv.org/abs/2309.05767
  • repo_url: https://github.com/microsoft/clap
  • paper_authors: Benjamin Elizalde, Soham Deshmukh, Huaming Wang
  • for: This paper proposes a model that jointly learns audio and text representations to enable zero-shot inference.
  • methods: The model uses two innovative encoders to learn audio and language representations, and contrastive learning brings the two into a joint multimodal space (a toy sketch of this objective follows the abstract below).
  • results: The model performs strongly across 26 downstream tasks, achieving state-of-the-art results on several of them and paving the way toward general-purpose audio representations.
    Abstract Audio-Language models jointly learn multimodal text and audio representations that enable Zero-Shot inference. Models rely on the encoders to create powerful representations of the input and generalize to multiple tasks ranging from sounds, music, and speech. Although models have achieved remarkable performance, there is still a performance gap with task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained with a diverse collection of 4.6M audio-text pairs employing two innovative encoders for Zero-Shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks, instead of the standard training of sound event classification. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. Then, the audio and language representations are brought into a joint multimodal space using Contrastive Learning. We used our encoders to improve the downstream performance by a margin. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest in the literature. Our model achieves state of the art results in several tasks leading the way towards general-purpose audio representations.
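At its core, the method uses a CLIP-style symmetric contrastive objective to pull paired audio and caption embeddings together in a shared space. The snippet below is a minimal sketch of such a loss; the batch construction, embedding size, and temperature are illustrative placeholders, not the paper's configuration.

```python
# Minimal sketch of a symmetric audio-text contrastive (InfoNCE) objective.
# Encoder outputs are faked with random tensors; sizes are placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Pull matching audio/text pairs together, push non-matching pairs apart."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matching pairs lie on the diagonal; penalize both retrieval directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage with stand-ins for the audio encoder and decoder-only text model.
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(contrastive_loss(audio_emb, text_emb))
```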

Kernel Interpolation of Incident Sound Field in Region Including Scattering Objects

  • paper_url: http://arxiv.org/abs/2309.05634
  • repo_url: None
  • paper_authors: Shoichi Koyama, Masaki Nakada, Juliano G. C. Ribeiro, Hiroshi Saruwatari
  • for: Estimating the incident sound field inside a region that contains scattering objects.
  • methods: The method is based on kernel ridge regression of the incident field, with the scattered field separated out through a spherical wave function expansion, eliminating the need for prior knowledge or measurements of the scattering objects (a generic kernel-interpolation sketch follows the abstract below).
  • results: Experiments show that the proposed method estimates the incident sound field more accurately than kernel ridge regression without the separation.
    Abstract A method for estimating the incident sound field inside a region containing scattering objects is proposed. The sound field estimation method has various applications, such as spatial audio capturing and spatial active noise control; however, most existing methods do not take into account the presence of scatterers within the target estimation region. Although several techniques exist that employ knowledge or measurements of the properties of the scattering objects, it is usually difficult to obtain them precisely in advance, and their properties may change during the estimation process. Our proposed method is based on the kernel ridge regression of the incident field, with a separation from the scattering field represented by a spherical wave function expansion, thus eliminating the need for prior modeling or measurements of the scatterers. Moreover, we introduce a weighting matrix to induce smoothness of the scattering field in the angular direction, which alleviates the effect of the truncation order of the expansion coefficients on the estimation accuracy. Experimental results indicate that the proposed method achieves a higher level of estimation accuracy than the kernel ridge regression without separation.
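For context, kernel ridge regression with the zeroth-order spherical Bessel kernel is a common way to interpolate a source-free interior sound field from microphone measurements. The sketch below shows only that generic baseline; the paper's contributions, separating the scattered field with a spherical wave function expansion and adding an angular smoothness weighting matrix, are not reproduced here, and all positions and parameters are illustrative.

```python
# Generic kernel ridge interpolation of an interior sound field (baseline only).
import numpy as np
from scipy.special import spherical_jn

def bessel_kernel(r1, r2, k):
    """Kernel j0(k * ||r1 - r2||) between two sets of 3-D points."""
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)
    return spherical_jn(0, k * d)

def estimate_field(mic_pos, mic_pressure, eval_pos, k, reg=1e-3):
    """Estimate complex pressure at eval_pos from measurements at mic_pos."""
    K = bessel_kernel(mic_pos, mic_pos, k)                 # Gram matrix (M, M)
    alpha = np.linalg.solve(K + reg * np.eye(len(mic_pos)), mic_pressure)
    return bessel_kernel(eval_pos, mic_pos, k) @ alpha

# Toy usage: a plane wave at 500 Hz sampled by 32 microphones in a 1 m region.
rng = np.random.default_rng(0)
k = 2 * np.pi * 500 / 343.0
mics = rng.uniform(-0.5, 0.5, size=(32, 3))
p = np.exp(-1j * k * mics @ np.array([1.0, 0.0, 0.0]))    # incident plane wave
grid = rng.uniform(-0.5, 0.5, size=(5, 3))
print(estimate_field(mics, p, grid, k))
```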

Undecidability Results and Their Relevance in Modern Music Making

  • paper_url: http://arxiv.org/abs/2309.05595
  • repo_url: None
  • paper_authors: Halley Young
  • for: This study examines the intersection of computational theory and music, exploring the significant yet overlooked implications of undecidability for modern music composition and production.
  • methods: The study takes a multidimensional approach covering five areas: the Turing completeness of Ableton, the undecidability of satisfiability in sound creation with chains of effects, the undecidability of polymeter constraints in compositions, the undecidability of satisfiability of just intonation harmony constraints, and the undecidability of "new ordering systems".
  • results: The paper provides theoretical proofs of these claims and demonstrates their practical relevance for practitioners outside theoretical computer science. Its ultimate aim is to foster a new understanding of undecidability in music, highlighting its broader applicability and potential to influence computer-assisted (and traditional) music making.
    Abstract This paper delves into the intersection of computational theory and music, examining the concept of undecidability and its significant, yet overlooked, implications within the realm of modern music composition and production. It posits that undecidability, a principle traditionally associated with theoretical computer science, extends its relevance to the music industry. The study adopts a multidimensional approach, focusing on five key areas: (1) the Turing completeness of Ableton, a widely used digital audio workstation, (2) the undecidability of satisfiability in sound creation utilizing an array of effects, (3) the undecidability of constraints on polymeters in musical compositions, (4) the undecidability of satisfiability in just intonation harmony constraints, and (5) the undecidability of "new ordering systems". In addition to providing theoretical proof for these assertions, the paper elucidates the practical relevance of these concepts for practitioners outside the field of theoretical computer science. The ultimate aim is to foster a new understanding of undecidability in music, highlighting its broader applicability and potential to influence contemporary computer-assisted (and traditional) music making.

SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

  • paper_url: http://arxiv.org/abs/2309.05396
  • repo_url: None
  • paper_authors: Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li
  • for: Improving the performance of speech recognition systems, in particular multi-modal automatic speech recognition.
  • methods: Builds the slide-enriched corpus and proposes baselines that exploit the text on synchronized slides through keyword extraction and contextual ASR methods (a toy keyword-biasing sketch follows the abstract below).
  • results: Benchmark experiments show that incorporating textual information from the video slides improves speech recognition performance.
    Abstract Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of extra supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.
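The general idea behind the baseline is to pull keywords from the slide text that accompanies a speech segment and use them to bias the choice among ASR hypotheses. The keyword extractor and rescoring rule below are simple stand-ins for illustration, not the paper's actual pipeline.

```python
# Toy slide-keyword biasing of ASR n-best hypotheses (illustrative stand-in).
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "we"}

def extract_keywords(slide_text, top_k=10):
    """Naive frequency-based keyword set from OCR'd slide text."""
    tokens = [t.lower().strip(".,:;()") for t in slide_text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_k)}

def rescore(nbest, keywords, bonus=0.5):
    """nbest: list of (hypothesis_text, log_score). Boost keyword matches."""
    def biased(hyp, score):
        hits = sum(1 for w in hyp.lower().split() if w in keywords)
        return score + bonus * hits
    return max(nbest, key=lambda hs: biased(*hs))

slide = "Contrastive Language-Audio Pretraining CLAP zero-shot audio tasks"
nbest = [("clap zero shot audio tasks", -12.3), ("clap zero shot audio task", -12.1)]
# The slide keywords tip the decision toward the hypothesis ending in "tasks".
print(rescore(nbest, extract_keywords(slide)))
```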

Towards generalisable and calibrated synthetic speech detection with self-supervised representations

  • paper_url: http://arxiv.org/abs/2309.05384
  • repo_url: None
  • paper_authors: Dan Oneata, Adriana Stan, Octavian Pascu, Elisabeta Oneata, Horia Cucu
  • for: Improving the generalisation of audio deepfake detectors in order to build reliable fake-speech detection systems.
  • methods: Pretrained self-supervised representations followed by a simple logistic regression classifier, which yields strong generalisation (a minimal sketch follows the abstract below).
  • results: On the newly introduced In-the-Wild dataset the approach reduces the equal error rate from 30% to 8%, and it produces better-calibrated models whose predictions can be used for downstream tasks such as uncertainty estimation.
    Abstract Generalisation -- the ability of a model to perform well on unseen data -- is crucial for building reliable deep fake detectors. However, recent studies have shown that the current audio deep fake models fall short of this desideratum. In this paper we show that pretrained self-supervised representations followed by a simple logistic regression classifier achieve strong generalisation capabilities, reducing the equal error rate from 30% to 8% on the newly introduced In-the-Wild dataset. Importantly, this approach also produces considerably better calibrated models when compared to previous approaches. This means that we can trust our model's predictions more and use these for downstream tasks, such as uncertainty estimation. In particular, we show that the entropy of the estimated probabilities provides a reliable way of rejecting uncertain samples and further improving the accuracy.
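A minimal sketch of the detection recipe described above: frozen self-supervised features, a logistic regression classifier, and entropy-based rejection of uncertain samples. Feature extraction is abstracted away (the paper evaluates several pretrained front-ends), and the random arrays below are placeholders for real embeddings and labels.

```python
# Self-supervised embeddings + logistic regression + entropy-based rejection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(p):
    """Entropy of predicted class probabilities, shape (N, 2)."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=1)

# X_*: (N, D) frozen SSL embeddings; y_*: 0 = bona fide, 1 = spoofed.
# Random placeholders stand in for real features and labels here.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# Reject the most uncertain samples (highest entropy) and score the rest,
# mirroring the paper's use of entropy as a rejection signal.
keep = entropy(proba) < np.quantile(entropy(proba), 0.9)
acc = (clf.predict(X_test)[keep] == y_test[keep]).mean()
print(f"accuracy on retained {keep.sum()} samples: {acc:.2f}")
```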

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach

  • paper_url: http://arxiv.org/abs/2309.05248
  • repo_url: None
  • paper_authors: Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam
  • for: This paper proposes a speaker diarization method built on large language models (LLMs) to exploit the contextual relationship between speech and text.
  • methods: The method builds on an acoustic speaker diarization system and adds lexical information from an LLM at the inference stage. The multi-modal decoding process is modeled probabilistically, and a joint acoustic and lexical beam search incorporates cues from both modalities (a simplified sketch of this joint search follows the abstract below).
  • results: Infusing lexical knowledge from the LLM into the acoustics-only diarization system improves the overall speaker-attributed word error rate (SA-WER), with up to 39.8% relative improvement over the baseline. The results indicate that LLMs supply contextual information that is complementary to acoustic models and inaccessible to acoustics-only systems based on speaker embeddings, and suggest that LLMs can similarly benefit other speech processing tasks.
    Abstract Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual cues in human dialogues. Our method builds upon an acoustic-based speaker diarization system by adding lexical information from an LLM in the inference stage. We model the multi-modal decoding process probabilistically and perform joint acoustic and lexical beam search to incorporate cues from both modalities: audio and text. Our experiments demonstrate that infusing lexical knowledge from the LLM into an acoustics-only diarization system improves overall speaker-attributed word error rate (SA-WER). The experimental results show that LLMs can provide complementary information to acoustic models for the speaker diarization task via proposed beam search decoding approach showing up to 39.8% relative delta-SA-WER improvement from the baseline system. Thus, we substantiate that the proposed technique is able to exploit contextual information that is inaccessible to acoustics-only systems which is represented by speaker embeddings. In addition, these findings point to the potential of using LLMs to improve speaker diarization and other speech processing tasks by capturing semantic and contextual cues.
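A simplified sketch of the joint acoustic/lexical decoding idea: each word receives a speaker label whose score combines an acoustic posterior (from a diarization system) with a lexical speaker-change probability (from an LLM). Both probability sources below are hypothetical stand-ins, and the interpolation scheme is illustrative only; the paper's actual probabilistic formulation, speaker handling, and LLM integration differ.

```python
# Toy joint acoustic + lexical beam search over per-word speaker labels.
import math

def lexical_change_prob(history, word):
    """Hypothetical stand-in for an LLM's P(speaker change | text context)."""
    return 0.6 if history and history[-1].endswith("?") else 0.1

def joint_beam_search(words, acoustic_post, num_speakers=2, beam=4, lam=0.5):
    """acoustic_post[i][s]: acoustic probability that word i is spoken by s."""
    beams = [([], 0.0)]                         # (speaker sequence, log score)
    for i, word in enumerate(words):
        candidates = []
        for seq, score in beams:
            for s in range(num_speakers):
                p_ac = acoustic_post[i][s]
                changed = bool(seq) and seq[-1] != s
                p_ch = lexical_change_prob(words[:i], word)
                p_lex = p_ch if changed else 1.0 - p_ch
                candidates.append(
                    (seq + [s], score + math.log(p_ac) + lam * math.log(p_lex)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

words = ["how", "are", "you?", "fine", "thanks"]
acoustic_post = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2], [0.45, 0.55], [0.4, 0.6]]
# The lexical cue after the question mark reinforces the speaker change.
print(joint_beam_search(words, acoustic_post))
```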