results: The paper evaluates and compares various audio spoofing detection methods, and raises open problems and future research directions, including the reliability and scalability of spoofing detection systems.
Abstract
Audio has become an increasingly crucial biometric modality due to its ability to provide an intuitive way for humans to interact with machines. It is currently used in a range of applications, from person authentication and banking to virtual assistants. Research has shown that these systems are also susceptible to spoofing attacks. Therefore, protecting audio processing systems against fraudulent activities, such as identity theft, financial fraud, and spreading misinformation, is of paramount importance. This paper reviews the current state-of-the-art techniques for detecting audio spoofing and discusses the current challenges along with open research problems. The paper further highlights the importance of considering the ethical and privacy implications of audio spoofing detection systems. Lastly, the work aims to accentuate the need for building more robust and generalizable methods, the integration of automatic speaker verification and countermeasure systems, and better evaluation protocols.
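To make the last point, the integration of automatic speaker verification (ASV) and countermeasure (CM) systems, more concrete, here is a minimal sketch of one common integration pattern: a tandem decision that accepts a trial only if both subsystems pass. The function name, score conventions, and thresholds are illustrative assumptions, not a method taken from the paper.

```python
# Illustrative tandem decision combining a spoofing countermeasure (CM) with an
# automatic speaker verification (ASV) system. Scores and thresholds are
# hypothetical; real systems calibrate them jointly on development data.

def tandem_decision(asv_score: float, cm_score: float,
                    asv_threshold: float = 0.5,
                    cm_threshold: float = 0.5) -> bool:
    """Accept a trial only if the CM judges it bona fide AND the ASV system
    judges it to be the claimed target speaker."""
    is_bonafide = cm_score >= cm_threshold   # spoofing countermeasure gate
    is_target = asv_score >= asv_threshold   # speaker verification gate
    return is_bonafide and is_target

# A spoofed utterance can fool the ASV gate yet still be rejected by the CM gate:
# tandem_decision(asv_score=0.9, cm_score=0.2) -> False
```

Evaluation protocols such as the tandem detection cost function (t-DCF) used in the ASVspoof challenges score exactly this kind of combined system, which connects to the paper's call for better evaluation protocols.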
An Improved Metric of Informational Masking for Perceptual Audio Quality Measurement
for: The paper is written to develop and improve perceptual audio quality measurement systems that can accurately estimate the perceived quality of audio signals processed by perceptual audio codecs.
methods: The paper uses models of disturbance audibility and cognitive effects to predict perceived quality degradation in audio signals. Specifically, it proposes an improved model of informational masking (IM) that considers the complexity of disturbance information around the masking threshold.
results: The proposed IM metric is shown to outperform previously proposed IM metrics in a validation task against subjective quality scores from large and diverse listening test databases. Additionally, the proposed system demonstrated improved quality prediction for music signals coded with bandwidth extension techniques, where other models frequently fail.
Abstract
Perceptual audio quality measurement systems algorithmically analyze the output of audio processing systems to estimate possible perceived quality degradation using perceptual models of human audition. In this manner, they save the time and resources associated with the design and execution of listening tests (LTs). Models of disturbance audibility predicting peripheral auditory masking have considerably increased the subjective quality prediction performance of measurement systems for signals processed by perceptual audio codecs. Additionally, cognitive effects are also known to regulate perceived distortion severity by influencing distortion salience. However, the performance gains due to cognitive effect models in quality measurement systems have so far been inconsistent, particularly for music signals. Firstly, this paper presents an improved model of informational masking (IM) -- an important cognitive effect in quality perception -- that considers disturbance information complexity around the masking threshold. Secondly, we incorporate the proposed IM metric into a quality measurement system using a novel interaction analysis procedure between cognitive effects and distortion metrics. The procedure establishes interactions between cognitive effects and distortion metrics using LT data. The proposed IM metric is shown to outperform previously proposed IM metrics in a validation task against subjective quality scores from large and diverse LT databases. In particular, the proposed system showed improved quality prediction for music signals coded with bandwidth extension techniques, where other models frequently fail.
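The abstract describes, but does not spell out, how disturbance complexity near the masking threshold could modulate a distortion metric. The toy sketch below is only an assumption-laden illustration of that idea: spectral entropy serves as a stand-in for "disturbance information complexity" and attenuates a supra-threshold distortion measure. The function names, the entropy proxy, and the weighting are assumptions, not the paper's actual model.

```python
import numpy as np

def spectral_entropy(power: np.ndarray) -> float:
    """Normalized spectral entropy in [0, 1]; higher values indicate a more
    complex, noise-like distribution of energy across bands."""
    p = power / (power.sum() + 1e-12)
    h = -(p * np.log2(p + 1e-12)).sum()
    return float(h / np.log2(len(p)))

def im_weighted_distortion(disturbance_power: np.ndarray,
                           masking_threshold: np.ndarray,
                           alpha: float = 0.5) -> float:
    """Attenuate an audibility-based distortion measure when the disturbance
    around the masking threshold is complex (a crude informational-masking
    weight; illustrative only). Inputs are per-band energies for one frame."""
    # Supra-threshold (audible) disturbance energy per band.
    audible = np.maximum(disturbance_power - masking_threshold, 0.0)
    base_distortion = audible.sum()
    # Complexity of the disturbance energy at or below the threshold.
    near_threshold = np.minimum(disturbance_power, masking_threshold)
    complexity = spectral_entropy(near_threshold)
    # Higher complexity -> stronger informational masking -> lower salience.
    return float(base_distortion * (1.0 - alpha * complexity))
```

In a full system such weights would be computed per frame from a peripheral auditory model and then related to listening-test scores, in the spirit of the interaction analysis the abstract mentions.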
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
results: The study found that the sentences corrected by the LLMs frequently resulted in higher Word Error Rates (WER), indicating that leveraging LLMs' in-context learning for speech applications remains challenging.
Abstract
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLM's in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions is still a challenging task at the current stage.
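The core of the evaluation the abstract describes is comparing word error rate (WER) before and after LLM correction. Below is a minimal sketch of that loop; the WER computation is the standard word-level edit distance, while `correct_with_llm` is a hypothetical placeholder for the in-context-learning call (e.g., to ChatGPT or GPT-4), not the paper's actual prompt or API code.

```python
# Sketch: measure WER of ASR hypotheses before and after LLM-based correction.
# Only the WER computation is concrete; `correct_with_llm` is a placeholder.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def correct_with_llm(asr_hypothesis: str) -> str:
    """Hypothetical placeholder: prompt an LLM with a few in-context examples
    and ask it to fix likely recognition errors in the hypothesis."""
    raise NotImplementedError

def compare_wer(references, asr_hypotheses):
    """Average per-utterance WER of the raw and the LLM-corrected hypotheses."""
    raw = sum(wer(r, h) for r, h in zip(references, asr_hypotheses))
    fixed = sum(wer(r, correct_with_llm(h)) for r, h in zip(references, asr_hypotheses))
    n = len(references)
    return raw / n, fixed / n
```

The paper's observation is that the second number is often the larger one, i.e., the corrected hypotheses frequently drift further from the references.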
Feature Embeddings from Large-Scale Acoustic Bird Classifiers Enable Few-Shot Transfer Learning
paper_authors: Burooj Ghani, Tom Denton, Stefan Kahl, Holger Klinck
for: The goal of this study is to aid the understanding and protection of marine and terrestrial animals and their habitats across extensive spatiotemporal scales.
methods: The study uses deep learning models to classify bioacoustic data.
results: The study finds that feature embeddings extracted from large-scale audio classification models can be used to classify a variety of bioacoustic signal types, including bird, bat, marine mammal, and amphibian calls. These embeddings provide high-quality classification results even when training data is scarce.
Abstract
Automated bioacoustic analysis aids understanding and protection of both marine and terrestrial animals and their habitats across extensive spatiotemporal scales, and typically involves analyzing vast collections of acoustic data. With the advent of deep learning models, classification of important signals from these datasets has markedly improved. These models power critical data analyses for research and decision-making in biodiversity monitoring, animal behaviour studies, and natural resource management. However, deep learning models are often data-hungry and require a significant amount of labeled training data to perform well. While sufficient training data is available for certain taxonomic groups (e.g., common bird species), many classes (such as rare and endangered species, many non-bird taxa, and call types) lack enough data to train a robust model from scratch. This study investigates the utility of feature embeddings extracted from large-scale audio classification models to identify bioacoustic classes other than the ones these models were originally trained on. We evaluate models on diverse datasets, including different bird calls and dialect types, bat calls, marine mammal calls, and amphibian calls. The embeddings extracted from the models trained on bird vocalization data consistently allowed higher quality classification than the embeddings trained on general audio datasets. The results of this study indicate that high-quality feature embeddings from large-scale acoustic bird classifiers can be harnessed for few-shot transfer learning, enabling the learning of new classes from a limited quantity of training data. Our findings reveal the potential for efficient analyses of novel bioacoustic tasks, even in scenarios where available training data is limited to a few samples.
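A minimal sketch of the few-shot transfer recipe the abstract points to: compute fixed embeddings of labeled clips with a pretrained bird classifier and fit a small linear classifier (a "linear probe") on them. The `extract_embedding` function is a hypothetical placeholder for whichever pretrained model is used; the scikit-learn probe and the overall shape of the recipe are the only concrete parts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_embedding(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Hypothetical placeholder: run a pretrained large-scale acoustic bird
    classifier on the clip and return a fixed-length embedding (e.g., its
    penultimate-layer activations)."""
    raise NotImplementedError

def train_few_shot_classifier(waveforms, labels, sample_rate=32000):
    """Fit a linear probe on frozen embeddings from a handful of labeled clips
    per class (bat calls, marine mammal calls, amphibian calls, ...)."""
    features = np.stack([extract_embedding(w, sample_rate) for w in waveforms])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, labels)
    return probe

# Usage sketch:
# probe = train_few_shot_classifier(clips, species_labels)
# preds = probe.predict(np.stack([extract_embedding(c, 32000) for c in new_clips]))
```

Because only the lightweight probe is trained, a few labeled examples per class can suffice, which is the scenario the study evaluates.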