paper_authors: Shu Wang, Kun Sun, Qi Li
for: This paper aims to address the challenge of detecting harmful content in audio and video on social media platforms, specifically the vulnerability of automatic speech recognition (ASR) systems to spectrum reduction attacks.
methods: The proposed solution is an acoustic compensation system named ACE, which leverages two key observations: frequency component dependencies and perturbation sensitivity. ACE combines frequency-based compensation and over-the-air perturbations to counter spectrum reduction attacks and improve the accuracy of ASR systems.
results: The experiments show that ACE can effectively reduce up to 87.9% of ASR inference errors caused by spectrum reduction attacks. Additionally, the paper identifies six general types of ASR inference errors and investigates their causes and potential mitigation solutions.
Abstract
Automatic speech recognition (ASR) provides diverse audio-to-text services for humans to communicate with machines. However, recent research reveals ASR systems are vulnerable to various malicious audio attacks. In particular, by removing the non-essential frequency components, a new spectrum reduction attack can generate adversarial audios that can be perceived by humans but cannot be correctly interpreted by ASR systems. It raises a new challenge for content moderation solutions to detect harmful content in audio and video available on social media platforms. In this paper, we propose an acoustic compensation system named ACE to counter the spectrum reduction attacks over ASR systems. Our system design is based on two observations, namely, frequency component dependencies and perturbation sensitivity. First, since the Discrete Fourier Transform computation inevitably introduces spectral leakage and aliasing effects to the audio frequency spectrum, the frequency components with similar frequencies will have a high correlation. Thus, considering the intrinsic dependencies between neighboring frequency components, it is possible to recover more of the original audio by compensating for the removed components based on the remaining ones. Second, since the removed components in the spectrum reduction attacks can be regarded as an inverse of adversarial noise, the attack success rate will decrease when the adversarial audio is replayed in an over-the-air scenario. Hence, we can model the acoustic propagation process to add over-the-air perturbations into the attacked audio. We implement a prototype of ACE and the experiments show ACE can effectively reduce up to 87.9% of ASR inference errors caused by spectrum reduction attacks. Also, by analyzing residual errors, we summarize six general types of ASR inference errors and investigate the error causes and potential mitigation solutions.
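To make the compensation idea concrete, the sketch below fills attacker-removed FFT bins by interpolating from the surviving neighboring bins, exploiting the correlation that spectral leakage induces between nearby components. It is a minimal illustration under our own assumptions (whole-signal FFT, a known mask of removed bins, linear interpolation of magnitude and phase), not ACE's actual compensation algorithm.

```python
import numpy as np

def compensate_removed_bins(signal, removed_mask, n_fft=512):
    """Toy compensation: estimate attacker-removed FFT bins by linear
    interpolation of magnitude and phase from the surviving neighbors."""
    spectrum = np.fft.rfft(signal, n=n_fft)
    kept = np.where(~removed_mask)[0]
    removed = np.where(removed_mask)[0]
    mag = np.interp(removed, kept, np.abs(spectrum[kept]))
    phase = np.interp(removed, kept, np.unwrap(np.angle(spectrum[kept])))
    spectrum[removed] = mag * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=n_fft)

# Toy usage: pretend the attack stripped bins 40-59 from a 512-sample frame.
frame = np.random.default_rng(0).standard_normal(512)
removed = np.zeros(257, dtype=bool)    # rfft of 512 samples -> 257 bins
removed[40:60] = True
compensated = compensate_removed_bins(frame, removed)
```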
results: Compared to a baseline system using regression over mean scores, we observe lower outlier ratios (OR) and obtain easy access to the prediction of confidence intervals (CI). The introduction of data augmentation techniques improves both CI prediction accuracy and mean-score prediction accuracy.
Abstract
We show how a neural network can be trained on individual intrusive listening test scores to predict a distribution of scores for each pair of reference and coded input stereo or binaural signals. We nickname this method the Generative Machine Listener (GML), as it is capable of generating an arbitrary amount of simulated listening test data. Compared to a baseline system using regression over mean scores, we observe lower outlier ratios (OR) for the mean score predictions, and obtain easy access to the prediction of confidence intervals (CI). The introduction of data augmentation techniques from the image domain results in a significant increase in CI prediction accuracy as well as Pearson and Spearman rank correlation of mean scores.
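One plausible way to realize the score-distribution idea, sketched below under our own assumptions rather than the paper's exact architecture, is to have a network predict a Gaussian mean and variance per reference/coded pair and fit it to individual listener scores with a negative log-likelihood loss; the feature extractor, dimensions, and score range are placeholders.

```python
import torch
import torch.nn as nn

class ScoreDistributionHead(nn.Module):
    """Illustrative head: maps pooled features of a reference/coded pair to
    the mean and variance of a listening-test score distribution."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, 1)
        self.log_var = nn.Linear(64, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mean(h).squeeze(-1), self.log_var(h).squeeze(-1).exp()

model = ScoreDistributionHead()
nll = nn.GaussianNLLLoss()

features = torch.randn(8, 128)          # stand-in for pair features
listener_scores = torch.rand(8) * 100   # individual (not averaged) scores
mu, var = model(features)
loss = nll(mu, listener_scores, var)    # fit per-listener scores
loss.backward()

# Once trained, sampling from N(mu, var) yields simulated listening test data,
# and the predicted variance gives direct access to confidence intervals.
```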
Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model
paper_authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
for: This study aims to investigate the impact of different sampling techniques on musical qualities such as diversity and structure in melodies generated by a transformer language model.
results: The study finds that probability truncation techniques may restrict diversity and structural patterns under optimal circumstances, but may also produce more musical samples under suboptimal circumstances.
Abstract
Research in natural language processing has demonstrated that the quality of generations from trained autoregressive language models is significantly influenced by the used sampling strategy. In this study, we investigate the impact of different sampling techniques on musical qualities such as diversity and structure. To accomplish this, we train a high-capacity transformer model on a vast collection of highly-structured Irish folk melodies and analyze the musical qualities of the samples generated using distribution truncation sampling techniques. Specifically, we use nucleus sampling, the recently proposed "typical sampling", and conventional ancestral sampling. We evaluate the effect of these sampling strategies in two scenarios: optimal circumstances with a well-calibrated model and suboptimal circumstances where we systematically degrade the model's performance. We assess the generated samples using objective and subjective evaluations. We discover that probability truncation techniques may restrict diversity and structural patterns in optimal circumstances, but may also produce more musical samples in suboptimal circumstances.
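For reference, the sketch below implements one of the truncation strategies studied, nucleus (top-p) sampling, over a generic next-token distribution; the melody model itself is abstracted away and the toy logits are purely illustrative.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample one token id, keeping only the smallest set of tokens whose
    cumulative probability covers p (nucleus / top-p truncation)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

# Toy usage with a fake next-note distribution over 8 pitch tokens.
logits = np.array([2.0, 1.5, 1.0, 0.2, -1.0, -2.0, -2.5, -3.0])
next_token = nucleus_sample(logits, p=0.9)
```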
TrOMR: Transformer-Based Polyphonic Optical Music Recognition
results: Extensive experiments show that TrOMR outperforms existing OMR methods in real-world scenarios, especially for complex music scores. In addition, the authors develop a TrOMR system and a camera-scene dataset to enable comprehensive evaluation and reproducibility.
Abstract
Optical Music Recognition (OMR) is an important technology in music and has been researched for a long time. Previous approaches for OMR are usually based on CNN for image understanding and RNN for music symbol classification. In this paper, we propose a transformer-based approach with excellent global perceptual capability for end-to-end polyphonic OMR, called TrOMR. We also introduce a novel consistency loss function and a reasonable approach for data annotation to improve recognition accuracy for complex music scores. Extensive experiments demonstrate that TrOMR outperforms current OMR methods, especially in real-world scenarios. We also develop a TrOMR system and build a camera scene dataset for full-page music scores in real-world. The code and datasets will be made available for reproducibility.
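The abstract does not spell out the architecture, so the sketch below is only a generic image-to-symbol-sequence transformer (CNN patch encoder plus autoregressive transformer decoder) meant to illustrate the end-to-end formulation; layer sizes, the symbol vocabulary, and the omission of positional encodings are our simplifications, not TrOMR's design.

```python
import torch
import torch.nn as nn

class ImageToSymbolTransformer(nn.Module):
    """Generic sketch: a CNN turns the score image into a sequence of patch
    features and a transformer decoder emits music-symbol tokens autoregressively.
    Positional encodings are omitted for brevity."""
    def __init__(self, vocab_size=256, d_model=256):
        super().__init__()
        self.patch_encoder = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens):
        memory = self.patch_encoder(image).flatten(2).transpose(1, 2)  # (B, N, d)
        tgt = self.token_embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)

model = ImageToSymbolTransformer()
image = torch.randn(2, 1, 128, 512)        # grayscale score snippets
tokens = torch.randint(0, 256, (2, 20))    # shifted target symbol ids
logits = model(image, tokens)              # (2, 20, 256)
```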
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
methods: General speech knowledge is learned from a high-resource language through speech unit prediction, and language-specific knowledge is then learned with a Language-specific Memory-augmented Decoder (LMDecoder).
results: Extensive experiments on five languages (English, Spanish, French, Italian, and Portuguese) demonstrate that the proposed method effectively improves the performance of lip reading models.
Abstract
This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.
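As one reading of the memory-augmentation idea, the sketch below lets speech-unit features attend over a learned bank of language-specific audio features; the slot count, dimensions, and attention formulation are our assumptions and not necessarily the paper's LMDecoder.

```python
import torch
import torch.nn as nn

class MemoryAugmentedLookup(nn.Module):
    """Illustrative memory-bank lookup: speech-unit features query a learned
    bank of language-specific audio features via attention."""
    def __init__(self, feat_dim=256, memory_slots=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, speech_unit_feats):                  # (B, T, feat_dim)
        bank = self.memory.unsqueeze(0).expand(speech_unit_feats.size(0), -1, -1)
        # Queries come from the (language-agnostic) speech units; keys and
        # values come from the language-specific memory bank.
        out, _ = self.attn(speech_unit_feats, bank, bank)
        return out                                         # language-specific features

lookup = MemoryAugmentedLookup()
units = torch.randn(2, 50, 256)   # features predicted from lip video
audio_like = lookup(units)        # to be decoded into text downstream
```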
Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms
results: Achieves state-of-the-art performance on the ASVspoof2019 LA Challenge with an EER of 0.77%.
Abstract
Robust audio anti-spoofing has been increasingly challenging due to the recent advancements on deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, complementary information presented in multi-order spectral patterns have not been well explored, which limits their effectiveness for varying spoofing attacks. Therefore, we propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations. Specifically, spectral patterns up to second-order are fused in a coarse-to-fine manner and two branches are designed for the fine-level fusion from the spectral and temporal contexts. A reconstruction from the fused representation to the input spectrograms further reduces the potential fused information loss. Our method achieved the state-of-the-art performance with an EER of 0.77% on a widely used dataset: ASVspoof2019 LA Challenge.
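The sketch below illustrates the general fuse-then-reconstruct pattern: two spectral views are encoded and fused for classification, and the fused code is also decoded back to the inputs to penalize information loss. Layer shapes and the reconstruction target are assumptions, not S2pecNet's exact design.

```python
import torch
import torch.nn as nn

class FusionReconstructionSketch(nn.Module):
    """Sketch of fuse-then-reconstruct anti-spoofing: encode two spectral views,
    fuse them, classify bona fide vs. spoof, and reconstruct the inputs from the
    fused code so that fusion loses less information."""
    def __init__(self, channels=16):
        super().__init__()
        self.enc_a = nn.Conv2d(1, channels, 3, padding=1)
        self.enc_b = nn.Conv2d(1, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.classifier = nn.Linear(channels, 2)
        self.recon = nn.Conv2d(channels, 2, 3, padding=1)  # reconstruct both views

    def forward(self, spec_a, spec_b):
        fused = self.fuse(torch.cat([self.enc_a(spec_a), self.enc_b(spec_b)], dim=1))
        logits = self.classifier(fused.mean(dim=(2, 3)))
        recon = self.recon(fused)
        recon_loss = nn.functional.mse_loss(recon, torch.cat([spec_a, spec_b], dim=1))
        return logits, recon_loss

model = FusionReconstructionSketch()
spec_a = torch.randn(4, 1, 257, 100)   # e.g. a first-order spectrogram
spec_b = torch.randn(4, 1, 257, 100)   # e.g. a second-order spectral view
logits, recon_loss = model(spec_a, spec_b)
```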
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
paper_authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
for: This work focuses on leveraging foundation models (FMs) to solve the cross-modal generation problem of turning visual input into audio output.
methods: The study uses three foundation models, CLIP, CLAP, and AudioLDM. It designs a simple yet effective mapping mechanism (V2A-Mapper) to bridge the domain gap, and uses the pretrained audio generative FM AudioLDM to produce high-fidelity audio output.
results: Compared with existing methods, the proposed approach is trained with 86% fewer parameters yet improves the FD and CS evaluation metrics by 53% and 19%, respectively.
Abstract
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
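A minimal sketch of the regression-mapper variant is given below: a small MLP trained to map a frozen CLIP image embedding to the CLAP embedding of the paired audio. The embedding sizes, MLP shape, and MSE objective are assumptions for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

# Embedding sizes are assumptions; real CLIP/CLAP variants may differ.
CLIP_DIM, CLAP_DIM = 512, 512

class RegressionMapper(nn.Module):
    """Sketch of a regression mapper: an MLP translating a CLIP image
    embedding into the CLAP audio-embedding space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CLIP_DIM, 1024), nn.GELU(), nn.Linear(1024, CLAP_DIM))

    def forward(self, clip_embedding):
        return self.net(clip_embedding)

mapper = RegressionMapper()
clip_emb = torch.randn(8, CLIP_DIM)     # from a frozen CLIP image encoder
clap_target = torch.randn(8, CLAP_DIM)  # CLAP embedding of the paired audio
loss = nn.functional.mse_loss(mapper(clip_emb), clap_target)
loss.backward()
# The predicted CLAP-space embedding would then condition a pretrained
# audio generator such as AudioLDM to synthesize the sound.
```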
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
paper_authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto
for: This paper addresses finding suitable sound effects (SFX) to match moments in a video, using a video frame directly as the query to retrieve high-quality (HQ) SFX.
methods: The paper uses a multimodal framework that leverages large language models and foundational vision-language models to bridge HQ audio and video, creating a highly scalable automatic audio-visual data curation pipeline. It also uses pretrained audio and visual encoders to train a contrastive learning-based retrieval system.
results: The paper shows that the system trained with the automatic data curation pipeline and contrastive learning retrieves HQ audio effectively and generalizes from clean to in-the-wild data. In a user study, its retrieved SFX were preferred 67% of the time.
Abstract
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
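The retrieval objective can be illustrated with a standard symmetric InfoNCE loss over a batch of paired frame and SFX embeddings, as sketched below; the encoder choices, embedding size, and temperature are placeholders rather than the authors' exact training recipe.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over paired video-frame and SFX embeddings: matching
    pairs sit on the diagonal of the similarity matrix and are pulled together,
    all other pairs in the batch are pushed apart."""
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = video_emb @ audio_emb.t() / temperature     # (B, B) similarities
    targets = torch.arange(video_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: at retrieval time, SFX are ranked by cosine similarity to the frame.
video_emb = torch.randn(16, 256)   # from a pretrained visual encoder
audio_emb = torch.randn(16, 256)   # from a pretrained audio encoder
loss = symmetric_contrastive_loss(video_emb, audio_emb)
```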