results: The study finds that mel spectrogram similarity is a more stable and reliable measure for evaluating the models' memorization behavior, whereas learned embedding vectors are more easily confounded by the training data. The analysis also reveals a large number of duplicated audio clips in the AudioCaps database.
Abstract
The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a doctor auscultates a patient by listening to the sounds of their organs. Using text-to-audio latent diffusion models trained on the AudioCaps dataset, we systematically analyze memorization behavior as a function of training set size. We also evaluate different retrieval metrics for evidence of training data memorization, finding the similarity between mel spectrograms to be more robust in detecting matches than learned embedding vectors. In the process of analyzing memorization in audio latent diffusion models, we also discover a large amount of duplicated audio clips within the AudioCaps database.
Abstract
The ability to generate realistic sound clips from a text description has the potential to revolutionize how we work with audio. In this work, we take an initial step toward understanding the inner workings of audio latent diffusion models by comparing their audio outputs with the training data, much as a doctor auscultates a patient by listening to the sounds of their organs. Using text-to-audio latent diffusion models trained on the AudioCaps dataset, we systematically analyze the effect of training set size and evaluate different retrieval metrics, finding mel spectrogram similarity to be the more robust way to detect matches. In the course of analyzing memorization in audio latent diffusion models, we also discover a large number of duplicated audio clips in the AudioCaps database.
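As a concrete illustration of the retrieval step described above, here is a minimal sketch of matching a generated clip against training clips by mel spectrogram similarity. It is a sketch under assumptions rather than the paper's pipeline: the sample rate, clip duration, mel resolution, file names, and flagging threshold are illustrative, and librosa and numpy are assumed to be available.

```python
# Hypothetical sketch: rank training clips by mel spectrogram similarity to a
# generated clip in order to surface potential memorization. The sample rate,
# clip duration, mel resolution, and file names are illustrative assumptions.
import librosa
import numpy as np

def log_mel(path, sr=16000, n_mels=64, duration=10.0):
    """Load an audio clip and return its log-mel spectrogram (n_mels x frames)."""
    audio, _ = librosa.load(path, sr=sr, mono=True, duration=duration)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

def spectrogram_similarity(a, b):
    """Cosine similarity between two spectrograms, trimmed to a common length."""
    frames = min(a.shape[1], b.shape[1])
    va, vb = a[:, :frames].ravel(), b[:, :frames].ravel()
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-8))

def top_matches(generated_path, training_paths, k=5):
    """Return the k training clips most similar to the generated clip."""
    query = log_mel(generated_path)
    scored = [(spectrogram_similarity(query, log_mel(p)), p) for p in training_paths]
    return sorted(scored, reverse=True)[:k]

# Usage with placeholder paths; the 0.95 flagging threshold is arbitrary.
# matches = top_matches("generated.wav", ["train_0001.wav", "train_0002.wav"])
# suspects = [(score, path) for score, path in matches if score > 0.95]
```

In practice the training-set spectrograms would be precomputed and cached, and flagged pairs could be confirmed with a stricter check on the raw waveforms.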
BeatDance: A Beat-Based Model-Agnostic Contrastive Learning Framework for Music-Dance Retrieval
paper_authors: Kaixing Yang, Xukun Zhou, Xulong Tang, Ran Diao, Hongyan Liu, Jun He, Zhaoxin Fan
For: The paper aims to improve dance-music retrieval performance by utilizing the alignment between music beats and dance movements.
Methods: The proposed method, BeatDance, incorporates a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer to improve dance-music retrieval performance.
Results: The experimental results on the Music-Dance (MD) dataset demonstrate the superiority of the proposed method over existing baselines, achieving state-of-the-art performance.
Abstract
Dance and music are closely related forms of expression, with mutual retrieval between dance videos and music being a fundamental task in various fields like education, art, and sports. However, existing methods often suffer from unnatural generation effects or fail to fully explore the correlation between music and dance. To overcome these challenges, we propose BeatDance, a novel beat-based model-agnostic contrastive learning framework. BeatDance incorporates a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer to improve dance-music retrieval performance by utilizing the alignment between music beats and dance movements. We also introduce the Music-Dance (MD) dataset, a large-scale collection of over 10,000 music-dance video pairs for training and testing. Experimental results on the MD dataset demonstrate the superiority of our method over existing baselines, achieving state-of-the-art performance. The code and dataset will be made publicly available upon acceptance.
Abstract
Dance and music are closely related forms of expression, and there is a strong correlation between them. However, existing methods often produce unnatural generation effects or fail to exploit the relationship between music and dance. To address these challenges, we propose BeatDance, a novel beat-based, model-agnostic contrastive learning framework. BeatDance comprises a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer, which improve dance-music retrieval performance through the alignment of music beats and dance movements. We also introduce the Music-Dance (MD) dataset, a large-scale collection of music-dance video pairs for training and testing. Experimental results show that our method outperforms existing baselines on the MD dataset, achieving state-of-the-art performance. The code and dataset will be made publicly available upon acceptance.
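The abstract names BeatDance's components but not its training objective; frameworks of this kind are commonly trained with a symmetric contrastive (InfoNCE) loss over paired music and dance embeddings. The sketch below shows only that generic loss; the embedding size, batch size, and temperature are assumptions, and it is not BeatDance's actual implementation.

```python
# Generic sketch of a symmetric contrastive (InfoNCE) objective for paired
# music/dance embeddings. Embedding size, batch size, and temperature are
# assumptions; this is not BeatDance's actual implementation.
import torch
import torch.nn.functional as F

def info_nce_loss(music_emb, dance_emb, temperature=0.07):
    """Contrastive loss over a batch of paired music and dance embeddings."""
    music = F.normalize(music_emb, dim=-1)
    dance = F.normalize(dance_emb, dim=-1)
    logits = music @ dance.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; score retrieval in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random stand-in embeddings: a batch of 8 pairs, 256-d features.
music_emb = torch.randn(8, 256)
dance_emb = torch.randn(8, 256)
loss = info_nce_loss(music_emb, dance_emb)
```

Because the loss only needs fixed-size embeddings from each modality, it stays model-agnostic in the sense the abstract uses: any music or dance encoder can be plugged in.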
Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference
paper_authors: Dejan Porjazovski, Yaroslav Getman, Tamás Grósz, Mikko Kurimo
for: This paper is written for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks.
methods: The paper employs large pre-trained models and explores audio-only and hybrid solutions leveraging audio and text modalities. The authors also introduce a Bayesian layer as an alternative to the standard linear output layer.
results: The empirical results consistently show the superiority of the hybrid approaches over the audio-only models, with the multimodal fusion approach achieving an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints, and the ensemble model for the Emotion Share task yielding the best rho value of .614. Additionally, the Bayesian wav2vec2 approach allows for easily building ensembles with usable confidence values instead of overconfident posterior probabilities.
Abstract
Large pre-trained models are essential in paralinguistic systems, demonstrating effectiveness in tasks like emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only and hybrid solutions leveraging audio and text modalities. Our empirical results consistently show the superiority of the hybrid approaches over the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best rho value of .614. The Bayesian wav2vec2 approach, explored in this study, allows us to easily build ensembles, at the cost of fine-tuning only one model. Moreover, we can have usable confidence values instead of the usual overconfident posterior probabilities.
Abstract
Large pre-trained models play a key role in paralinguistic systems, demonstrating effectiveness in tasks such as emotion recognition and stuttering detection. In this paper, we use large pre-trained models for the Requests and Emotion Share tasks of the ACM Multimedia Computational Paralinguistics Challenge. We study hybrid solutions that leverage the audio and text modalities and compare them against audio-only models. Our experimental results show that the hybrid approaches have a clear advantage on the HC-Requests and HC-Complaints tasks. In addition, we introduce a Bayesian layer as an alternative to the standard linear output layer. Our multimodal fusion approach achieves 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. We also propose a Bayesian wav2vec2 approach that makes it easy to build ensembles at the cost of fine-tuning only one model, and that provides usable confidence values instead of the usual overconfident posterior probabilities.
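To make the Bayesian-layer idea above more concrete, here is a rough sketch in which the standard linear classification head is replaced by a mean-field Gaussian layer, and class probabilities are averaged over weight samples to obtain confidence estimates. The parameterization, feature size, and sample count are assumptions rather than the authors' configuration, and a full treatment would also add a KL regularization term during training.

```python
# Rough sketch of a mean-field Bayesian classification head in place of a
# standard linear output layer. The parameterization, 768-d input (e.g., a
# pooled wav2vec2 embedding), and sample count are assumptions; training would
# also need a KL regularization term, omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over its weights."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_uniform_(self.w_mu)

    def forward(self, x):
        # Reparameterization: draw a fresh weight sample on every call.
        sigma = F.softplus(self.w_rho)
        weight = self.w_mu + sigma * torch.randn_like(sigma)
        return F.linear(x, weight, self.bias)

def predict_with_confidence(head, features, n_samples=20):
    """Average class probabilities over weight samples; the spread signals uncertainty."""
    probs = torch.stack([F.softmax(head(features), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

# Example: four 768-d utterance embeddings classified into two classes.
head = BayesianLinear(768, 2)
mean_probs, spread = predict_with_confidence(head, torch.randn(4, 768))
```

The spread across samples gives a usable uncertainty signal, in contrast to the single, often overconfident, softmax output of a deterministic head.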
Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios
results: Experimental results show that the proposed training approach outperforms existing solutions, and that the SOD module achieves high accuracy.
Abstract
This paper introduces a practical approach for leveraging a real-time deep learning model to alternate between speech enhancement and joint speech enhancement and separation depending on whether the input mixture contains one or two active speakers. Scale-invariant signal-to-distortion ratio (SI-SDR) has been shown to be a highly effective training measure in time-domain speech separation. However, the SI-SDR metric is ill-defined for zero-energy target signals, which is a problem when training a speech separation model using utterances with varying numbers of talkers. Unlike existing solutions that focus on modifying the loss function to accommodate zero-energy target signals, the proposed approach circumvents this problem by training the model to extract speech on both its output channels regardless of whether the input is a single or dual-talker mixture. A lightweight speaker overlap detection (SOD) module is also introduced to differentiate between single and dual-talker segments in real-time. The proposed module takes advantage of the new formulation by operating directly on the separated masks, given by the separation model, instead of the original mixture, thus effectively simplifying the detection task. Experimental results show that the proposed training approach outperforms existing solutions, and the SOD module exhibits high accuracy.
Abstract
Unlike existing solutions that modify the loss function to accommodate zero-energy target signals, the proposed approach trains the model to extract speech on both its output channels regardless of whether the input is a single or dual-talker mixture. Additionally, a lightweight speaker overlap detection (SOD) module is introduced to differentiate between single and dual-talker segments in real time. The SOD module operates directly on the separated masks provided by the separation model, rather than on the original mixture, which simplifies the detection task. Experimental results show that the proposed training approach outperforms existing solutions, and the SOD module exhibits high accuracy.
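As background for the zero-energy issue raised above, here is a small numerical sketch of SI-SDR; the epsilon terms and signal lengths are illustrative assumptions, not the paper's formulation.

```python
# Illustrative sketch of scale-invariant SDR and its breakdown for zero-energy
# targets. The epsilon terms and signal lengths are assumptions for the example.
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """SI-SDR in dB: 10*log10(||alpha*s||^2 / ||alpha*s - s_hat||^2)."""
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    noise = estimate - projection
    return 10.0 * np.log10((np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)
print(si_sdr(estimate, target))   # close estimate -> high score (~20 dB)

silent = np.zeros(16000)          # zero-energy target, e.g. an absent second talker
print(si_sdr(estimate, silent))   # projection collapses to zero, so the score is
                                  # dominated by eps and carries no useful signal
```

Training the model to emit speech on both output channels, as the abstract describes, sidesteps this case because the loss never has to score an estimate against an all-zero reference.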