paper_authors: Étienne Labbé, Thomas Pellegrini, Julien Pinquier
for: The paper addresses automated audio captioning (AAC), using a ConvNeXt architecture as the audio encoder and exploring task embeddings to improve performance across multiple datasets.
methods: The paper uses a ConvNeXt architecture as the audio encoder, which is adapted from the vision domain to audio classification. The model is trained on multiple AAC datasets (AC, CL, MACS, WavCaps) with a task embedding (TE) token to identify the source dataset for each input sample.
results: The paper achieves state-of-the-art scores on the AudioCaps (AC) dataset and competitive performance on Clotho (CL) with fewer parameters than existing models. The use of task embeddings improves cross-dataset performance, but there is still a performance gap between datasets, indicating the need for dataset-specific models. The resulting model, called CoNeTTE, achieves SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively.
Abstract
Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures. An audio encoder produces audio embeddings fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset due to its origin in AudioSet by investigating an unbiased encoder's impact on performance. Using PANN's well-known CNN14 as an unbiased encoder, for instance, we observed a 1.7% absolute reduction in SPIDEr score (where higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset of each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code available: https://github.com/Labbeti/conette-audio-captioning.
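As context for how such dataset conditioning can be realized, below is a minimal sketch of a Task-Embedding-conditioned Transformer decoder: one learned TE vector per source dataset is prepended to the ConvNeXt audio embeddings before decoding. The abstract does not specify the exact mechanism, so the class name, layer sizes, and layout here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskConditionedDecoder(nn.Module):
    """Sketch: prepend a learned, dataset-specific Task Embedding (TE)
    token to the audio embeddings before Transformer decoding.
    Hypothetical names and sizes; not the authors' code."""

    def __init__(self, d_model: int, num_datasets: int, vocab_size: int):
        super().__init__()
        # One learned TE vector per source dataset (e.g. AC, CL, MACS, WavCaps).
        self.task_emb = nn.Embedding(num_datasets, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_emb, dataset_id, tgt_tokens):
        # audio_emb: (B, T, d_model) frame embeddings from the audio encoder.
        te = self.task_emb(dataset_id).unsqueeze(1)    # (B, 1, d_model)
        memory = torch.cat([te, audio_emb], dim=1)     # condition on the TE
        tgt = self.token_emb(tgt_tokens)               # (B, L, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                       # (B, L, vocab_size)
```

Prepending the TE to the encoder memory lets cross-attention condition every generated word on the source dataset, which is one simple way such a token could steer both the vocabulary and the style of the captions per dataset.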
Remixing-based Unsupervised Source Separation from Scratch
results: Experimental results show that the proposed method outperforms mixture invariant training, currently the only available approach for training a monaural separation model from scratch. In addition, a simple remixing method is proposed to stabilize training.
Abstract
We propose an unsupervised approach for training separation models from scratch using RemixIT and Self-Remixing, which are recently proposed self-supervised learning methods for refining pre-trained models. They first separate mixtures with a teacher model and create pseudo-mixtures by shuffling and remixing the separated signals. A student model is then trained to separate the pseudo-mixtures using either the teacher's outputs or the initial mixtures as supervision. To refine the teacher's outputs, the teacher's weights are updated with the student's weights. While these methods originally assumed that the teacher is pre-trained, we show that they are capable of training models from scratch. We also introduce a simple remixing method to stabilize training. Experimental results demonstrate that the proposed approach outperforms mixture invariant training, which is currently the only available approach for training a monaural separation model from scratch.
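To make the teacher-student remixing loop concrete, here is a minimal sketch of one RemixIT-style training step when training from scratch; the Self-Remixing variant would instead remix the student's estimates and supervise against the initial mixtures. The helper name, the MSE loss (a stand-in for the usual SI-SDR-type objective), and the EMA rate are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def remixit_step(teacher, student, optimizer, mixtures, ema=0.99):
    """Hypothetical sketch of one RemixIT-style update: the teacher
    separates, its estimates are shuffled and remixed into
    pseudo-mixtures, the student learns from the teacher's outputs,
    and the teacher tracks the student via an EMA of its weights."""
    with torch.no_grad():
        est = teacher(mixtures)               # (B, n_src, T) source estimates

    # Shuffle each source slot independently across the batch, then remix.
    B, n_src, T = est.shape
    perms = [torch.randperm(B) for _ in range(n_src)]
    shuffled = torch.stack([est[perms[s], s] for s in range(n_src)], dim=1)
    pseudo_mix = shuffled.sum(dim=1)          # (B, T) pseudo-mixtures

    # Student is supervised by the (shuffled) teacher estimates.
    student_est = student(pseudo_mix)
    loss = F.mse_loss(student_est, shuffled)  # stand-in for an SI-SDR loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refine the teacher: exponential moving average of the student weights.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1.0 - ema)
    return loss.item()
```

Shuffling each source slot across the batch decorrelates the sources inside the pseudo-mixtures, which is what lets the student learn separation without any ground-truth references even when the teacher starts from random weights.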