eess.AS - 2023-12-06

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

  • paper_url: http://arxiv.org/abs/2312.03694
  • repo_url: https://github.com/umbertocappellazzo/petl_ast
  • paper_authors: Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti, Mirco Ravanelli
  • for: This paper studies how large pre-trained models can be adapted efficiently to multiple downstream tasks, improving performance while avoiding fine-tuning of all parameters.
  • methods: The paper investigates several parameter-efficient methods, including prompt-tuning and adapters, which fine-tune only a small number of extra parameters rather than the full model. Adapters in particular, thanks to their flexibility, have attracted considerable attention in recent years and have given rise to several variants (a minimal adapter sketch is given after this entry).
  • results: The Audio Spectrogram Transformer performs impressively on audio classification, but how to adapt it efficiently to multiple downstream tasks had remained open. The paper's detailed investigation shows that adapters consistently outperform the other methods across four benchmarks, and that this advantage holds in few-shot learning settings and as the total number of trainable parameters increases.
    Abstract The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fine-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning and adapters have emerged as solid alternatives by fine-tuning only a small number of extra parameters, without sacrificing performance accuracy. Specifically, adapters, due to their flexibility, have recently garnered significant attention, leading to several variants. For audio classification tasks, the Audio Spectrogram Transformer model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common parameter-efficient methods, revealing that adapters consistently outperform the other methods across four benchmarks. This trend is also confirmed in few-shot learning settings and when the total number of trainable parameters increases, demonstrating adapters' superior scalability. We finally study the best adapter configuration, as well as the role of residual connections in the learning process.
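To make the adapter approach concrete, below is a minimal, hypothetical PyTorch sketch of a bottleneck adapter of the kind surveyed here. The module names, bottleneck size, and insertion point are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-projection -> non-linearity -> up-projection with a residual connection."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # only these small layers are trained
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as an identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen backbone's features at initialization.
        return x + self.up(self.act(self.down(x)))


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze all pre-trained weights; leave only adapter parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name


# Example: adapt the output of one (frozen) Transformer sub-layer.
adapter = BottleneckAdapter(dim=768)
hidden = torch.randn(2, 100, 768)   # (batch, time, embedding)
out = adapter(hidden)               # same shape, adapted features
```

The zero-initialized up-projection means training starts from the pre-trained model's behavior, while only the small down/up projections add trainable parameters.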

Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion

  • paper_url: http://arxiv.org/abs/2312.03324
  • repo_url: None
  • paper_authors: Yanxiong Li, Zhongjie Jiang, Qisheng Huang, Wenchang Cao, Jialong Li
  • for: To reduce the complexity of speaker verification models so that they can achieve satisfactory accuracy on low-resource terminals.
  • methods: The authors propose a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The module consists of several simple yet effective operations, such as convolution, pooling, mean, concatenation, normalization, and element-wise summation. It works in a plug-and-play fashion and can be inserted into a wide variety of models, reducing model complexity while roughly maintaining model error (a minimal sketch is given after this entry).
  • results: Experiments on two public speech corpora (namely VoxCeleb1 and VoxCeleb2) show that inserting the transformation module into three models (AMCRN, ResNet34, and ECAPA-TDNN) only slightly increases model error while significantly decreasing model complexity. The proposed method outperforms the baselines overall in memory requirement and computational complexity at a lower equal error rate, and it generalizes well across truncated segments of various lengths.
    Abstract Although many efforts have been made on decreasing the model complexity for speaker verification, it is still challenging to deploy speaker verification systems with satisfactory results on low-resource terminals. We design a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The transformation module consists of multiple simple but effective operations, such as convolution, pooling, mean, concatenation, normalization, and element-wise summation. It works in a plug-and-play way, and can be easily implanted into a wide variety of models to reduce the model complexity while maintaining the model error. First, the input feature is split into several low-dimensional feature subsets for decreasing the model complexity. Then, each feature subset is updated by fusing it with the inter-feature-subsets correlational information to enhance its representational capability. Finally, the updated feature subsets are independently fed into the block (one or several layers) of the model for further processing. The features that are output from current block of the model are processed according to the steps above before they are fed into the next block of the model. Experimental data are selected from two public speech corpora (namely VoxCeleb1 and VoxCeleb2). Results show that implanting the transformation module into three models (namely AMCRN, ResNet34, and ECAPA-TDNN) for speaker verification slightly increases the model error and significantly decreases the model complexity. Our proposed method outperforms baseline methods on the whole in memory requirement and computational complexity with lower equal error rate. It also generalizes well across truncated segments with various lengths.
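To make the feature-partition-and-fusion idea concrete, here is a minimal, hypothetical PyTorch sketch. The specific fusion operator (a mean summary plus a linear projection), the number of subsets, and the tensor layout are assumptions for illustration; the paper's transformation module combines convolution, pooling, mean, concatenation, normalization, and element-wise summation and may differ in detail.

```python
import torch
import torch.nn as nn

class PartitionFusion(nn.Module):
    """Split the input feature into low-dimensional subsets, then update each
    subset by fusing it with a summary of the other subsets."""

    def __init__(self, dim: int, num_subsets: int = 4):
        super().__init__()
        assert dim % num_subsets == 0, "feature dimension must divide evenly"
        self.num_subsets = num_subsets
        sub_dim = dim // num_subsets
        self.fuse = nn.Linear(sub_dim, sub_dim)  # lightweight projection of the shared summary (assumed)
        self.norm = nn.LayerNorm(sub_dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) -> partition into low-dimensional feature subsets.
        subsets = torch.chunk(x, self.num_subsets, dim=-1)
        # A simple stand-in for the inter-feature-subset correlational
        # information: the mean across subsets.
        summary = torch.stack(subsets, dim=0).mean(dim=0)
        # Element-wise summation fuses each subset with the shared summary,
        # followed by normalization; each updated subset would then be fed
        # independently into the next block of the host model.
        return [self.norm(s + self.fuse(summary)) for s in subsets]


# Example: the module keeps the total feature size but processes it as
# four 32-dimensional subsets, which is what reduces the per-block complexity.
module = PartitionFusion(dim=128, num_subsets=4)
features = torch.randn(8, 200, 128)   # (batch, frames, feature dim)
subsets = module(features)            # list of 4 tensors, each (8, 200, 32)
```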