Related papers: An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning

An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning

URL: http://arxiv.org/abs/2302.00378v1
Date: Wed, 1 Feb 2023 11:20:18 GMT
Title: An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning
Authors: Mohammad AkbarTajari, Sara Rajaee, Mohammad Taher Pilehvar
Abstract summary: We investigate the capability of different transformer modules in transferring knowledge from a pre-trained model to a downstream task. LayerNorms exhibit the best capacity for knowledge transfer with limited trainable weights.
Score: 18.69409646532038
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Parameter-efficient fine-tuning approaches have recently garnered a lot of attention. Having considerably lower number of trainable weights, these methods can bring about scalability and computational effectiveness. In this paper, we look for optimal sub-networks and investigate the capability of different transformer modules in transferring knowledge from a pre-trained model to a downstream task. Our empirical results suggest that every transformer module in BERT can act as a winning ticket: fine-tuning each specific module while keeping the rest of the network frozen can lead to comparable performance to the full fine-tuning. Among different modules, LayerNorms exhibit the best capacity for knowledge transfer with limited trainable weights, to the extent that, with only 0.003% of all parameters in the layer-wise analysis, they show acceptable performance on various target tasks. On the reasons behind their effectiveness, we argue that their notable performance could be attributed to their high-magnitude weights compared to that of the other modules in the pre-trained BERT.

Related papers

VIViT: Variable-Input Vision Transformer Framework for 3D MR Image Segmentation [8.634647333205375]
We propose variable-input ViT (VIViT), a transformer-based framework for self-supervised pretraining and segmentation finetuning.<n>We validate our method on brain infarct and brain tumor segmentation, where our method outperforms current CNN and ViT-based models with a mean Dice score of 0.624 and 0.883 respectively.
arXiv Detail & Related papers (2025-05-13T15:52:34Z)
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [123.07450481623124]
We propose Skip Tuning as a novel paradigm for adapting vision-language models to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules.
arXiv Detail & Related papers (2024-12-16T07:33:23Z)
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models [24.62337386603331]
Large Multi-modal Models (LMMs) are revolutionizing the way machines interact with the world. To adapt LMMs for downstream tasks, parameter-efficient fine-tuning (PEFT) has gained popularity. This paper focuses on the strengths and weaknesses of each tuning strategy, shifting the focus from the efficiency typically associated with these approaches.
arXiv Detail & Related papers (2024-10-29T07:55:50Z)
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation. DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
Vanilla Transformers are Transfer Capability Teachers [34.24324719229975]
We propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance. MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks. The MoE models guided by vanilla models can achieve both strong pre-training performance and transfer capability, ultimately enhancing their performance in downstream tasks.
arXiv Detail & Related papers (2024-03-04T12:40:28Z)
Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together. This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique. In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation [53.835365470800916]
We show how to efficiently and effectively transfer knowledge in a vision transformer. We propose consolidator to modify the pre-trained model with the addition of a small set of tunable parameters. Our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% parameters.
arXiv Detail & Related papers (2023-04-30T23:59:02Z)
PVP: Pre-trained Visual Parameter-Efficient Tuning [29.05396521860764]
Large-scale pre-trained transformers have demonstrated remarkable success in various computer vision tasks. It is still highly challenging to fully fine-tune these models for downstream tasks due to their high computational and storage costs. We propose a Pre-trained Visual. efficient (PVP) Tuning framework, which pre-trains the parameter-efficient tuning modules first and then leverages the pre-trained modules.
arXiv Detail & Related papers (2023-04-26T15:55:29Z)
Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
arXiv Detail & Related papers (2023-04-04T16:14:39Z)
Rethinking Efficient Tuning Methods from a Unified Perspective [34.67645496324432]
We revisit the design paradigm of PETL and derive a unified framework U-Tuning for parameter-efficient transfer learning. The U-Tuning framework can simultaneously encompass existing methods and derive new approaches for parameter-efficient transfer learning.
arXiv Detail & Related papers (2023-03-01T17:38:03Z)
Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks. adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy. The proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
Investigating Transferability in Pretrained Language Models [8.83046338075119]
We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance. This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks.
arXiv Detail & Related papers (2020-04-30T17:23:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.