LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer
Learning
- URL: http://arxiv.org/abs/2206.06522v1
- Date: Mon, 13 Jun 2022 23:51:56 GMT
- Title: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer
Learning
- Authors: Yi-Lin Sung, Jaemin Cho, Mohit Bansal
- Abstract summary: It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts.
- Score: 82.93130407930762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning large pre-trained models on downstream tasks has been adopted in
a variety of domains recently. However, it is costly to update the entire
parameter set of large pre-trained models. Although recently proposed
parameter-efficient transfer learning (PETL) techniques allow updating a small
subset of parameters (e.g. only using 2% of parameters) inside a pre-trained
backbone network for a new task, they only reduce the training memory
requirement by up to 30%. This is because the gradient computation for the
trainable parameters still requires backpropagation through the large
pre-trained backbone model. To address this, we propose Ladder Side-Tuning
(LST), a new PETL technique that reduces training memory requirements by more
substantial amounts. Unlike existing parameter-efficient methods that insert
additional parameters inside backbone networks, we train a ladder side network,
a small and separate network that takes intermediate activations as input via
shortcut connections (ladders) from backbone networks and makes predictions.
LST has significantly lower memory requirements than previous methods, because
it does not require backpropagation through the backbone network, but instead
only through the side network and ladder connections. We evaluate our method
with various models (T5, CLIP-T5) on both NLP (GLUE) and vision-language (VQA,
GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the
whole network, while other methods save only 26% at a similar parameter
budget (hence, 2.7x more memory savings). Moreover, LST achieves higher
accuracy than Adapter and LoRA in a low-memory regime. To further show the
advantage of this better memory efficiency, we also apply LST to larger T5
models (T5-large, T5-3B), attaining better GLUE performance than full
fine-tuning and other PETL methods. The exact same trend also holds in our
experiments on VL tasks.
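To make the mechanism concrete, the sketch below illustrates the idea in PyTorch: the frozen backbone is run without gradient tracking, its intermediate activations enter a small side network through ladder projections, and backpropagation touches only the side network and those ladder connections. The names (LadderSideNetwork, collect_hidden_states, training_step), the gated fusion, and the assumption that the backbone exposes its blocks as backbone.layers are illustrative choices, not the authors' released implementation.
```python
# Minimal sketch of the ladder-side idea, assuming a frozen transformer backbone.
# LadderSideNetwork, collect_hidden_states, training_step and `backbone.layers`
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn


class LadderSideNetwork(nn.Module):
    def __init__(self, backbone_dim, side_dim, num_layers, num_classes):
        super().__init__()
        # Ladder (shortcut) connections: project each intermediate backbone
        # activation down to the narrow side-network width.
        self.ladders = nn.ModuleList(
            nn.Linear(backbone_dim, side_dim) for _ in range(num_layers))
        # Small side blocks (the paper uses a reduced-width transformer; a plain
        # MLP block stands in here to keep the sketch short).
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(side_dim),
                          nn.Linear(side_dim, side_dim),
                          nn.GELU())
            for _ in range(num_layers))
        # Learnable gates that mix the running side state with each ladder input.
        self.gates = nn.ParameterList(
            nn.Parameter(torch.zeros(1)) for _ in range(num_layers))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, hidden_states):
        # hidden_states: list of num_layers tensors [batch, seq, backbone_dim]
        # collected from the frozen backbone.
        h = 0.0
        for x, ladder, block, gate in zip(hidden_states, self.ladders,
                                          self.blocks, self.gates):
            g = torch.sigmoid(gate)
            h = block(g * h + (1.0 - g) * ladder(x))
        return self.head(h.mean(dim=1))  # pool over tokens, then predict


def collect_hidden_states(backbone, inputs):
    # Grab each block's output with forward hooks; assumes the backbone exposes
    # its blocks as `backbone.layers` and that each block returns a tensor
    # (structure varies by model, so adapt this to the backbone at hand).
    states, handles = [], []
    for layer in backbone.layers:
        handles.append(layer.register_forward_hook(
            lambda _mod, _inp, out: states.append(out)))
    try:
        backbone(inputs)
    finally:
        for h in handles:
            h.remove()
    return states


def training_step(backbone, side_net, inputs, labels, criterion, optimizer):
    # The backbone is frozen and run without gradient tracking, so none of its
    # activations are cached for backpropagation; gradients flow only through
    # the side network and the ladder connections.
    for p in backbone.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        hidden_states = collect_hidden_states(backbone, inputs)
    logits = side_net(hidden_states)
    loss = criterion(logits, labels)
    optimizer.zero_grad()   # optimizer built over side_net.parameters() only
    loss.backward()
    optimizer.step()
    return loss.item()
```
In this setup the optimizer sees only side_net.parameters() and the backbone forward runs under torch.no_grad(), so no optimizer state or cached activations are kept for the large model, which is where the memory savings reported in the abstract come from.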
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, only a minimal number of late pre-trained layers is used, which alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient because it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z)
- Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning [19.17362588650503]
Low-rank Attention Side-Tuning (LAST) trains a side-network composed of only low-rank self-attention modules.
We show that LAST can be highly parallelized across multiple optimization objectives, making it very efficient for downstream task adaptation.
arXiv Detail & Related papers (2024-02-06T14:03:15Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- DTL: Disentangled Transfer Learning for Visual Recognition [21.549234013998255]
We introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN).
The proposed method not only greatly reduces GPU memory usage and the number of trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy.
arXiv Detail & Related papers (2023-12-13T02:51:26Z)
- UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT).
arXiv Detail & Related papers (2023-08-28T05:38:43Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves over classic fine-tuning by 35% and over prompt-tuning by 6%, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences arising from its use.