Make Pre-trained Model Reversible: From Parameter to Memory Efficient
Fine-Tuning
- URL: http://arxiv.org/abs/2306.00477v4
- Date: Thu, 19 Oct 2023 16:03:31 GMT
- Title: Make Pre-trained Model Reversible: From Parameter to Memory Efficient
Fine-Tuning
- Authors: Baohao Liao, Shaomu Tan, Christof Monz
- Abstract summary: We propose memory-efficient fine-tuning (MEFT) for pre-trained language models.
MEFT inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training.
MEFT significantly reduces activation memory, by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters.
- Score: 6.451743797015637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs)
has emerged as a highly successful approach: it trains only a small number of
parameters without sacrificing performance and has become the de-facto
learning paradigm as PLMs grow in size. However, existing PEFT methods are not
memory-efficient, because they still require caching most of the intermediate
activations for gradient computation, akin to full fine-tuning.
One effective way to reduce activation memory is to use a reversible model,
so that intermediate activations need not be cached and can instead be
recomputed during the backward pass. Nevertheless, converting a PLM into a
reversible variant is not straightforward, since reversible models have a
distinct architecture from currently released PLMs. In this paper, we first
investigate a key factor behind the success of existing PEFT methods and find
that it is essential to preserve the PLM's starting point when initializing a
PEFT method.
With this finding, we propose memory-efficient fine-tuning (MEFT) that inserts
adapters into a PLM, preserving the PLM's starting point and making it
reversible without additional pre-training. We evaluate MEFT on the GLUE
benchmark and five question-answering tasks with various backbones (BERT,
RoBERTa, BART, and OPT). MEFT significantly reduces activation memory, by up
to 84% relative to full fine-tuning, with a negligible number of trainable
parameters. Moreover, MEFT matches full fine-tuning on GLUE and achieves
comparable scores on the question-answering tasks. A similar finding is also
observed for image classification.
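The core mechanism behind MEFT's memory saving is making the network invertible: if a layer's inputs can be reconstructed exactly from its outputs, intermediate activations do not have to be cached for backpropagation and can be recomputed on the fly. As a hedged illustration (not the authors' implementation), the sketch below arranges a frozen pre-trained sublayer and a zero-initialized bottleneck adapter as the two functions of a RevNet-style additive coupling; the class names, adapter shape, and initialization are illustrative assumptions.

```python
# Minimal sketch of a reversible adapter block in the spirit of MEFT.
# Assumptions: RevNet-style additive coupling, a frozen pre-trained sublayer F,
# and a zero-initialized bottleneck adapter G. Not the authors' code.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter; the up-projection is zero-initialized so the block
    starts out as (nearly) the identity, preserving the PLM's starting point,
    which the paper identifies as essential."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))


class ReversibleAdapterBlock(nn.Module):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1), with F a frozen
    pre-trained sublayer and G the trainable adapter. Because the coupling is
    invertible, block inputs can be recomputed from block outputs during the
    backward pass instead of being cached."""

    def __init__(self, frozen_sublayer: nn.Module, adapter: nn.Module):
        super().__init__()
        self.F = frozen_sublayer
        for p in self.F.parameters():
            p.requires_grad_(False)  # only the adapter is trained
        self.G = adapter

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Exact reconstruction of the inputs -- the source of the memory saving.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2


if __name__ == "__main__":
    dim = 64
    block = ReversibleAdapterBlock(nn.Linear(dim, dim), Adapter(dim))
    x1, x2 = torch.randn(2, dim), torch.randn(2, dim)
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```

A memory-efficient training loop would pair such blocks with a custom backward pass that calls `inverse` to recover activations layer by layer, as reversible networks do; the paper's contribution is showing how to insert adapters so that the resulting reversible model preserves the PLM's starting point without additional pre-training.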
Related papers
- Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training [44.48966200270378]
Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) methods presents significant computational challenges.
We propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts.
Our Bilevel ZOFO method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required (see the forward-only zeroth-order sketch after this list).
arXiv Detail & Related papers (2025-02-05T20:47:44Z)
- Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models [14.68920095399595]
Sparsity-based PEFT (SPEFT) introduces trainable sparse adaptations to the weight matrices in the model.
We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies.
Our work challenges the notion that complexity is necessary for effective PEFT.
arXiv Detail & Related papers (2024-12-18T04:14:35Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers [88.68985153780514]
We propose a new selective PEFT method, namely SparseGrad, that performs well on parameter blocks.
We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task.
arXiv Detail & Related papers (2024-10-09T19:03:52Z)
- Boosting Memory Efficiency in Transfer Learning for High-Resolution Medical Image Classification [1.5791081894226173]
Fine-grained Prompt Tuning plus (FPT+) is a PETL method designed for high-resolution medical image classification.
FPT+ performs transfer learning by training a lightweight side network that accesses knowledge from a large pre-trained model.
We evaluate FPT+ on eight medical image datasets of varying sizes, modalities, and complexities.
arXiv Detail & Related papers (2024-08-05T12:33:07Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
- UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL work focuses on the more valuable property of memory efficiency.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT).
arXiv Detail & Related papers (2023-08-28T05:38:43Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
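As referenced in the Bilevel ZOFO entry above, zeroth-order (ZO) methods need only forward passes of the base model, so no activations are stored for backpropagation through it. The sketch below shows the generic building block, a two-point (SPSA-style) ZO gradient estimate; it is an illustrative assumption, not the paper's bilevel algorithm, and `model`, the loss, and the toy data are placeholders.

```python
# Generic two-point zeroth-order gradient estimate: two forward passes along a
# random direction, no backpropagation. Illustrative sketch only.
import torch


def zo_gradient_estimate(loss_fn, params, eps: float = 1e-3):
    """Estimate d(loss)/d(params) from loss(theta + eps*u) and loss(theta - eps*u)."""
    direction = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, d in zip(params, direction):       # theta -> theta + eps * u
            p.add_(d, alpha=eps)
        loss_plus = loss_fn()
        for p, d in zip(params, direction):       # theta + eps*u -> theta - eps*u
            p.add_(d, alpha=-2 * eps)
        loss_minus = loss_fn()
        for p, d in zip(params, direction):       # restore theta
            p.add_(d, alpha=eps)
        scale = (loss_plus - loss_minus) / (2 * eps)
        return [scale * d for d in direction]


# Toy usage with a stand-in "base model"; a real LLM would only be run forward here.
model = torch.nn.Linear(8, 1)
x, y = torch.randn(32, 8), torch.randn(32, 1)
params = list(model.parameters())
grads = zo_gradient_estimate(lambda: torch.nn.functional.mse_loss(model(x), y), params)
with torch.no_grad():
    for p, g in zip(params, grads):
        p.add_(g, alpha=-0.01)   # plain SGD step using the zeroth-order estimate
```

In a bilevel scheme like the one summarized above, forward-only estimates of this kind could be combined with ordinary first-order gradients for the small PEFT module, which is what keeps both memory and compute low.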
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.