On the Effectiveness of Parameter-Efficient Fine-Tuning
- URL: http://arxiv.org/abs/2211.15583v1
- Date: Mon, 28 Nov 2022 17:41:48 GMT
- Title: On the Effectiveness of Parameter-Efficient Fine-Tuning
- Authors: Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, Nigel
Collier
- Abstract summary: Currently, many research works propose to only fine-tune a small portion of the parameters while keeping most of the parameters shared across different tasks.
We show that all of the methods are actually sparse fine-tuned models and conduct a novel theoretical analysis of them.
Despite the effectiveness of sparsity grounded in our theory, how to choose the tunable parameters remains an open problem.
- Score: 79.6302606855302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained models has been ubiquitously proven to be effective
in a wide range of NLP tasks. However, fine-tuning the whole model is parameter
inefficient as it always yields an entirely new model for each task. Currently,
many research works propose to only fine-tune a small portion of the parameters
while keeping most of the parameters shared across different tasks. These
methods achieve surprisingly good performance and are shown to be more stable
than their corresponding fully fine-tuned counterparts. However, such methods
are still not well understood. Some natural questions arise: How does
parameter sparsity lead to promising performance? Why are these models more
stable than fully fine-tuned models? How should the tunable parameters be chosen?
In this paper, we first categorize the existing methods into random approaches,
rule-based approaches, and projection-based approaches based on how they choose
which parameters to tune. Then, we show that all of the methods are actually
sparse fine-tuned models and conduct a novel theoretical analysis of them. We
show that the sparsity actually imposes a regularization on the original model
by controlling an upper bound on its stability. Such stability leads to better
generalization, which has been empirically observed in many recent research
works. Despite the effectiveness of sparsity grounded in our theory, how to
choose the tunable parameters remains an open problem. To better choose the
tunable parameters, we propose a novel
Second-order Approximation Method (SAM), which approximates the original problem
with an analytically solvable optimization function. The tunable parameters are
determined by directly optimizing the approximation function. The experimental
results show that our proposed SAM model outperforms many strong baseline
models, and they also verify our theoretical analysis.
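The SAM idea lends itself to a compact illustration. Under a second-order Taylor expansion of the loss around the pre-trained weights with a diagonal Hessian, tuning parameter i by its optimal step -g_i/h_i decreases the loss by roughly g_i^2 / (2 h_i), so picking the k highest-scoring parameters becomes an analytically solvable selection problem. The sketch below (PyTorch) is a hedged illustration of this reasoning, not the authors' implementation: the Hessian diagonal is replaced by a squared-gradient (Fisher-style) proxy, and the function names are invented for the example; the exact SAM objective in the paper may differ.

```python
import torch

def select_tunable_mask(model, loss_fn, data_loader, k, device="cpu"):
    """Score every scalar parameter by its estimated loss decrease g_i^2 / (2 h_i)
    under a diagonal second-order approximation and keep the top-k as tunable."""
    grads = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    hess = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                grads[n] += p.grad
                hess[n] += p.grad ** 2   # Fisher-style proxy for the Hessian diagonal
    scores = {n: g ** 2 / (2 * hess[n] + 1e-12) for n, g in grads.items()}
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.topk(flat, k).values.min()   # k-th largest score
    return {n: (s >= threshold).to(torch.float32) for n, s in scores.items()}

def sparse_sgd_step(model, mask, lr=1e-4):
    """One plain SGD update that touches only the selected parameters;
    everything outside the mask stays shared with the pre-trained model."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None:
                p -= lr * p.grad * mask[n]
```

In use, the mask would be computed once on task data and then applied at every step (either via `sparse_sgd_step` or by zeroing masked gradients before a standard optimizer step), so that only the selected parameters diverge from the shared pre-trained model.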
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models and find that a portion of them is ineffective for generation.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of the optimizers, parameterizations, and model sizes considered.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - A Unified Gaussian Process for Branching and Nested Hyperparameter
Optimization [19.351804144005744]
In deep learning, tuning parameters with conditional dependencies are common in practice.
The new GP model accounts for the dependent structure among input variables through a new kernel function.
High prediction accuracy and better optimization efficiency are observed in a series of synthetic simulations and real data applications of neural networks.
arXiv Detail & Related papers (2024-01-19T21:11:32Z) - Should We Learn Most Likely Functions or Parameters? [51.133793272222874]
We investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data.
We find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
arXiv Detail & Related papers (2023-11-27T16:39:55Z) - A Stability Analysis of Fine-Tuning a Pre-Trained Model [46.6761331971071]
Fine-tuning a pre-trained model is one of the most promising paradigms in recent NLP research.
Fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance.
We propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings.
arXiv Detail & Related papers (2023-01-24T05:11:17Z) - Scaling & Shifting Your Features: A New Baseline for Efficient Model
Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, representing that researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to catch up with the performance of full finetuning (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-10-17T08:14:49Z) - Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for
Pre-trained Language Models [90.24999406296867]
In contrast with the standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched.
Recent studies have demonstrated that a series of delta tuning methods with distinct tuned parameter selection could achieve performance on a par with full-parameter fine-tuning.
arXiv Detail & Related papers (2022-03-14T07:56:32Z) - Lazy Parameter Tuning and Control: Choosing All Parameters Randomly From
a Power-Law Distribution [8.34061303235504]
Most evolutionary algorithms have multiple parameters and their values drastically affect the performance.
We propose a lazy but effective solution, namely choosing all parameter values in each iteration randomly from a suitably scaled power-law distribution.
We prove a performance guarantee that is comparable to the best performance known for static parameters.
arXiv Detail & Related papers (2021-04-14T09:17:18Z)
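As referenced in the SSF entry above, here is a minimal sketch of the scale-and-shift idea: freeze the pre-trained backbone and train only per-channel scale and shift factors (plus a task head). The toy backbone, dimensions, and names are illustrative assumptions, not the SSF authors' code.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Per-channel scale (gamma) and shift (beta) applied to frozen backbone features.
    Initialised to the identity (gamma=1, beta=0) so training starts from the
    pre-trained behaviour; only gamma and beta receive gradients."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):          # x: (..., dim)
        return x * self.gamma + self.beta

# Usage sketch: a stand-in backbone is frozen; only the scale/shift and head are trained.
backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())
for p in backbone.parameters():
    p.requires_grad_(False)
adapter, head = ScaleShift(768), nn.Linear(768, 2)
optimizer = torch.optim.AdamW(list(adapter.parameters()) + list(head.parameters()), lr=1e-3)

features = backbone(torch.randn(4, 768))   # frozen features from the backbone
logits = head(adapter(features))           # only adapter + head carry gradients
```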