One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
- URL: http://arxiv.org/abs/2410.07170v3
- Date: Mon, 16 Dec 2024 19:19:14 GMT
- Title: One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
- Authors: Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter,
- Abstract summary: The most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA).
We propose to improve LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition (SVD) on minibatches of activation vectors.
We call our new method $\textbf{E}$xplained $\textbf{V}$ariance $\textbf{A}$daptation (EVA).
- Score: 13.585425242072173
- Abstract: Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across the model weights. Recent works focus on different initialization schemes or the learning of adaptive ranks during fine-tuning. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to suboptimal performance. We propose to improve LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition (SVD) on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and redistribute ranks among all weight matrices to provably store the maximum amount of information of the downstream data in the newly introduced weights. In this way, the fine-tuning process only needs to learn which information to maintain or neglect. We call our new method $\textbf{E}$xplained $\textbf{V}$ariance $\textbf{A}$daptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution.
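The initialization and rank redistribution described above lend themselves to a short sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the authors' released implementation: the function names are hypothetical, and the redistribution rule (ranks proportional to summed explained variance under a global budget) is one plausible reading of the abstract.

```python
import torch

def eva_init(activations: torch.Tensor, max_rank: int):
    """EVA-style data-driven initialization (sketch).

    activations: (num_examples, in_features) minibatch of inputs seen by one
    weight matrix. Returns the top right-singular vectors, used to initialize
    LoRA's A matrix, plus the explained-variance ratio of each component.
    """
    X = activations - activations.mean(dim=0, keepdim=True)  # center the batch
    _, S, Vh = torch.linalg.svd(X, full_matrices=False)
    explained = S**2 / (S**2).sum()   # fraction of variance per direction
    return Vh[:max_rank], explained[:max_rank]

def redistribute_ranks(explained_per_weight: dict, budget: int) -> dict:
    """Assign each weight matrix a rank proportional to the variance its top
    components explain, keeping the total rank budget roughly fixed
    (rounding can shift the total by a few ranks)."""
    scores = {name: float(e.sum()) for name, e in explained_per_weight.items()}
    total = sum(scores.values())
    return {name: max(1, round(budget * s / total)) for name, s in scores.items()}

# Usage: A starts as the top right-singular vectors and B starts at zero,
# so the adapter B @ A is initially a no-op, as in standard LoRA.
acts = torch.randn(512, 768)                # hypothetical activation minibatch
A, explained = eva_init(acts, max_rank=16)  # A: (16, 768)
B = torch.zeros(768, 16)                    # B: (out_features, rank)
```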
Related papers
- DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models [33.4538652558253]
Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices; a minimal sketch of this update follows the entry.
We propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights.
We also introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
arXiv Detail & Related papers (2024-12-30T12:00:47Z)
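Because most of the papers below build on the same primitive, here is a minimal PyTorch sketch of the plain LoRA update the DoTA summary refers to (not DoTA's MPO decomposition itself): a frozen weight plus a trainable low-rank product that starts as a no-op.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + s * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)    # freeze all pre-trained parameters
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # no-op at init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
y = layer(torch.randn(4, 768))             # (4, 768); only A and B receive gradients
```
- NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models [26.808251361020066]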
Fine-tuning pre-trained models is resource-intensive and laborious.
One widely adopted PEFT technique, Low-Rank Adaptation (LoRA), freezes the pre-trained model weights and injects trainable low-rank update matrices.
NEAT instead introduces a lightweight neural network that takes pre-trained weights as input and learns a nonlinear transformation to approximate cumulative weight updates; a hedged sketch follows this entry.
arXiv Detail & Related papers (2024-10-02T17:29:23Z)
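A hedged sketch of the idea in the NEAT summary: a small network maps the frozen pre-trained weight itself to a nonlinear update. The concrete architecture below (a two-layer MLP applied row-wise to W) is an illustrative assumption, not the paper's exact design.

```python
import torch
import torch.nn as nn

class NonlinearAdapter(nn.Module):
    """Sketch: learn delta_W = g(W) with a small nonlinear network,
    rather than a linear low-rank product as in LoRA."""

    def __init__(self, weight: torch.Tensor, hidden: int = 8):
        super().__init__()
        self.register_buffer("W", weight)   # frozen pre-trained weight matrix
        d = weight.shape[1]
        self.g = nn.Sequential(             # lightweight; trained on the task loss
            nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d)
        )

    def adapted_weight(self) -> torch.Tensor:
        return self.W + self.g(self.W)      # update is a nonlinear function of W

adapter = NonlinearAdapter(torch.randn(768, 768))
W_new = adapter.adapted_weight()            # (768, 768), used in place of W
```
- SARA: Singular-Value Based Adaptive Low-Rank Adaption [4.135688713311511]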
LoRA is widely used as a parameter-efficient fine-tuning (PEFT) method because it adds no inference overhead.
In this work, we first analyze the relationship between the performance of different layers and their ranks using SVD.
Based on this, we design the Singular-Value Based Adaptive Low-Rank Adaption (SARA).
arXiv Detail & Related papers (2024-08-06T16:39:42Z)
- Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian Gates [21.811889512977924]
It is a common practice in natural language processing to pre-train a single model and then fine-tune it for downstream tasks.
B-LoRA is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix.
B-LoRA performs on par with or better than the baselines while reducing the total number of bit operations by roughly 70%.
arXiv Detail & Related papers (2024-06-18T20:26:30Z)
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning [101.81127587760831]
Current fine-tuning methods build adapters widely agnostic of the context of the downstream task to learn, or the context of important knowledge to maintain.
We propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters.
Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation.
arXiv Detail & Related papers (2024-06-07T19:10:35Z)
- ConvLoRA and AdaBN based Domain Adaptation via Self-Training [4.006331916849688]
We propose Convolutional Low-Rank Adaptation (ConvLoRA) for multi-target domain adaptation.
ConvLoRA freezes pre-trained model weights, adds trainable low-rank decomposition matrices to convolutional layers, and backpropagates the gradient through these matrices; a sketch follows this entry.
Our method has fewer trainable parameters and performs better or on-par with large independent fine-tuned networks.
arXiv Detail & Related papers (2024-02-07T15:43:50Z)
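A minimal sketch of the low-rank convolutional update the ConvLoRA summary describes; factoring the update into a k x k "down" convolution (in -> rank channels) and a 1 x 1 "up" convolution (rank -> out channels) is a common construction and an assumption here, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ConvLoRA2d(nn.Module):
    """Frozen conv plus a trainable low-rank update, factored as
    down (k x k, in -> rank) followed by up (1 x 1, rank -> out)."""

    def __init__(self, base: nn.Conv2d, rank: int = 4):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)     # freeze the pre-trained convolution
        self.down = nn.Conv2d(base.in_channels, rank, base.kernel_size,
                              base.stride, base.padding, bias=False)
        self.up = nn.Conv2d(rank, base.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)      # update starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

layer = ConvLoRA2d(nn.Conv2d(64, 128, 3, padding=1), rank=4)
y = layer(torch.randn(2, 64, 32, 32))       # (2, 128, 32, 32)
```
- PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]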
We introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process; the rank schedule is sketched after this entry.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
arXiv Detail & Related papers (2024-01-20T20:25:17Z)
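The linear, increasing rank allocation in the PRILoRA summary reduces to a one-line schedule; interpolating from a start rank at the first layer to a final rank at the last layer is an assumed parameterization for illustration.

```python
def linear_rank_schedule(num_layers: int, r_start: int, r_end: int) -> list:
    """Assign each layer a rank on a straight line from r_start (first layer)
    to r_end (last layer), so deeper layers get larger adapters."""
    if num_layers == 1:
        return [r_start]
    step = (r_end - r_start) / (num_layers - 1)
    return [round(r_start + i * step) for i in range(num_layers)]

print(linear_rank_schedule(12, 4, 12))
# [4, 5, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12]
```
- Consensus-Adaptive RANSAC [104.87576373187426]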
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- InRank: Incremental Low-Rank Learning [85.6380047359139]
Gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training.
However, existing training algorithms do not exploit the low-rank property to improve computational efficiency.
We design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices.
arXiv Detail & Related papers (2023-06-20T03:03:04Z)
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score; a sketch of this allocation follows the entry.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
arXiv Detail & Related papers (2023-03-18T22:36:25Z)
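A sketch of the budget-allocation idea in the AdaLoRA summary: score each weight matrix by importance and divide the global rank budget in proportion. The sensitivity-style score below (mean of |param * grad|) is a hedged stand-in; the paper defines its own importance metric.

```python
import torch

def allocate_budget(importance: dict, total_rank: int, r_min: int = 1) -> dict:
    """Split a global rank budget across weight matrices in proportion to
    their importance scores, with a floor of r_min per matrix (sketch)."""
    total = sum(importance.values())
    return {name: max(r_min, round(total_rank * s / total))
            for name, s in importance.items()}

def sensitivity(param: torch.Tensor) -> float:
    """|param * grad| averaged over the tensor, as a crude importance proxy
    (an assumption; call after loss.backward() so param.grad is populated)."""
    return (param.detach() * param.grad).abs().mean().item()

scores = {"q_proj": 0.9, "k_proj": 0.3, "v_proj": 0.7, "o_proj": 0.1}
print(allocate_budget(scores, total_rank=32))
# {'q_proj': 14, 'k_proj': 5, 'v_proj': 11, 'o_proj': 2}
```
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]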
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.