The Impact of Initialization on LoRA Finetuning Dynamics
- URL: http://arxiv.org/abs/2406.08447v1
- Date: Wed, 12 Jun 2024 17:38:20 GMT
- Title: The Impact of Initialization on LoRA Finetuning Dynamics
- Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu
- Abstract summary: We study the role of initialization in Low Rank Adaptation (LoRA).
We show that the first scheme (initializing B to zero and A randomly) on average yields better performance than the other scheme.
- Score: 13.074320303580361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start finetuning from the pretrained model, one can either initialize B to zero and A randomly (the default initialization in the PEFT package), or vice-versa. In both cases, the product BA is zero at initialization, so finetuning starts from the pretrained model. These two initialization schemes are seemingly similar: in principle, they should yield the same performance and share the same optimal learning rate. We demonstrate that this intuition is incorrect and that the first scheme (initializing B to zero and A randomly) on average yields better performance than the second. Our theoretical analysis shows that a possible reason is that the first initialization allows the use of larger learning rates (without causing output instability) than the second, resulting in more efficient learning under the first scheme. We validate our results with extensive experiments on LLMs.
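The two schemes compared in the abstract are easy to make concrete. Below is a minimal, hedged PyTorch sketch, not the authors' code or the PEFT implementation: the class name `LoRALinear`, the rank, the scaling, and the random-init standard deviations are illustrative assumptions. It shows the first scheme (B zero, A random, the PEFT default) and the second scheme (A zero, B random); in both cases BA = 0, so the wrapped layer initially reproduces the pretrained linear layer.

```python
# Minimal sketch of the two LoRA initialization schemes discussed above.
# Illustrative only: names and hyperparameters here are assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 scheme: str = "init_A"):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weight W stays frozen
        d_out, d_in = base.weight.shape
        self.scaling = alpha / rank
        self.A = nn.Parameter(torch.empty(rank, d_in))   # input -> rank
        self.B = nn.Parameter(torch.empty(d_out, rank))  # rank -> output
        if scheme == "init_A":
            # First scheme: A random, B zero (the PEFT default discussed in the paper).
            nn.init.normal_(self.A, std=d_in ** -0.5)
            nn.init.zeros_(self.B)
        elif scheme == "init_B":
            # Second scheme: A zero, B random.
            nn.init.zeros_(self.A)
            nn.init.normal_(self.B, std=d_out ** -0.5)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        # In both schemes B @ A == 0, so the adapted layer reproduces the
        # pretrained layer exactly at the start of finetuning.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)


# Quick check that both schemes start from the pretrained model:
base = nn.Linear(64, 32)
x = torch.randn(4, 64)
for scheme in ("init_A", "init_B"):
    layer = LoRALinear(base, scheme=scheme)
    assert torch.allclose(layer(x), base(x))
```

The sketch only fixes the initialization; the paper's contribution is the analysis showing that the first scheme tolerates larger learning rates without output instability, which is why it learns more efficiently on average.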
Related papers
- ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints [64.35580479051208]
In previous works, low-rank adapters (LoRA) are randomly initialized with a fixed rank across all attachment points.
In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method.
arXiv Detail & Related papers (2025-07-09T23:52:31Z) - Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics [23.84827135317107]
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method.
In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model.
arXiv Detail & Related papers (2025-05-29T07:33:03Z) - Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline [49.51385135697656]
Within machine learning-based planning, imitation learning (IL) is a common approach.
It primarily learns driving policies directly from supervised trajectory data.
It remains challenging to determine if the learned policy truly understands fundamental driving principles.
This work proposes a novel closed-loop simulator supporting both imitation and reinforcement learning.
arXiv Detail & Related papers (2025-04-20T18:51:26Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - HRP: High-Rank Preheating for Superior LoRA Initialization [58.3319586613105]
High-Rank Preheating (HRP) is proposed for initializing Low-Rank Adaptation (LoRA) fine-tuning.
HRP significantly enhances LoRA's generalization effectiveness across various models and tasks.
arXiv Detail & Related papers (2025-02-11T17:59:35Z) - One-step full gradient suffices for low-rank fine-tuning, provably and efficiently [10.843508549704959]
This paper studies how to improve the performance of Low-Rank Adaptation (LoRA) as guided by our theoretical analysis.
Our analysis leads to the LoRA-One algorithm (using One-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement.
arXiv Detail & Related papers (2025-02-03T10:50:03Z) - An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective, leveraging the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - On the Crucial Role of Initialization for Matrix Factorization [40.834791383134416]
This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates.
We introduce Nystrom initialization (NyGD) in both symmetric and asymmetric matrix factorization tasks and extend it to low-rank adapters (LoRA).
Our approach, NoRA, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
arXiv Detail & Related papers (2024-10-24T17:58:21Z) - One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation [13.585425242072173]
The most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA).
We propose to enhance LoRA by initializing the new weights in a data-driven manner, computing the singular value decomposition on minibatches of activation vectors.
We apply the resulting method, Explained Variance Adaptation (EVA), to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning.
arXiv Detail & Related papers (2024-10-09T17:59:06Z) - Unrolled denoising networks provably learn optimal Bayesian inference [54.79172096306631]
We prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP)
For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network converge to the same denoisers used in Bayes AMP.
arXiv Detail & Related papers (2024-09-19T17:56:16Z) - Learning effective pruning at initialization from iterative pruning [15.842658282636876]
We present an end-to-end neural network-based pruning-at-initialization (PaI) method to reduce training costs.
Our approach outperforms existing methods in high-sparsity settings.
Ours is the first neural network-based PaI method, and we conduct extensive experiments to validate the factors influencing this approach.
arXiv Detail & Related papers (2024-08-27T03:17:52Z) - Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees [63.18324983384337]
We introduce the first stochastic learning-to-rank method for Gradient Boosted Decision Trees (GBDTs).
Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix.
We incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only.
arXiv Detail & Related papers (2024-04-18T13:53:32Z) - Unsupervised Learning of Initialization in Deep Neural Networks via Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - Prior-Guided Adversarial Initialization for Fast Adversarial Training [84.56377396106447]
We investigate the difference between the training processes of adversarial examples (AEs) in fast adversarial training (FAT) and standard adversarial training (SAT).
We observe that the attack success rate of AEs in FAT gradually worsens in the late training stage, resulting in overfitting.
Based on the observation, we propose a prior-guided FGSM initialization method to avoid overfitting.
The proposed method can prevent catastrophic overfitting and outperform state-of-the-art FAT methods.
arXiv Detail & Related papers (2022-07-18T18:13:10Z) - Data-driven Weight Initialization with Sylvester Solvers [72.11163104763071]
We propose a data-driven scheme to initialize the parameters of a deep neural network.
We show that our proposed method is especially effective in few-shot and fine-tuning settings.
arXiv Detail & Related papers (2021-05-02T07:33:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.