Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics
- URL: http://arxiv.org/abs/2505.23194v2
- Date: Sun, 17 Aug 2025 05:48:16 GMT
- Title: Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics
- Authors: Shiwei Li, Xiandi Luo, Xing Tang, Haozhao Wang, Hao Chen, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
- Abstract summary: Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model.
- Score: 23.84827135317107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
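To make the two initialization schemes concrete, here is a minimal PyTorch sketch based on the abstract alone, not the authors' released code; the class name, the `nonzero_init` flag, the `alpha/rank` scaling, and the Gaussian standard deviations are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) contrasting standard
# zero initialization of B with non-zero initialization of both factors.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 nonzero_init: bool = False):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pretrained weight
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.empty(rank, d_in))
        self.B = nn.Parameter(torch.empty(d_out, rank))
        self.scale = alpha / rank
        nn.init.normal_(self.A, std=1.0 / math.sqrt(d_in))
        if nonzero_init:
            # Both factors random: the initial product B @ A adds small
            # random noise to the pretrained weight, so fine-tuning does
            # not start exactly from the pretrained model.
            nn.init.normal_(self.B, std=1.0 / math.sqrt(rank))
        else:
            # Standard LoRA: B = 0, so B @ A = 0 at initialization.
            nn.init.zeros_(self.B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Under this sketch, the abstract's claim is that the `nonzero_init=True` variant tolerates a wider range of learning rates, particularly smaller ones, at the cost of perturbing the starting point.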
Related papers
- Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation [14.315981403487266]
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models.
Stable-LoRA is a weight-shrinkage optimization strategy that dynamically enhances the stability of LoRA feature learning.
Experiments show that Stable-LoRA consistently outperforms other baselines across diverse models and tasks.
arXiv Detail & Related papers (2026-03-05T14:15:51Z)
- Don't Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning [82.16625951603315]
NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients.
On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters.
NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction.
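A hedged sketch of a learnable rational activation with a low-rank update to its coefficients, in the spirit of this summary; the paper's exact parameterization likely differs, and the polynomial degrees, the safe denominator, and the rank-1 coefficient delta here are assumptions.

```python
# Illustrative rational activation P(x)/Q(x) whose frozen base coefficients
# receive a trainable low-rank (rank-1) update.
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    def __init__(self, num_degree: int = 5, den_degree: int = 4, rank: int = 1):
        super().__init__()
        # Frozen base coefficients (e.g. fitted to approximate GELU).
        self.p0 = nn.Parameter(torch.randn(num_degree + 1) * 0.1, requires_grad=False)
        self.q0 = nn.Parameter(torch.randn(den_degree) * 0.1, requires_grad=False)
        # Trainable low-rank factors producing coefficient deltas.
        self.u = nn.Parameter(torch.zeros(num_degree + 1 + den_degree, rank))
        self.v = nn.Parameter(torch.randn(rank) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.u @ self.v                      # low-rank coefficient update
        p = self.p0 + delta[: self.p0.numel()]
        q = self.q0 + delta[self.p0.numel():]
        num = sum(c * x**i for i, c in enumerate(p))
        den = 1.0 + torch.abs(sum(c * x**(i + 1) for i, c in enumerate(q)))
        return num / den                             # denominator >= 1, so safe
```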
arXiv Detail & Related papers (2025-09-16T16:47:03Z)
- ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints [64.35580479051208]
In previous works, low-rank adapters (LoRA) are randomly initialized with a fixed rank across all attachment points.
In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method.
arXiv Detail & Related papers (2025-07-09T23:52:31Z)
- Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning [13.823795660384262]
We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces.
Our findings demonstrate that it is possible to simulate full fine-tuning in low-rank subspaces without sacrificing performance.
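A hedged sketch of the general idea: initialize the LoRA factors from a truncated SVD of an estimated full fine-tuning update. The paper's exact construction may differ; `init_from_update`, the single-batch gradient estimate, and the square-root split of singular values are illustrative assumptions.

```python
# Initialize LoRA factors so that B @ A is the best rank-r approximation
# of an estimated full fine-tuning update delta_w.
import torch

def init_from_update(delta_w: torch.Tensor, rank: int):
    """delta_w: estimate of the full fine-tuning update for one weight,
    e.g. -lr * gradient averaged over a few calibration batches."""
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank]
    B = U_r * S_r.sqrt()              # (d_out, r)
    A = S_r.sqrt()[:, None] * Vh_r    # (r, d_in)
    return A, B                       # B @ A == rank-r truncation of delta_w
```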
arXiv Detail & Related papers (2024-11-29T09:10:30Z)
- Gradient dynamics for low-rank fine-tuning beyond kernels [9.275532709125242]
We study low-rank fine-tuning in a student-teacher setting.
We prove, under mild assumptions, that a student model initialized at the base model and trained with online gradient descent will converge to the teacher.
arXiv Detail & Related papers (2024-11-23T00:00:28Z)
- LoRA vs Full Fine-tuning: An Illusion of Equivalence [76.11938177294178]
We study how Low-Rank Adaptation (LoRA) and full fine-tuning change pre-trained models.
We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure.
We extend the finding that LoRA forgets less than full fine-tuning and find that its forgetting is largely localized to the intruder dimensions.
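A rough sketch of how such an "intruder dimension" diagnostic could be computed: flag singular vectors of the fine-tuned weight that have low cosine similarity to every singular vector of the pretrained weight. The top-k cutoff, the threshold, and the use of left singular vectors are assumptions, not necessarily the paper's exact protocol.

```python
# Flag fine-tuned singular vectors dissimilar from all pretrained ones.
import torch

def intruder_dimensions(w_pre: torch.Tensor, w_ft: torch.Tensor,
                        k: int = 10, threshold: float = 0.6):
    U_pre, _, _ = torch.linalg.svd(w_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(w_ft, full_matrices=False)
    # Max |cosine| of each top-k fine-tuned singular vector vs all pretrained.
    sims = (U_ft[:, :k].T @ U_pre).abs().max(dim=1).values
    return torch.nonzero(sims < threshold).flatten()   # intruder indices
```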
arXiv Detail & Related papers (2024-10-28T17:14:01Z)
- CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models [7.108651381160281]
Low-Rank Adaptation (LoRA) strategy balances efficiency and performance in fine-tuning large models.
We propose CoRA: leveraging shared knowledge to optimize LoRA training by substituting its matrix $B$ with a common subspace from large models.
Our experiments show that the first approach achieves the same efficacy as the original LoRA fine-tuning while training only half as many parameters.
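A hedged sketch of this idea: freeze $B$ as a shared basis and train only $A$. The choice of basis below (top-r left singular vectors of the pretrained weight) is a stand-in assumption; CoRA derives its common subspace from large models.

```python
# LoRA variant with a frozen shared subspace in place of B; only A trains.
import torch
import torch.nn as nn

class CoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        # Stand-in for the shared subspace: top-r left singular vectors.
        U, _, _ = torch.linalg.svd(self.base.weight.detach(), full_matrices=False)
        self.register_buffer("B", U[:, :rank].clone())   # frozen (d_out, r)
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))  # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T) @ self.B.T
```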
arXiv Detail & Related papers (2024-08-31T12:48:27Z)
- LoRA-Pro: Are Low-Rank Adapters Properly Optimized? [121.0693322732454]
Low-rank adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models.
Despite its computational efficiency, LoRA still yields inferior performance compared to full fine-tuning.
We introduce LoRA-Pro, a method that enhances LoRA's performance by strategically adjusting the gradients of low-rank matrices.
arXiv Detail & Related papers (2024-07-25T17:57:12Z)
- The Impact of Initialization on LoRA Finetuning Dynamics [13.074320303580361]
We study the role of initialization in Low Rank Adaptation (LoRA).
We show that the first scheme (initializing $B$ to zero and $A$ to random) on average yields better performance than the other scheme.
arXiv Detail & Related papers (2024-06-12T17:38:20Z)
- DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Aiming to resemble the learning capacity of FT based on these findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA).
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
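A minimal PyTorch sketch of that decomposition, assuming a per-output-row norm and magnitude; names, shapes, and init scales are illustrative, not the official implementation.

```python
# W' = m * (W0 + B @ A) / ||W0 + B @ A||, norm taken per output row:
# the magnitude m is trained directly, LoRA adapts the direction only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        w0 = base.weight.detach().clone()          # (d_out, d_in), frozen
        self.register_buffer("w0", w0)
        self.m = nn.Parameter(w0.norm(dim=1))      # magnitude, one per row
        self.A = nn.Parameter(torch.randn(rank, w0.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(w0.shape[0], rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.w0 + self.B @ self.A              # updated direction
        w = self.m[:, None] * v / v.norm(dim=1, keepdim=True)
        return F.linear(x, w)                      # bias omitted for brevity
```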
arXiv Detail & Related papers (2024-02-14T17:59:34Z)
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
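A rough sketch of importance-scored budget allocation in this spirit: each adapter's update is parameterized with singular values, whose smoothed sensitivity scores decide which ranks survive pruning. The smoothing rule and pruning criterion below are illustrative assumptions, not AdaLoRA's exact formulas.

```python
# Score singular values by smoothed |theta * grad| sensitivity, then keep
# only the globally most important ones across all adapters.
import torch

@torch.no_grad()
def update_importance(lam, grad_lam, ema, beta: float = 0.85):
    """lam: (r,) singular values of one adapter; ema: running importance."""
    sensitivity = (lam * grad_lam).abs()            # first-order saliency
    ema.mul_(beta).add_((1 - beta) * sensitivity)   # exponential moving average
    return ema

@torch.no_grad()
def prune_to_budget(lams, emas, budget: int):
    """Keep the `budget` most important singular values across adapters."""
    scores = torch.cat(emas)
    cutoff = scores.topk(budget).values.min()
    for lam, ema in zip(lams, emas):
        lam.mul_((ema >= cutoff).float())           # zero out pruned triplets
    return lams
```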
arXiv Detail & Related papers (2023-03-18T22:36:25Z)
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z)