Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
- URL: http://arxiv.org/abs/2602.06204v1
- Date: Thu, 05 Feb 2026 21:28:59 GMT
- Title: Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
- Authors: Nan Chen, Soledad Villar, Soufiane Hayou
- Abstract summary: Low-Rank Adaptation (LoRA) is a tool for parameter-efficient finetuning of large models. It is unclear how the optimal learning rate scales with adapter rank. We introduce Maximal-Update Adaptation ($μ$A), a theoretical framework that characterizes how the "optimal" learning rate should scale.
- Score: 24.03926595342341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($μ$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $μ$A is inspired by the Maximal-Update Parametrization ($μ$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision-language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
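To make the two regimes concrete, here is a minimal sketch of how a learning rate tuned at one rank could be carried to another. The function name and regime labels are illustrative assumptions; the paper's precise criterion for which initialization and LoRA scaling factor yields which regime is in the full text.

```python
# A toy illustration of the two scaling regimes summarized in the abstract,
# assuming a base learning rate already tuned at `base_rank`.

def scaled_lora_lr(base_lr: float, base_rank: int, new_rank: int,
                   regime: str = "rank_invariant") -> float:
    """Transfer a LoRA learning rate across adapter ranks.

    "rank_invariant": the optimal LR stays roughly constant across ranks.
    "inverse_rank":   the optimal LR scales as 1/rank, so rescale by
                      base_rank / new_rank.
    """
    if regime == "rank_invariant":
        return base_lr
    if regime == "inverse_rank":
        return base_lr * base_rank / new_rank
    raise ValueError(f"unknown regime: {regime}")

# Example: an LR tuned at rank 8, moved to rank 64 in the inverse regime.
print(scaled_lora_lr(2e-4, base_rank=8, new_rank=64, regime="inverse_rank"))  # 2.5e-05
```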
Related papers
- Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning [48.66442009036754]
Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model fine-tuning.
In this work, we re-evaluate four representative LoRA variants alongside vanilla LoRA.
We find that different LoRA methods favor distinct learning rate ranges.
arXiv Detail & Related papers (2026-02-04T19:36:20Z)
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of language models.
We find that scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations.
For compute-optimal scaling, we find that scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across language models.
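As a rough illustration of the transfer rule summarized above, the sketch below rescales a learning rate and an independent weight decay from a proxy width to a target width. Treating both as $1/\mathrm{width}$ scalings for matrix-like parameters is a simplifying assumption here; the paper's exact per-parameter rules may differ.

```python
# Hedged sketch: width-based hyperparameter transfer, assuming muP-style
# 1/width scaling for the hidden-weight learning rate and 1/width scaling
# for independent weight decay.

def transfer_hparams(base_lr: float, base_wd: float,
                     base_width: int, new_width: int) -> dict:
    scale = base_width / new_width
    return {
        "lr_hidden": base_lr * scale,     # hidden-weight LR shrinks with width
        "weight_decay": base_wd * scale,  # independent WD ~ 1/width
    }

# Example: hyperparameters tuned at width 512, transferred to width 4096.
print(transfer_hparams(1e-3, 0.1, base_width=512, new_width=4096))
```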
arXiv Detail & Related papers (2025-12-05T11:03:41Z)
- LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning [5.980897761790243]
We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning.
LoFT aligns the model's internal dynamics with those of updating all model weights.
Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning.
arXiv Detail & Related papers (2025-05-27T14:54:24Z)
- Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA [26.079123341965687]
We study low-rank learning and analyze how LoRA ranks and placements affect learning and forgetting.
A higher-rank LoRA improves task learning (plasticity) but increases forgetting, while a lower-rank LoRA enhances stability but limits adaptation.
Motivated by this, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which continually updates PTMs with LoRA adapters of adaptively optimized ranks.
arXiv Detail & Related papers (2024-12-01T23:41:42Z)
- Parameter Efficient Instruction Tuning: An Empirical Study [1.5186090363516862]
Parameter Efficient Finetuning (PEFT) has arisen as a cost-effective practice for instruction tuning because of its significantly smaller computational, memory, and storage costs compared to full finetuning.
Our empirical study shows that only LoRA and adapters can come close to full finetuning under ideal training settings.
arXiv Detail & Related papers (2024-11-25T07:06:09Z)
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but a gap remains between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
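The boosting idea can be sketched as repeatedly fitting a rank-1 correction and merging it into the weights, so the accumulated update is no longer rank-limited. The toy "fit" below uses a truncated SVD of a residual purely for illustration; XGBLoRA's actual training loop is more involved.

```python
import torch

# Toy sketch of rank-1 boosting: each round fits the best rank-1 update to
# the remaining residual and folds it into the weight matrix.

def boost_rank1(W: torch.Tensor, target: torch.Tensor, rounds: int) -> torch.Tensor:
    for _ in range(rounds):
        U, S, Vh = torch.linalg.svd(target - W)      # residual to reduce
        W = W + S[0] * torch.outer(U[:, 0], Vh[0])   # merge top rank-1 piece
    return W

target = torch.randn(8, 8)
W = boost_rank1(torch.zeros(8, 8), target, rounds=4)
print((target - W).norm())  # residual shrinks as rounds accumulate
```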
arXiv Detail & Related papers (2024-10-25T17:07:13Z)
- DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Guided by these findings, and aiming to match the learning capacity of FT, we propose Weight-Decomposed Low-Rank Adaptation (DoRA).
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
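A minimal sketch of that decomposition, following the common DoRA formulation: the low-rank-adapted weight is normalized column-wise to give a direction, then rescaled by a learnable magnitude vector. Shapes and the column-norm convention are assumptions here, not a definitive implementation.

```python
import torch

# Sketch of the magnitude/direction decomposition: m holds per-column
# magnitudes, and the adapted weight V = W0 + B @ A supplies the direction.

def dora_weight(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
                m: torch.Tensor) -> torch.Tensor:
    V = W0 + B @ A                               # LoRA-style adapted weight
    direction = V / V.norm(dim=0, keepdim=True)  # unit-norm columns
    return m * direction                         # magnitude * direction

d_out, d_in, r = 16, 32, 4
W0 = torch.randn(d_out, d_in)
B, A = torch.zeros(d_out, r), torch.randn(r, d_in)
m = W0.norm(dim=0, keepdim=True)   # initialize magnitude from W0
W = dora_weight(W0, B, A, m)       # equals W0 at initialization (B is zero)
```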
arXiv Detail & Related papers (2024-02-14T17:59:34Z)
- Flora: Low-Rank Adapters Are Secretly Gradient Compressors [30.224822087562163]
Low-rank adaptation (LoRA) was proposed to reduce the size of optimizer states by training fewer parameters.
LoRA restricts overall weight update matrices to be low-rank, limiting the model performance.
We propose Flora, which is able to achieve high-rank updates by resampling the projection matrices.
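The compressor view can be sketched as projecting each gradient through a random matrix and periodically resampling that matrix (here, by changing a seed) so updates are not confined to one fixed low-rank subspace. Function names are illustrative assumptions, not Flora's API, and optimizer-state handling is omitted.

```python
import torch

# Sketch: compress a gradient with a seeded random projection; regenerating
# the projection from the same seed lets us decompress without storing it.

def compress_grad(G: torch.Tensor, r: int, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)             # reproducible P
    P = torch.randn(G.shape[1], r, generator=gen) / r ** 0.5
    return G @ P                                          # (out, in) -> (out, r)

def decompress(C: torch.Tensor, d_in: int, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)             # same seed -> same P
    P = torch.randn(d_in, C.shape[1], generator=gen) / C.shape[1] ** 0.5
    return C @ P.T                                        # unbiased in expectation
```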
arXiv Detail & Related papers (2024-02-05T18:50:39Z)
- PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]
We introduce PRILoRA, which allocates a linearly increasing rank to each layer and performs pruning throughout the training process.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
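A minimal sketch of a linearly increasing per-layer rank schedule of the kind described; the endpoint ranks and rounding are illustrative choices, not the paper's values, and the pruning step is omitted.

```python
# Assign each layer a rank that grows linearly from r_min (bottom layer)
# to r_max (top layer).

def linear_rank_schedule(num_layers: int, r_min: int = 4, r_max: int = 12):
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_min + step * i) for i in range(num_layers)]

print(linear_rank_schedule(12))  # [4, 5, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12]
```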
arXiv Detail & Related papers (2024-01-20T20:25:17Z)
- Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
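The dynamic rank adjustment can be sketched with a gate vector between the LoRA factors, sparsified by the soft-threshold (proximal) step of an L1 penalty; zeroed gate entries switch off their rank-1 components. Names, shapes, and the threshold value are illustrative assumptions.

```python
import torch

# Proximal step of an L1 penalty: entries with |g| <= lam collapse to zero.
def soft_threshold(g: torch.Tensor, lam: float) -> torch.Tensor:
    return torch.sign(g) * torch.clamp(g.abs() - lam, min=0.0)

# Gated low-rank path: B diag(g) A x, where sparse g sets the effective rank.
def sora_update(x, A, g, B, lam=1e-3):
    g = soft_threshold(g, lam)
    return (x @ A.T) * g @ B.T

x = torch.randn(2, 32)                      # (batch, d_in)
A, B = torch.randn(4, 32), torch.randn(16, 4)
g = torch.randn(4)                          # one gate per rank component
y = sora_update(x, A, g, B)                 # (2, 16)
```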
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
- AutoDrop: Training Deep Learning Models with Automatic Learning Rate Drop [16.396327849817464]
We develop an algorithm that realizes the learning rate drop $\textit{automatically}$.
We show that our method improves over SOTA training approaches.
arXiv Detail & Related papers (2021-11-30T11:55:21Z)