One-step full gradient suffices for low-rank fine-tuning, provably and efficiently
- URL: http://arxiv.org/abs/2502.01235v1
- Date: Mon, 03 Feb 2025 10:50:03 GMT
- Title: One-step full gradient suffices for low-rank fine-tuning, provably and efficiently
- Authors: Yuanhe Zhang, Fanghui Liu, Yudong Chen
- Abstract summary: This paper studies how to improve the performance of Low-Rank Adaptation (LoRA), guided by our theoretical analysis. Our analysis leads to the *LoRA-One* algorithm (using *One*-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement.
- Score: 10.843508549704959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies how to improve the performance of Low-Rank Adaptation (LoRA), guided by our theoretical analysis. Our first set of theoretical results shows that, for random initialization and linear models, (i) LoRA aligns with a certain singular subspace of the one-step gradient of full fine-tuning, and (ii) preconditioners improve convergence in the high-rank case. These insights motivate us to focus on preconditioned LoRA with a specific spectral initialization strategy for aligning with those subspaces. For both linear and nonlinear models, we prove that alignment and generalization guarantees can be achieved directly at initialization, and that subsequent linear convergence can also be established. Our analysis leads to the *LoRA-One* algorithm (using *One*-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement over vanilla LoRA and its variants on several benchmarks. Our theoretical analysis, based on decoupling the learning dynamics and characterizing how spectral initialization contributes to feature learning, may be of independent interest for understanding matrix sensing and deep learning theory. The source code can be found at https://github.com/YuanheZ/LoRA-One.
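Read literally, the recipe is: take one full fine-tuning gradient step, extract its top-r singular subspace, and initialize the LoRA factors there. Below is a minimal sketch of that idea for a linear model with squared loss; the function name `lora_one_init` and the scaling are our own illustrative choices, not the authors' released implementation (see the linked repository for that).

```python
import torch

def lora_one_init(W0, X, Y, r):
    """Minimal sketch: spectral initialization of LoRA factors from the
    top-r singular subspace of the one-step full fine-tuning gradient."""
    n = X.shape[0]
    # Full fine-tuning gradient of the loss 0.5 * ||X W - Y||_F^2 / n at W0.
    G = X.T @ (X @ W0 - Y) / n                      # shape: d_in x d_out
    # Spectral initialization: SVD of the negative gradient direction.
    U, S, Vh = torch.linalg.svd(-G, full_matrices=False)
    sqrtS = S[:r].sqrt()
    B = U[:, :r] * sqrtS                            # d_in x r
    A = sqrtS[:, None] * Vh[:r, :]                  # r x d_out
    return B, A                                     # adapted weights: W0 + B @ A

# Toy check on a synthetic linear fine-tuning task.
torch.manual_seed(0)
X = torch.randn(512, 32)
W_star = torch.randn(32, 16)
W0 = W_star + 0.1 * torch.randn(32, 16)            # "pre-trained" weights near target
B, A = lora_one_init(W0, X, X @ W_star, r=4)
err0 = (W0 - W_star).norm()
err1 = (W0 + B @ A - W_star).norm()
print(f"error before: {err0:.3f}, after spectral init: {err1:.3f}")
```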
Related papers
- Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization [7.940066909711888]
We analyze the learning dynamics of Low-Rank Adaptation (LoRA) for matrix factorization under gradient flow (GF).
Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix.
arXiv Detail & Related papers (2025-03-10T06:57:10Z)
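For context, a common way to set up such a gradient-flow analysis for a LoRA-style factorization is the following (our notation, assuming the standard matrix-factorization loss; the paper's exact setup may differ):

```latex
\mathcal{L}(B, A) = \tfrac{1}{2}\,\lVert W_0 + BA - W^{*} \rVert_F^2, \qquad
\dot{B} = -(W_0 + BA - W^{*})A^{\top}, \qquad
\dot{A} = -B^{\top}(W_0 + BA - W^{*}).
```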
- Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment [20.382810396966473]
Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs).
Current methods optimize LoRA by initializing with static singular value decomposition subsets, leading to suboptimal leveraging of pre-trained knowledge.
We propose GOAT (Great LoRA Mixture-of-Experts).
GOAT integrates relevant priors using an SVD-structured MoE and aligns optimization with fully fine-tuned MoE by deriving a theoretical scaling factor.
arXiv Detail & Related papers (2025-02-24T06:48:13Z)
- Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models.
Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored.
We propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z)
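The KL-regularized fine-tuning objective referred to here is standard; in the usual notation, with $\pi_{\mathrm{ref}}$ the reference (pre-trained) policy and $\beta$ the regularization strength:

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\,
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\right]
```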
- SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks.
Existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks.
We propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
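A minimal sketch of what "separating the magnitude and direction" of a LoRA component could look like, in the spirit of weight normalization; this is our reading of the abstract with hypothetical names, not the authors' code:

```python
import torch

class DecoupledLoRA(torch.nn.Module):
    """Sketch: a learnable scalar magnitude times a unit-norm low-rank
    direction, so the two can be learned (and frozen) separately."""
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(d_in, r) / r ** 0.5)
        self.B = torch.nn.Parameter(torch.randn(r, d_out) * 1e-3)  # small init
        self.magnitude = torch.nn.Parameter(torch.ones(()))

    def delta_w(self):
        direction = self.A @ self.B                  # low-rank direction
        norm = direction.norm().clamp_min(1e-8)      # avoid divide-by-zero
        return self.magnitude * direction / norm     # decoupled update
```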
- Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning [13.823795660384262]
Low-rank adapters have become a standard approach for efficiently fine-tuning large language models (LLMs).
We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces.
Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces and achieve significant efficiency gains without sacrificing performance.
arXiv Detail & Related papers (2024-11-29T09:10:30Z)
- Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for fine-tuning models.
LoRA often underperforms compared to full-parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z)
- LoRA-Pro: Are Low-Rank Adapters Properly Optimized? [121.0693322732454]
Low-rank adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models.
Despite its computational efficiency, LoRA still yields inferior performance compared to full fine-tuning.
We introduce LoRA-Pro, a method that enhances LoRA's performance by strategically adjusting the gradients of low-rank matrices.
arXiv Detail & Related papers (2024-07-25T17:57:12Z)
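One natural reading of "strategically adjusting the gradients" (a hedged sketch; the paper's exact formulation may differ) is to pick adjusted low-rank gradients whose induced first-order change in $W = W_0 + BA$ is as close as possible to the full fine-tuning gradient $g$:

```latex
\min_{\tilde{g}_B,\,\tilde{g}_A}\;
\bigl\lVert\, \tilde{g}_B A + B \tilde{g}_A - g \,\bigr\rVert_F^2
```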
- Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models [10.827800772359844]
We study the computational limits of Low-Rank Adaptation (LoRA) updates for fine-tuning transformer-based models.
Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup.
We prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients.
arXiv Detail & Related papers (2024-06-05T10:44:08Z)
- Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models [45.72323731094864]
Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method.
In this work, we study the enhancement of LoRA training by introducing an $r \times r$ preconditioner in each gradient step.
arXiv Detail & Related papers (2024-02-04T05:05:43Z)
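A sketch of what an $r \times r$ preconditioned gradient step can look like for $\Delta W = BA$ (our illustration of the general idea, with a damping term added for invertibility; not necessarily the authors' exact update rule):

```python
import torch

def preconditioned_lora_step(B, A, G, lr=1e-2, damping=1e-6):
    """One gradient step on (B, A) where each raw gradient is rescaled by
    the inverse of an r x r Gram matrix of the other factor.
    G is dLoss/d(Delta W) with Delta W = B @ A (B: d_in x r, A: r x d_out)."""
    r = A.shape[0]
    I = torch.eye(r, dtype=A.dtype)
    # Chain rule gives grad_B = G @ A.T and grad_A = B.T @ G; the r x r
    # preconditioners (A A^T)^{-1} and (B^T B)^{-1} rescale these.
    B_new = B - lr * G @ A.T @ torch.linalg.inv(A @ A.T + damping * I)
    A_new = A - lr * torch.linalg.inv(B.T @ B + damping * I) @ B.T @ G
    return B_new, A_new
```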
- SHOT: Suppressing the Hessian along the Optimization Trajectory for Gradient-Based Meta-Learning [28.26143547479141]
We introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory).
SHOT adds little computational overhead to the baseline model.
We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline.
arXiv Detail & Related papers (2023-10-04T11:43:08Z)
- Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves faster than $n^{-1/4}$ $L^2$ convergence in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z)
- Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to *Non-uniform Smoothness* (NS) and the *Non-uniform Łojasiewicz* inequality (NŁ).
These new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z)
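For reference, conditions of this kind are commonly stated along the following lines (one standard form, with $f^{*}$ the global minimum; the paper's precise definitions may differ):

```latex
\text{NS:}\quad
\lVert \nabla^2 f(x) \rVert \le \ell\!\left( \lVert \nabla f(x) \rVert \right),
\qquad
\text{NŁ:}\quad
\lVert \nabla f(x) \rVert \ge C(x)\,\lvert f(x) - f^{*} \rvert^{\xi}.
```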
- Towards Understanding Label Smoothing [36.54164997035046]
Label smoothing regularization (LSR) has seen great success in training deep neural networks.
We show that an appropriate LSR can help to speed up convergence by reducing the variance.
We propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA).
arXiv Detail & Related papers (2020-06-20T20:36:17Z)
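A minimal sketch of label smoothing plus a two-stage schedule in the spirit of TSLA (smooth early, then switch smoothing off; the hard-switch rule here is our simplification):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    """Standard label-smoothing loss: mix the one-hot target with uniform."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)                 # cross-entropy vs. uniform
    return ((1.0 - eps) * nll + eps * uniform).mean()

def tsla_loss(logits, targets, step, switch_step, eps=0.1):
    """Two-stage schedule: label smoothing in stage one, plain cross-entropy
    in stage two (our simplified reading of the TSLA idea)."""
    return smoothed_cross_entropy(logits, targets,
                                  eps if step < switch_step else 0.0)
```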
- A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton [49.23948907229656]
Bi-level Descent Aggregation (BDA) is a flexible and modularized algorithmic framework for generic bi-level optimization.
We derive a new methodology to prove the convergence of BDA without the LLS condition.
Our investigations also demonstrate that BDA is compatible with a variety of particular first-order computation modules.
arXiv Detail & Related papers (2020-06-07T05:18:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.