Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models
- URL: http://arxiv.org/abs/2406.03136v1
- Date: Wed, 5 Jun 2024 10:44:08 GMT
- Title: Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models
- Authors: Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, Zhao Song, Han Liu
- Abstract summary: We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models.
Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup.
We prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients.
- Score: 10.827800772359844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-$r$ LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence $\mathbf{X}$, pretrained weights $\mathbf{W^\star}$, and adapter matrices $\alpha \mathbf{B} \mathbf{A} / r$. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only $\mathbf{W}_V$ and $\mathbf{W}_Q$) and full adaptations (e.g., $\mathbf{W}_Q$, $\mathbf{W}_V$, and $\mathbf{W}_K$) of weights in attention heads.
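The "possible algorithmic speedup" the abstract points to hinges on never materializing the rank-$r$ product $\alpha \mathbf{B} \mathbf{A} / r$: since $\mathbf{X}(\mathbf{W^\star} + \alpha \mathbf{B} \mathbf{A}/r) = \mathbf{X}\mathbf{W^\star} + (\alpha/r)(\mathbf{X}\mathbf{B})\mathbf{A}$, the adapter adds only $O(ndr)$ work instead of $O(nd^2)$. Below is a minimal NumPy sketch of this associativity trick; the shapes and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative shapes: sequence length n, model width d, LoRA rank r << d.
n, d, r, alpha = 512, 1024, 8, 16.0

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))        # input sequence
W_star = rng.standard_normal((d, d))   # frozen pretrained weight
B = rng.standard_normal((d, r))        # LoRA adapter factors
A = rng.standard_normal((r, d))

# Naive: materialize the adapted weight; costs O(d^2 r) to form B @ A
# and O(d^2) extra memory for the dense update.
W_adapted = W_star + (alpha / r) * B @ A
Y_naive = X @ W_adapted

# Low-rank-aware: never form B @ A; the adapter contributes only O(n d r).
Y_fast = X @ W_star + (alpha / r) * (X @ B) @ A

assert np.allclose(Y_naive, Y_fast)
```

This sketch removes only the adapter-side cost; the paper's nearly linear algorithms go further, approximating the attention and gradient computations themselves with chained low-rank factors, which applies only below the norm threshold identified in the abstract.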
Related papers
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restraints of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS can achieve $\tilde{O}(\dots)$
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - A Constrained BA Algorithm for Rate-Distortion and Distortion-Rate Functions [13.570794979535934]
We propose a modification of the Blahut-Arimoto (BA) algorithm for computing rate-distortion functions.
The modified algorithm directly computes the rate-distortion (RD) function for a given target distortion.
arXiv Detail & Related papers (2023-05-04T08:41:03Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Learning Transition Operators From Sparse Space-Time Samples [11.859913430860335]
We consider the nonlinear problem of learning a transition operator $\mathbf{A}$ from partial observations at different times.
We show that no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator.
arXiv Detail & Related papers (2022-12-01T18:33:59Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper bounds can be extended to two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - A Law of Iterated Logarithm for Multi-Agent Reinforcement Learning [3.655021726150368]
In Multi-Agent Reinforcement Learning (MARL), multiple agents interact with a common environment, as well as with each other, to solve a shared sequential decision-making problem.
We derive a novel law of iterated logarithm for a family of distributed nonlinear approximation schemes that is useful in MARL.
arXiv Detail & Related papers (2021-10-27T08:01:17Z) - Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs [45.87981728307819]
The ability to compare and align related datasets living in heterogeneous spaces plays an increasingly important role in machine learning.
The Gromov-Wasserstein (GW) formalism can help tackle this problem.
arXiv Detail & Related papers (2021-06-02T12:50:56Z) - Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$, where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z) - Convergence of Online Adaptive and Recurrent Optimization Algorithms [0.0]
We prove local convergence of several notable descent algorithms used in machine learning.
We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions.
arXiv Detail & Related papers (2020-05-12T09:48:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.