Asymmetry in Low-Rank Adapters of Foundation Models
- URL: http://arxiv.org/abs/2402.16842v2
- Date: Tue, 27 Feb 2024 18:06:29 GMT
- Title: Asymmetry in Low-Rank Adapters of Foundation Models
- Authors: Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de
Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh
Ghassemi, Mikhail Yurochkin, Justin Solomon
- Abstract summary: This paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices.
We show that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random untrained $A$ should perform nearly as well as a fine-tuned one.
- Score: 47.310550805920585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter-efficient fine-tuning optimizes large, pre-trained foundation
models by updating a subset of parameters; in this class, Low-Rank Adaptation
(LoRA) is particularly effective. Inspired by an effort to investigate the
different roles of LoRA matrices during fine-tuning, this paper characterizes
and leverages unexpected asymmetry in the importance of low-rank adapter
matrices. Specifically, when updating the parameter matrices of a neural
network by adding a product $BA$, we observe that the $B$ and $A$ matrices have
distinct functions: $A$ extracts features from the input, while $B$ uses these
features to create the desired output. Based on this observation, we
demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning
$A$, and that a random untrained $A$ should perform nearly as well as a
fine-tuned one. Using an information-theoretic lens, we also bound the
generalization of low-rank adapters, showing that the parameter savings of
exclusively training $B$ improves the bound. We support our conclusions with
experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.
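The adapter structure described in the abstract can be illustrated with a minimal NumPy sketch (not the authors' code; layer sizes, rank, and initialization scale are illustrative assumptions). It shows the LoRA update $W_0 + BA$ with $A$ kept as a frozen random projection and only $B$ treated as trainable, matching the paper's asymmetry claim:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4  # illustrative sizes, not from the paper

# Frozen pretrained weight W0 (stand-in random values).
W0 = rng.standard_normal((d_out, d_in))

# LoRA factors: A extracts rank-r features from the input,
# B maps those features to the output correction.
# Per the paper's result, A can remain a frozen random matrix
# while only B is fine-tuned.
A = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)  # frozen, random
B = np.zeros((d_out, rank))                            # trainable, zero-init

def lora_forward(x):
    """Adapted layer: y = W0 x + B (A x)."""
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as a no-op update:
assert np.allclose(lora_forward(x), W0 @ x)
```

Training only $B$ halves the adapter's trainable parameters relative to training both factors, which is the parameter saving the generalization bound in the paper exploits.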
Related papers
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z) - A Single Linear Layer Yields Task-Adapted Low-Rank Matrices [4.695004706877747]
Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix $W_0$ with a delta matrix $\Delta W$.
We show that CondLoRA maintains a performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA.
arXiv Detail & Related papers (2024-03-22T04:38:42Z) - RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation [30.797422827190278]
We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis.
RoSA trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixed pretrained weights.
We show that RoSA outperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at the same parameter budget.
arXiv Detail & Related papers (2024-01-09T17:09:01Z) - GIFT: Generative Interpretable Fine-Tuning [8.481707805559589]
We present Generative Interpretable Fine-Tuning (GIFT) for parameter-efficient fine-tuning of pretrained Transformer backbones.
$\Theta$ can be shared by all layers selected for fine-tuning, or can be layer-type specific.
We show the output of the first linear layer (i.e., $\omega \cdot \phi$) is surprisingly interpretable.
arXiv Detail & Related papers (2023-12-01T16:33:57Z) - AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
arXiv Detail & Related papers (2023-03-18T22:36:25Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale
Pre-Trained Models via Label Representation Learning [47.742220473129684]
$\mathcal{Y}$-tuning learns dense representations for labels defined in a given task and aligns them to the fixed feature representation.
For $\text{DeBERTa}_{\text{XXL}}$ with 1.6 billion parameters, $\mathcal{Y}$-tuning achieves more than $96\%$ of full fine-tuning performance on the GLUE Benchmark.
arXiv Detail & Related papers (2022-02-20T13:49:34Z) - Supervised Quantile Normalization for Low-rank Matrix Approximation [50.445371939523305]
We learn the parameters of quantile normalization operators that can operate row-wise on the values of $X$ and/or of its factorization $UV$ to improve the quality of the low-rank representation of $X$ itself.
We demonstrate the applicability of these techniques on synthetic and genomics datasets.
arXiv Detail & Related papers (2020-02-08T21:06:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.