SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models
- URL: http://arxiv.org/abs/2409.00055v5
- Date: Wed, 20 Nov 2024 07:08:22 GMT
- Title: SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models
- Authors: Yang Cao
- Abstract summary: We propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel PEFT method.
Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r \text{diag}(S_r) V^\top_r$.
- Score: 5.573502364188814
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel PEFT method. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r \text{diag}(S_r) V^\top_r$. These parts are initialized by performing singular value decomposition (SVD) on pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which we prove decreases the condition number of $W_p$ and makes optimization more efficient. SORSA adapters can be merged during inference, eliminating any inference latency. We also introduce an SVD-based method for analyzing how the parameters vary during adaptation, and analyze SORSA's advantage in minimizing this alteration. Moreover, SORSA converges faster than LoRA and PiSSA in our experiments. On the GSM-8K benchmark, Llama 2 7B adapted using SORSA achieved 56.03% accuracy, surpassing LoRA (42.30%), AdaLoRA (47.30%), Full FT (49.05%), and PiSSA (53.07%). On the MATH benchmark, SORSA achieved 10.36% accuracy, outperforming LoRA (5.50%), AdaLoRA (6.48%), Full FT (7.22%), and PiSSA (7.44%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance.
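A minimal PyTorch sketch of the adapter structure described in the abstract: the pre-trained weight is split by SVD into a trainable top-$r$ principal part and a frozen residual, with an orthonormal penalty on $U_p$ and $V_p$. The class name, the rank $r$, and the exact penalty form below are illustrative assumptions, not the authors' implementation.

```python
# Minimal SORSA-style adapter sketch, based only on the abstract above.
import torch
import torch.nn as nn

class SORSALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 16):
        super().__init__()
        W = linear.weight.data  # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Trainable principal part: W_p = U_p diag(S_p) V_p^T (top-r components).
        self.U_p = nn.Parameter(U[:, :r].clone())
        self.S_p = nn.Parameter(S[:r].clone())
        self.V_p = nn.Parameter(Vh[:r, :].T.clone())
        # Frozen residual part: W_r = U_r diag(S_r) V_r^T (remaining components).
        self.register_buffer("W_r", U[:, r:] @ torch.diag(S[r:]) @ Vh[r:, :])
        self.bias = linear.bias

    def forward(self, x):
        W_p = self.U_p @ torch.diag(self.S_p) @ self.V_p.T
        return nn.functional.linear(x, W_p + self.W_r, self.bias)

    def orthonormal_penalty(self):
        # One plausible regularizer keeping U_p and V_p close to orthonormal:
        # ||U_p^T U_p - I||_F^2 + ||V_p^T V_p - I||_F^2, added to the training loss.
        I = torch.eye(self.S_p.numel(), device=self.S_p.device)
        return ((self.U_p.T @ self.U_p - I) ** 2).sum() + \
               ((self.V_p.T @ self.V_p - I) ** 2).sum()
```

Since $W_p + W_r$ is an ordinary dense matrix, the adapter can be folded back into the base weight after training, which is why SORSA adds no inference latency.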
Related papers
- Singular Value Decomposition on Kronecker Adaptation for Large Language Model [0.8747606955991707]
Large pre-trained Transformer models achieve state-of-the-art results across diverse language and reasoning tasks. Full fine-tuning, however, incurs substantial storage, memory, and computational overhead. We propose SoKA, a novel PEFT strategy that combines Kronecker-product tensor factorization with SVD-driven initialization and dynamic rank selection.
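To illustrate why a Kronecker-product factorization is parameter-efficient (this is not SoKA's exact parameterization; the shapes and names below are assumptions), a weight update $\Delta W = A \otimes B$ can be held in two small factors:

```python
# Illustration only: a generic Kronecker-factored weight update Delta W = A kron B.
import torch
import torch.nn as nn

class KroneckerDelta(nn.Module):
    def __init__(self, out_features=1024, in_features=1024, a_rows=32, a_cols=32):
        super().__init__()
        # Two small factors whose Kronecker product matches the full weight shape.
        self.A = nn.Parameter(torch.zeros(a_rows, a_cols))
        self.B = nn.Parameter(0.01 * torch.randn(out_features // a_rows,
                                                 in_features // a_cols))

    def delta_w(self):
        # torch.kron expands the (32, 32) factors into a (1024, 1024) update.
        return torch.kron(self.A, self.B)
```

Here roughly 2K trainable parameters represent a full 1024x1024 update; SoKA adds SVD-driven initialization and dynamic rank selection on top of such a factorization.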
arXiv Detail & Related papers (2025-06-18T08:28:53Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [61.79405341803085]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - SpinSVAR: Estimating Structural Vector Autoregression Assuming Sparse Input [9.548703593014107]
We introduce SpinSVAR, a novel method for estimating a structural vector autoregression from time-series data under a sparse-input assumption.
We model the input as independent Laplacian variables, enforcing sparsity and yielding a maximum likelihood estimator (MLE) based on least absolute error regression.
When applied to S&P 500 data, it clusters stocks by sectors and identifies significant structural shocks linked to major price movements.
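A rough numpy sketch of the key observation above: with Laplacian (sparse) inputs, maximum likelihood for an autoregressive model reduces to least absolute error regression. The plain subgradient loop and VAR(1) setup are illustrative assumptions; SpinSVAR's actual estimator and structural-shock recovery are more involved.

```python
import numpy as np

def lad_var1(X, steps=2000, lr=1e-3):
    """Estimate A in X_t ~ A X_{t-1} by minimizing sum_t |X_t - A X_{t-1}|_1."""
    Y, Z = X[1:], X[:-1]                   # targets and lagged regressors
    A = np.zeros((X.shape[1], X.shape[1]))
    for _ in range(steps):
        R = Y - Z @ A.T                    # residuals, shape (T-1, d)
        A += lr * (np.sign(R).T @ Z)       # subgradient step on the L1 loss
    return A
```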
arXiv Detail & Related papers (2025-01-06T16:48:30Z) - LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for LLMs that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models [23.890454137522774]
We introduce Principal Singular values and Singular vectors Adaptation (PiSSA).
PiSSA shares the same architecture as LoRA, but initializes the adaptor matrices $A$ and $B$ with the principal components of the original matrix $W$.
Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance.
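A minimal sketch of the initialization described above, assuming the usual LoRA convention $W' = W_{\text{res}} + BA$; the square-root split of the singular values is a common choice and an assumption here, not necessarily the paper's exact recipe.

```python
import torch

def pissa_init(W: torch.Tensor, r: int):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = torch.diag(S[:r].sqrt())
    A = sqrt_S @ Vh[:r, :]        # (r, in_features), trainable
    B = U[:, :r] @ sqrt_S         # (out_features, r), trainable
    W_res = W - B @ A             # frozen residual, so W = W_res + B @ A at init
    return A, B, W_res
```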
arXiv Detail & Related papers (2024-04-03T15:06:43Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - A Novel Sparse Regularizer [0.0]
This paper introduces a regularizer based on minimizing a novel measure of entropy applied to the model during optimization.
It is differentiable, simple and fast to compute, scale-invariant, requires a trivial amount of additional memory, and can easily be parallelized.
arXiv Detail & Related papers (2023-01-18T03:17:36Z) - Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning [33.590006101071765]
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers.
We show that the adaptive version of distributed SGD can reach lower error values in less time compared to non-adaptive implementations.
arXiv Detail & Related papers (2022-08-04T10:57:25Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameter and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - A contextual analysis of multi-layer perceptron models in classifying hand-written digits and letters: limited resources [0.0]
We extensively test an end-to-end vanilla neural network (MLP) approach in pure numpy without any pre-processing or feature extraction done beforehand.
We show that basic data mining operations can significantly improve the performance of the models in terms of computational time.
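A tiny pure-numpy two-layer MLP of the kind the paper evaluates; the layer sizes, learning rate, and initialization below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.01, (128, 10)), np.zeros(10)

def forward(X):
    h = np.maximum(0, X @ W1 + b1)                      # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)          # softmax probabilities

def train_step(X, y_onehot, lr=0.1):
    global W1, b1, W2, b2
    h, p = forward(X)
    d_logits = (p - y_onehot) / X.shape[0]              # softmax cross-entropy grad
    dW2, db2 = h.T @ d_logits, d_logits.sum(0)
    dh = (d_logits @ W2.T) * (h > 0)                    # backprop through ReLU
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
```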
arXiv Detail & Related papers (2021-07-05T04:30:37Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It simultaneously can reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z) - VSAC: Efficient and Accurate Estimator for H and F [68.65610177368617]
VSAC is a RANSAC-type robust estimator with a number of novelties.
It is significantly faster than all its predecessors and runs on average in 1-2 ms, on a CPU.
It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry.
arXiv Detail & Related papers (2021-06-18T17:04:57Z) - Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z) - Computationally and Statistically Efficient Truncated Regression [36.3677715543994]
We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression.
Our estimator uses Projected Stochastic Gradient Descent (PSGD) without replacement on the negative log-likelihood of the truncated sample.
As a corollary, we show that SGD learns the parameters of single-layer neural networks with noisy activation functions.
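A hedged sketch of this idea in the simplest case, assuming a known lower truncation threshold $c$ (so $y$ is observed only when $y > c$), a known noise scale, and projection onto an L2 ball; the paper handles general truncation sets.

```python
import numpy as np
from scipy.stats import norm

def psgd_truncated(X, y, c, sigma=1.0, radius=10.0, lr=1e-3, epochs=20):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):   # one pass without replacement
            mu = X[i] @ w
            # Per-sample NLL: (y_i - mu)^2 / (2 sigma^2) + log Phi((mu - c) / sigma)
            z = (mu - c) / sigma
            grad_mu = -(y[i] - mu) / sigma**2 \
                      + norm.pdf(z) / (sigma * max(norm.cdf(z), 1e-12))
            w -= lr * grad_mu * X[i]
            n = np.linalg.norm(w)                 # projection step of PSGD
            if n > radius:
                w *= radius / n
    return w
```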
arXiv Detail & Related papers (2020-10-22T19:31:30Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
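A minimal PyTorch sketch of a Hutchinson-style Hessian-diagonal estimate, $D \approx \mathbb{E}[z \odot (Hz)]$ with Rademacher $z$, the kind of randomized adaptive curvature estimate ADAHESSIAN relies on; the optimizer's moving averages, spatial averaging, and bias correction are omitted, and the helper name is an assumption.

```python
import torch

def hessian_diag_estimate(loss, params, n_samples=1):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint(0, 2, p.shape, device=p.device).float() * 2 - 1
              for p in params]                         # Rademacher probe vectors
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs,
                                   retain_graph=True)  # Hessian-vector products Hz
        for d, z, hv in zip(diag, zs, hvps):
            d += z * hv / n_samples
    return diag
```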
arXiv Detail & Related papers (2020-06-01T05:00:51Z)