Related papers: DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models

URL: http://arxiv.org/abs/2412.20891v1
Date: Mon, 30 Dec 2024 12:00:47 GMT
Title: DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models
Authors: Xiaolin Hu, Xiang Cheng, Peiyu Liu, Wei Liu, Jian Luan, Bin Wang, Yong Liu
Abstract summary: Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. We propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights. We also introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
Score: 33.4538652558253
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. However, low-rank approximation in two-dimensional space fails to capture high-dimensional structures within the target matrix. Recently, tensor decomposition methods have been explored for fine-tuning LLMs, leveraging their ability to extract structured information. Yet, these approaches primarily rely on random initialization, and the impact of initialization on tensor adaptation remains underexplored. In this paper, we reveal that random initialization significantly diverges from the validation loss achieved by full fine-tuning. To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization in fine-tuning LLMs. Additionally, we introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms random initialization methods with fewer parameters. QDoTA further reduces memory consumption and achieves comparable performance to DoTA on commonsense reasoning tasks. We will release our code to support future research.
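
The MPO decomposition mentioned in the abstract can be illustrated with a minimal NumPy sketch that factorizes a weight matrix into a chain of four-index cores by sequential SVD. The function names `mpo_decompose` and `mpo_reconstruct`, the choice of dimension factorizations, and the `max_rank` truncation are assumptions made for illustration. The abstract does not say how DoTA selects bond ranks, which cores it trains, or how it treats the truncation residual, so this is a generic sketch rather than the authors' implementation.

```python
import numpy as np


def mpo_decompose(W, in_dims, out_dims, max_rank=None):
    """Factorize W (prod(in_dims) x prod(out_dims)) into MPO cores.

    Core k has shape (r_{k-1}, in_dims[k], out_dims[k], r_k), with r_0 = r_n = 1.
    max_rank (hypothetical parameter) caps every bond dimension r_k.
    """
    n = len(in_dims)
    assert len(out_dims) == n
    assert W.shape == (int(np.prod(in_dims)), int(np.prod(out_dims)))

    # Reshape to (i1..in, j1..jn), then interleave to (i1, j1, ..., in, jn).
    T = W.reshape(*in_dims, *out_dims)
    T = T.transpose([ax for k in range(n) for ax in (k, n + k)])

    cores = []
    r_prev = 1
    rest = T.reshape(r_prev * in_dims[0] * out_dims[0], -1)
    for k in range(n - 1):
        # Split off the k-th site with an SVD and keep at most max_rank terms.
        U, S, Vt = np.linalg.svd(rest, full_matrices=False)
        r = len(S) if max_rank is None else min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, in_dims[k], out_dims[k], r))
        # Carry the singular values into the remainder of the chain.
        rest = S[:r, None] * Vt[:r]
        r_prev = r
        rest = rest.reshape(r_prev * in_dims[k + 1] * out_dims[k + 1], -1)
    cores.append(rest.reshape(r_prev, in_dims[-1], out_dims[-1], 1))
    return cores


def mpo_reconstruct(cores):
    """Contract the cores back into a dense matrix."""
    n = len(cores)
    T = cores[0]
    for core in cores[1:]:
        # Contract the trailing bond of T with the leading bond of the next core.
        T = np.tensordot(T, core, axes=1)
    T = T.reshape(T.shape[1:-1])  # drop the two boundary bonds of size 1
    in_dims, out_dims = T.shape[0::2], T.shape[1::2]
    # Undo the interleaving: (i1, j1, ..., in, jn) -> (i1..in, j1..jn).
    T = T.transpose(list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2)))
    return T.reshape(int(np.prod(in_dims)), int(np.prod(out_dims)))


# Toy stand-in for a pre-trained weight matrix (random values, not real weights).
rng = np.random.default_rng(0)
W_demo = rng.standard_normal((16, 16))
for rank in (2, 8, None):
    demo_cores = mpo_decompose(W_demo, (4, 4), (4, 4), max_rank=rank)
    err = np.linalg.norm(W_demo - mpo_reconstruct(demo_cores))
    n_params = sum(c.size for c in demo_cores)
    print(f"max_rank={rank}: params={n_params}, reconstruction error={err:.3e}")
```

In this sketch the cores come from the weight matrix itself, in contrast to the random initialization that the abstract argues against; with no rank cap the contraction reproduces the matrix up to floating-point error, and a smaller `max_rank` trades accuracy for fewer parameters.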

Related papers

Stabilizing Native Low-Rank LLM Pretraining [24.2079184778031]
Low-rank factorization offers a promising route to reduce training and inference costs. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights. Our method enables stable, end-to-end factorized training with negligible overhead.
arXiv Detail & Related papers (2026-02-12T21:33:14Z)
$α$-LoRA: Effective Fine-Tuning via Base Model Rescaling [41.58663029548425]
We introduce a new class of reparameterization methods for transfer learning, designed to enhance the generalization ability of fine-tuned models. We establish the effectiveness of our approach in a high-dimensional binary classification setting using tools from Random Matrix Theory, and further validate our theoretical findings through more realistic experiments.
arXiv Detail & Related papers (2025-10-24T11:19:33Z)
Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks [5.363441578662801]
Improper weight training of ReLU networks can cause neuron inactivation (dying ReLU) and exacerbate instability as network depth increases. We introduce an optimization problem on the Stiefel manifold, thereby preserving scale and calibrating the pre-activation statistics. We show that the method prevents the dying ReLU problem, slows the decay of activation variance, and mitigates gradient vanishing, which together stabilize signal and gradient flow in deep architectures.
arXiv Detail & Related papers (2025-08-30T05:17:31Z)
ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints [64.35580479051208]
In previous works, low-rank adapters (LoRA) are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method.
arXiv Detail & Related papers (2025-07-09T23:52:31Z)
It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs [15.263422862969803]
We introduce BackSlash, a training-time compression algorithm for large language models. We propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile; RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-priord BackSlash training.
arXiv Detail & Related papers (2025-05-31T09:49:17Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
Sparser Training for On-Device Recommendation Systems [50.74019319100728]
We propose SparseRec, a lightweight embedding method based on Dynamic Sparse Training (DST). It avoids dense gradients during backpropagation by sampling a subset of important vectors.
arXiv Detail & Related papers (2024-11-19T03:48:48Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation [13.585425242072173]
The most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA). We propose to improve LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition (SVD) on minibatches of activations. We call our new method Explained Variance Adaptation (EVA).
arXiv Detail & Related papers (2024-10-09T17:59:06Z)
LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method. We propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation. Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. We identify and characterise the important components needed for effective model convergence using gradient descent. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models [1.401463252785724]
Low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale. We formalize the low-rank decomposition design space and show that the decomposition design space is enormous. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops.
arXiv Detail & Related papers (2024-05-10T17:40:02Z)
Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition [15.254107731735553]
Deep Neural Networks (DNNs) have been a large driver for AI breakthroughs in recent years. They have been getting increasingly large as they become more accurate and safe. This means that their training becomes increasingly costly and time-consuming. We propose Maestro, a framework for trainable low-rank layers.
arXiv Detail & Related papers (2023-08-28T23:08:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.