Gradient Weight-normalized Low-rank Projection for Efficient LLM Training
- URL: http://arxiv.org/abs/2412.19616v2
- Date: Sun, 05 Jan 2025 07:12:27 GMT
- Title: Gradient Weight-normalized Low-rank Projection for Efficient LLM Training
- Authors: Jia-Hong Huang, Yixian Shen, Hongyi Zhu, Stevan Rudinac, Evangelos Kanoulas,
- Abstract summary: Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges.
To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning.
We introduceGradNormLoRP, a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning.
- Score: 16.00576040281808
- License:
- Abstract: Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full fine-tuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning. Source code: https://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Train ing
Related papers
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptations and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape [52.98187034726091]
Low-Rank Adaptation (LoRA) is an efficient way to fine-tune models by optimizing only a low-rank matrix.
A solution that appears flat in the LoRA space may exist sharp directions in the full parameter space, potentially harming generalization performance.
We propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space.
arXiv Detail & Related papers (2024-09-22T11:24:10Z) - SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining [39.56934385513862]
Training large language models (LLMs) from scratch requires significant computational power and extensive memory capacity.
Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory.
We propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain.
arXiv Detail & Related papers (2024-06-04T11:14:21Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - Flora: Low-Rank Adapters Are Secretly Gradient Compressors [30.224822087562163]
Low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters.
LoRA restricts overall weight update matrices to be low-rank, limiting the model performance.
We propose Flora, which is able to achieve high-rank updates by resampling the projection matrices.
arXiv Detail & Related papers (2024-02-05T18:50:39Z) - PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]
We introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
arXiv Detail & Related papers (2024-01-20T20:25:17Z) - Hyperparameter Optimization for Large Language Model Instruction-Tuning [6.743825167463901]
We study the whole pipeline of performing fine-tuning and validation on a pre-trained LLM as a blackbox.
We efficiently explore the space of hyper parameters with the nomad algorithm, achieving a boost in performance and human alignment of the tuned model.
arXiv Detail & Related papers (2023-12-01T22:03:12Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs)
Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.