GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- URL: http://arxiv.org/abs/2403.03507v2
- Date: Sun, 2 Jun 2024 21:24:12 GMT
- Title: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- Authors: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian,
- Abstract summary: Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
- Score: 133.45193150403537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
Related papers
- Gradient Weight-normalized Low-rank Projection for Efficient LLM Training [16.00576040281808]
Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges.
To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning.
We introduceGradNormLoRP, a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning.
arXiv Detail & Related papers (2024-12-27T12:23:39Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient Scals have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [11.655821671462427]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance.
For LLaMA-1B, it reduces memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW.
With 8-bit quantization, COAP cuts memory by 81% and 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z) - Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning [1.3597551064547502]
GaLore allows full-supervised learning while being more memory-efficient.
This work introduces Natural GaLore, which efficiently applies the inverse Empirical Fisher Information Matrix to low-rank gradients.
arXiv Detail & Related papers (2024-10-21T14:05:06Z) - Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients [86.40635601953446]
We introduce Q-Galore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection.
We demonstrate that Q-Galore achieves highly competitive performance with exceptional memory efficiency.
arXiv Detail & Related papers (2024-07-11T08:42:58Z) - BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks [19.007090250576585]
BlockLLM is an approach inspired by block coordinate descent.
It achieves state-of-the-art performance in both finetuning and pretraining tasks.
arXiv Detail & Related papers (2024-06-25T05:45:12Z) - SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining [39.56934385513862]
Training large language models (LLMs) from scratch requires significant computational power and extensive memory capacity.
Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory.
We propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain.
arXiv Detail & Related papers (2024-06-04T11:14:21Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.