Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning
- URL: http://arxiv.org/abs/2410.16029v1
- Date: Mon, 21 Oct 2024 14:05:06 GMT
- Title: Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning
- Authors: Arijit Das
- Abstract summary: GaLore allows full-parameter learning while being more memory-efficient.
This work introduces Natural GaLore, which efficiently applies the inverse Empirical Fisher Information Matrix to low-rank gradients.
- Score: 1.3597551064547502
- License:
- Abstract: Training LLMs presents significant memory challenges due to the growing size of data, weights, and optimizer states. Techniques such as data and model parallelism, gradient checkpointing, and offloading strategies address this issue but are often infeasible due to hardware constraints. To mitigate memory usage, alternative methods like Parameter-Efficient Fine-Tuning (PEFT) and GaLore approximate weights or optimizer states. PEFT methods, such as LoRA, have gained popularity for fine-tuning LLMs, though they require a full-rank warm start. In contrast, GaLore allows full-parameter learning while being more memory-efficient. This work introduces Natural GaLore, a simple drop-in replacement for AdamW, which efficiently applies the inverse Empirical Fisher Information Matrix to low-rank gradients using Woodbury's Identity. We demonstrate that incorporating second-order information speeds up optimization significantly, especially when the iteration budget is limited. Empirical pretraining of 60M, 130M, 350M, and 1.1B parameter Llama models on C4 data demonstrates significantly lower perplexity than GaLore without additional memory overhead. By fine-tuning RoBERTa on the GLUE benchmark using Natural GaLore, we demonstrate a significant reduction in the gap to full fine-tuning, reaching 86.05% vs. 86.28%. Furthermore, fine-tuning the TinyLlama 1.1B model for function calling using the TinyAgent framework shows that Natural GaLore, achieving 83.09% accuracy on the TinyAgent dataset, significantly outperforms 16-bit LoRA at 80.06% and even surpasses GPT4-Turbo by 4%, all while using 30% less memory. All code to reproduce the results is available at: https://github.com/selfsupervised-ai/Natural-GaLore.git
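The core update described in the abstract can be written down compactly. The following is a sketch reconstructed from the abstract alone; the damping constant λ, the sample count n, and the matrix G collecting the last n low-rank gradients are assumed notation for illustration, not necessarily the paper's.

```latex
% Sketch of the natural-gradient step described in the abstract (notation assumed).
% \hat{g}_t : low-rank projected gradient at step t
% G = [\hat{g}_1, \dots, \hat{g}_n] : the last n low-rank gradient samples
% \lambda : damping constant, I : identity of matching size

% Empirical Fisher estimate in the low-rank space:
\hat{F} \;=\; \lambda I \;+\; \tfrac{1}{n}\, G G^{\top}

% Woodbury's Identity reduces the inverse to an n-by-n solve:
\hat{F}^{-1} \;=\; \tfrac{1}{\lambda}\Big( I \;-\; G\,\big(n\lambda I_n + G^{\top} G\big)^{-1} G^{\top} \Big)

% The preconditioned (natural) gradient handed to the AdamW-style update:
\tilde{g}_t \;=\; \hat{F}^{-1}\,\hat{g}_t
```

The point of the Woodbury form is that only an n × n system is solved, where n is the small number of stored gradient samples, so the inverse Fisher is never materialized in the (much larger) gradient dimension.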
Related papers
- LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for LLMs that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z)
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients [86.40635601953446]
We introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection.
We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency.
arXiv Detail & Related papers (2024-07-11T08:42:58Z)
- OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning [18.102930806071978]
Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) is a memory-efficient fine-tuning approach.
OwLore consistently outperforms baseline approaches, including full fine-tuning.
arXiv Detail & Related papers (2024-05-28T17:22:22Z)
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [31.088229461632206]
The massive memory consumption of large language models (LLMs) has become a significant roadblock to large-scale training.
Low-Rank Adaptation (LoRA) has been proposed to alleviate this problem.
We investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms.
We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA.
arXiv Detail & Related papers (2024-03-26T17:55:02Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy (a minimal sketch of the projection idea appears after this list).
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
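The GaLore entry above rests on a single mechanism: projecting the gradient of each weight matrix onto a low-rank subspace and running the optimizer there. Below is a minimal sketch of that idea, assuming a periodically refreshed SVD-based projection; the function name, hyperparameters, and state layout are illustrative and not the authors' implementation.

```python
import torch

def galore_like_step(W, grad, state, rank=4, update_freq=200,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step with GaLore-style gradient low-rank projection.

    Hypothetical sketch: the Adam-style moments live in an r x n space instead
    of the full m x n weight space, which is where the memory saving comes from.
    """
    m_dim, n_dim = grad.shape
    if "step" not in state:
        state["step"] = 0
        state["m"] = torch.zeros(rank, n_dim, device=grad.device, dtype=grad.dtype)
        state["v"] = torch.zeros(rank, n_dim, device=grad.device, dtype=grad.dtype)

    # Periodically refresh the projection matrix P (m x r) from the gradient's
    # top-r left singular vectors; between refreshes the same subspace is reused.
    if state["step"] % update_freq == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]
    P = state["P"]

    g_low = P.T @ grad                      # project: (r x m) @ (m x n) -> r x n

    # Adam-style moment updates, kept only in the low-rank space
    # (bias correction omitted for brevity).
    state["m"].mul_(beta1).add_(g_low, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)
    step_low = state["m"] / (state["v"].sqrt() + eps)

    # Project the update back to the full space and apply it in place.
    W.add_(P @ step_low, alpha=-lr)
    state["step"] += 1
```

With rank r much smaller than m, the per-matrix optimizer state shrinks from m x n to r x n, which is the memory reduction that GaLore (and the quantized variants quoted above) trades for a periodic SVD.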