Related papers: GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

URL: http://arxiv.org/abs/2504.20437v1
Date: Tue, 29 Apr 2025 05:27:02 GMT
Title: GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
Authors: DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao
Abstract summary: GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. We present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements.
Score: 31.277462922203302
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
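The projection idea summarized in the abstract can be illustrated with a minimal NumPy sketch. It assumes the basic scheme from the original GaLore paper: project the gradient onto its top singular directions, keep the optimizer state in that small subspace, and refresh the subspace with an SVD every few steps. It is not the GaLore 2 implementation. The names `top_r_projector`, `galore_like_step`, `init_state`, and `update_gap` are hypothetical, plain momentum stands in for Adam, and resetting the momentum at each refresh is a simplification of this sketch.

```python
import numpy as np


def top_r_projector(grad, rank):
    """Return the top-`rank` left singular vectors of `grad` (shape m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def init_state(shape, rank):
    """Optimizer state lives in the low-rank space: (rank x n) instead of (m x n)."""
    return {"step": 0, "proj": None, "momentum": np.zeros((rank, shape[1]))}


def galore_like_step(weight, grad, state, rank=4, lr=0.1, beta=0.9, update_gap=10):
    """One momentum step with the state kept in a rank-`rank` subspace (hypothetical helper)."""
    if state["step"] % update_gap == 0:
        # Periodic subspace refresh; this SVD is the overhead the abstract mentions.
        state["proj"] = top_r_projector(grad, rank)
        # Simplification of this sketch: drop the momentum expressed in the old basis.
        state["momentum"] = np.zeros_like(state["momentum"])
    proj = state["proj"]
    low_rank_grad = proj.T @ grad  # shape (rank x n)
    state["momentum"] = beta * state["momentum"] + (1 - beta) * low_rank_grad
    state["step"] += 1
    # Project the low-rank update back to the full (m x n) weight shape.
    return weight - lr * (proj @ state["momentum"])


# Toy problem: fit a rank-2 target under the loss 0.5 * ||W - target||^2.
rng = np.random.default_rng(0)
target = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 16))
weight = np.zeros_like(target)
state = init_state(weight.shape, rank=4)
initial_loss = 0.5 * np.sum((weight - target) ** 2)
for _ in range(200):
    grad = weight - target  # gradient of the toy loss
    weight = galore_like_step(weight, grad, state, rank=4)
final_loss = 0.5 * np.sum((weight - target) ** 2)
```

In this toy setting the momentum buffer has shape 4 x 16 rather than 32 x 16, which is where the memory saving would come from at scale. The toy gradient is exactly low-rank, so nothing is lost by projecting; real gradients are only approximately low-rank.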

Related papers

Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity. LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning. LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
The Curse of Depth in Large Language Models [28.37870372690079]
In large language models, nearly half of the layers are less effective than expected. LayerNorm Scaling (LNS) scales the variance of the layer normalization output inversely by the square root of the layer's depth. LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance.
arXiv Detail & Related papers (2025-02-09T07:03:36Z)
GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection [17.33732087380253]
We propose GaLore$+$, which uses cross-head low-rank projection to reduce the substantial time consumption in estimating low-rank projections for multi-head attention. Our experiments demonstrate that GaLore$+$ delivers superior performance while achieving approximately $4\times$ fine-tuning speed compared to vanilla GaLore.
arXiv Detail & Related papers (2024-12-15T12:28:13Z)
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients [86.40635601953446]
We introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency.
arXiv Detail & Related papers (2024-07-11T08:42:58Z)
BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks [19.007090250576585]
BlockLLM is an approach inspired by block coordinate descent. It achieves state-of-the-art performance in both finetuning and pretraining tasks.
arXiv Detail & Related papers (2024-06-25T05:45:12Z)
OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning [18.102930806071978]
Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) is a memory-efficient fine-tuning approach. OwLore consistently outperforms baseline approaches, including full fine-tuning.
arXiv Detail & Related papers (2024-05-28T17:22:22Z)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states. In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers [9.549646359252346]
Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. The sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. We show that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks.
arXiv Detail & Related papers (2024-02-18T20:47:10Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach for large language models (LLMs). Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.