PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training
- URL: http://arxiv.org/abs/2505.18313v1
- Date: Fri, 23 May 2025 19:17:55 GMT
- Title: PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training
- Authors: Matan Haroush, Daniel Soudry
- Abstract summary: Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. We empirically demonstrate that PLUMAGE shrinks the gap to full-rank optimization in pre-training evaluation loss by 33% on average across models, and in average training loss across the GLUE benchmark by 28%, within a computational and memory footprint similar to GaLore's.
- Score: 21.695928776150808
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models (LLMs) with billions of parameters. Existing low-rank gradient estimators such as GaLore and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank-r subspace, enabling LLM training on consumer hardware. Yet these methods are either biased or subject to high estimator variance. Moreover, the optimizer state, whose first- and second-moment estimates are expressed in the previous subspace, becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop-in replacement for existing low-rank gradient estimators; it introduces no new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve the optimizer-state misalignment to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the gap to full-rank optimization in pre-training evaluation loss by 33% on average across models, and in average training loss across the GLUE benchmark by 28%, within a computational and memory footprint similar to GaLore's.
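To make the mechanism in the abstract concrete, the sketch below contrasts a deterministic top-r SVD projection of a weight gradient (the biased, GaLore-style flavor) with a simple importance-sampled variant whose expectation equals the full gradient. This is only a hedged illustration of what "unbiased low-rank gradient estimation" can mean: the function names and the sampling rule are assumptions for exposition, not the PLUMAGE estimator itself.

```python
# Illustrative sketch only: a generic rank-r gradient compressor in the spirit of
# GaLore-style projection, plus a simple importance-sampled variant that is
# unbiased in expectation. This is NOT the PLUMAGE scheme from the paper above;
# the names and the sampling rule are assumptions for illustration.
import torch

def topk_projection(grad: torch.Tensor, r: int) -> torch.Tensor:
    """Deterministic (biased) rank-r estimate: keep only the top-r singular directions."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

def sampled_projection(grad: torch.Tensor, r: int) -> torch.Tensor:
    """Unbiased rank-r estimate: sample r singular directions with probability
    proportional to their singular values, then rescale by the importance weights."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    probs = S / S.sum()                               # sampling distribution over directions
    idx = torch.multinomial(probs, r, replacement=True)
    scale = 1.0 / (r * probs[idx])                    # makes E[estimate] == grad
    return (U[:, idx] * (S[idx] * scale)) @ Vh[idx, :]

g = torch.randn(512, 256)
for estimate in (topk_projection(g, 8), sampled_projection(g, 8)):
    print(torch.linalg.matrix_rank(estimate).item())  # both are at most 8
```

The sampled variant is unbiased because each retained singular direction is rescaled by the inverse of its selection probability; per the abstract, PLUMAGE's contribution is a minimum-variance scheme of this general kind, together with keeping the optimizer's moment estimates aligned when the projection is refreshed.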
Related papers
- SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training [6.057289837472806]
Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. We propose SubTrack++, which leverages Grassmannian gradient subspace tracking combined with projection-aware optimizers. We demonstrate SOTA convergence by exploiting Grassmannian geometry and achieve the lowest evaluation loss.
arXiv Detail & Related papers (2025-02-03T18:13:52Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training (a hedged sketch of this modified law appears after this list).
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
- AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning [9.51289606759621]
Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA). We introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated gradient gradually decreases.
arXiv Detail & Related papers (2024-10-23T13:53:26Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zeroth-Order optimization to address the challenges posed by high-dimensional perturbations (a minimal ZO subspace sketch appears after this list).
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? [40.94505326255136]
Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models.
We propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal.
We show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
arXiv Detail & Related papers (2024-10-02T14:58:27Z)
- InRank: Incremental Low-Rank Learning [85.6380047359139]
Gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training.
Existing training algorithms do not exploit the low-rank property to improve computational efficiency.
We design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices.
arXiv Detail & Related papers (2023-06-20T03:03:04Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones (a textbook column-row sampling baseline is sketched after this list).
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Gradient Estimation for Binary Latent Variables via Gradient Variance Clipping [6.234350105794441]
Gradient estimation is often necessary for fitting generative models with discrete latent variables.
DisARM and other estimators have potentially exploding variance near the boundary of the parameter space.
We propose a new gradient estimator, bitflip-1, that has lower variance at the boundaries of the parameter space.
arXiv Detail & Related papers (2022-08-12T05:37:52Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- TRP: Trained Rank Pruning for Efficient Deep Neural Networks [69.06699632822514]
We propose Trained Rank Pruning (TRP), which alternates between low rank approximation and training.
A nuclear-norm regularization optimized by sub-gradient descent is utilized to further promote low rank in TRP.
The TRP-trained network inherently has a low-rank structure and is approximated with negligible performance loss.
arXiv Detail & Related papers (2020-04-30T03:37:36Z)
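For "The Journey Matters" entry above, the modified scaling law it describes can be written, as a hedged sketch, by taking the Chinchilla form and substituting the average parameter count over the pre-training run for the final parameter count; E, A, B, alpha, and beta are the usual fitted Chinchilla constants, and the paper's exact parameterization and fitted values are not reproduced here.

```latex
% Hedged sketch: Chinchilla-style loss law with the model size N replaced by the
% average parameter count \bar{N} over a pre-training run of duration T.
\[
  L(\bar{N}, D) \;\approx\; E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad
  \bar{N} \;=\; \frac{1}{T} \int_{0}^{T} N(t)\, dt .
\]
```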
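For the "Zeroth-Order Fine-Tuning of LLMs in Random Subspaces" entry, here is a minimal sketch of the generic idea: estimate the gradient from loss evaluations only, restricted to a random low-dimensional subspace so that memory stays small. The function name, the Gaussian basis, and the two-point central-difference estimator are illustrative assumptions, not that paper's algorithm.

```python
# Illustrative sketch only -- not the algorithm from the paper above.
import numpy as np

def zo_subspace_grad(loss_fn, theta, k=16, eps=1e-3, rng=None):
    """Two-point finite-difference gradient estimate inside a random k-dim subspace."""
    rng = rng or np.random.default_rng()
    d = theta.size
    basis = rng.standard_normal((d, k)) / np.sqrt(k)   # random subspace directions
    grad_est = np.zeros(d)
    for j in range(k):
        u = basis[:, j]
        # directional derivative from two loss evaluations only (no backpropagation)
        slope = (loss_fn(theta + eps * u) - loss_fn(theta - eps * u)) / (2 * eps)
        grad_est += slope * u
    return grad_est

# toy check on a quadratic loss, whose true gradient is 2 * w
loss = lambda w: float(np.sum(w ** 2))
w = np.ones(1_000)
print(zo_subspace_grad(loss, w, k=64)[:5])  # noisy estimate of 2 * w
```

Restricting the perturbations to k directions keeps the number of forward passes at 2k regardless of the parameter count d.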
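For the "Winner-Take-All Column Row Sampling" entry, the sketch below shows the textbook unbiased column-row sampling estimator of a matrix product that this family of methods builds on; the winner-take-all refinement from that paper is not reproduced, and the function name is an assumption.

```python
# Illustrative sketch only: classic unbiased column-row sampling for A @ B,
# not the WTA-CRS estimator itself.
import numpy as np

def sampled_matmul(A, B, k, rng=None):
    """Unbiased estimate of A @ B from k sampled column-row outer products."""
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = norms / norms.sum()                 # variance-reducing proposal distribution
    idx = rng.choice(A.shape[1], size=k, p=probs)
    scale = 1.0 / (k * probs[idx])              # importance weights keep the estimate unbiased
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 4096)), rng.standard_normal((4096, 64))
approx = sampled_matmul(A, B, k=512, rng=rng)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error shrinks as k grows
```

Sampling proportionally to the product of column and row norms is the classical minimum-variance choice for independent samples, which is why it serves as the baseline such estimators aim to improve upon.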