Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
- URL: http://arxiv.org/abs/2510.22467v1
- Date: Sun, 26 Oct 2025 00:50:12 GMT
- Title: Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints
- Authors: Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang,
- Abstract summary: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive.<n>We introduce GradLite, a backward-friendly solution that relaxes the requirement of exact gradients.<n>We show that GradLite maintains unbiased estimates with bounded variance, ensuring convergence rates comparable to Adam.
- Score: 14.20716202034732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either alter the model architecture (e.g., reversible networks) or trade memory for computation (e.g., activation checkpointing), but the optimizer itself remains untouched. In this work, we introduce GradLite, a backward-friendly optimizer that relaxes the requirement of exact gradients, enabling efficient training even when intermediate activations are aggressively discarded or approximated. GradLite leverages two key techniques: (i) low-rank Jacobian approximation, which reduces the dimensionality of backpropagated error signals, and (ii) error-feedback correction, which accumulates and compensates approximation errors across iterations to preserve convergence guarantees. We provide a theoretical analysis showing that GradLite maintains unbiased gradient estimates with bounded variance, ensuring convergence rates comparable to Adam. Empirically, GradLite reduces optimizer-state and activation memory consumption by up to 50\% without architectural changes, and achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K), multilingual, and dialogue benchmarks compared to checkpointing and optimizer-centric baselines (LoMo, GaLore).
Related papers
- Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow.<n>We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z) - FOAM: Blocked State Folding for Memory-Efficient LLM Training [41.8909496809588]
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data.<n>However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensives like Adam.<n>In this paper, we propose Folded with Approximate Moment (FOAM), a method that compresses states by computing block-wise means and incorporates a gradient correction to recover lost information.
arXiv Detail & Related papers (2025-12-08T02:48:27Z) - Low-rank Momentum Factorization for Memory Efficient Training [13.464518325870444]
Momentum Factorized (MoFaSGD) maintains a dynamically updated low-rank SVD representation of the first-order momentum.<n>We demonstrate MoFaSGD's effectiveness on large language model benchmarks, achieving a competitive trade-off between memory reduction (e.g. LoRA) and performance.
arXiv Detail & Related papers (2025-07-10T18:04:52Z) - Revisiting the Initial Steps in Adaptive Gradient Descent Optimization [6.468625143772815]
Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks.<n>These methods often suffer from suboptimal generalization compared to descent gradient (SGD) and exhibit instability.<n>We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values.
arXiv Detail & Related papers (2024-12-03T04:28:14Z) - COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [17.54863041098623]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance.<n>For LLaMA-1B, it reduces memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW.<n>With 8-bit quantization, COAP cuts memory by 81% and 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase.<n>Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.<n>In this paper, we propose Subspace Zero-order optimization to address the challenges posed by posed by high dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture method for initializing neural networks.
It is based on a simple agnostic; the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.