FOAM: Blocked State Folding for Memory-Efficient LLM Training
- URL: http://arxiv.org/abs/2512.07112v1
- Date: Mon, 08 Dec 2025 02:48:27 GMT
- Title: FOAM: Blocked State Folding for Memory-Efficient LLM Training
- Authors: Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun
- Abstract summary: Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information.
- Score: 41.8909496809588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
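The abstract describes the mechanism only at a high level. Below is a minimal, illustrative PyTorch sketch of one possible reading: Adam-style moments are "folded" to one value per block of parameters, and the block residual of the gradient is added back as a correction. The function name, block size, which states are folded, and the exact form of the residual correction are all assumptions, not FOAM's actual algorithm.

```python
import torch

def foam_like_adam_step(param, grad, m_block, v_block, step, block_size=128,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update with block-folded moments (illustrative sketch).

    m_block and v_block hold one value per block of block_size contiguous
    parameters, so optimizer-state memory shrinks by roughly block_size x.
    Assumes param is contiguous and param.numel() % block_size == 0.
    """
    g = grad.reshape(-1, block_size)            # (num_blocks, block_size)
    g_mean = g.mean(dim=1)                      # block-wise gradient means
    residual = g - g_mean.unsqueeze(1)          # information lost by folding

    # Moments are tracked only at block granularity (the folded states).
    m_block.mul_(beta1).add_(g_mean, alpha=1 - beta1)
    v_block.mul_(beta2).addcmul_(g_mean, g_mean, value=1 - beta2)

    m_hat = m_block / (1 - beta1 ** step)
    v_hat = v_block / (1 - beta2 ** step)

    # Broadcast the folded Adam direction back to per-parameter shape and add
    # the residual as a correction term (the form of the correction is assumed).
    update = (m_hat / (v_hat.sqrt() + eps)).unsqueeze(1) + residual
    param.data.view(-1, block_size).add_(update, alpha=-lr)
```

With block_size=128 the two moment tensors occupy about 1/128 of the usual Adam state, which is the order of optimizer-state reduction the abstract reports, although the paper's actual block sizes and folding scheme may differ.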
Related papers
- Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints [14.20716202034732]
Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive. We introduce GradLite, a backward-friendly solution that relaxes the requirement of exact gradients. We show that GradLite maintains unbiased estimates with bounded variance, ensuring convergence rates comparable to Adam.
arXiv Detail & Related papers (2025-10-26T00:50:12Z)
- COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [77.79640601822341]
Large Language Models (LLMs) have demonstrated remarkable success across various domains. Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z)
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. However, the high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
- Wavelet Meets Adam: Compressing Gradients for Memory-Efficient Training [45.225732322141994]
Large language models (LLMs) have impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition, projection, or weight freezing. We propose a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements.
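As a rough illustration of what "applying a wavelet transform to gradients" can look like, the sketch below folds each gradient row with a one-level Haar transform and keeps only the approximation coefficients, halving the tensor the optimizer states would track. The choice of wavelet, number of levels, and which coefficients are kept are assumptions, not necessarily what GWT actually does.

```python
import torch

def haar_fold(grad):
    # One-level Haar transform along the last dim, keeping only the
    # approximation (low-frequency) coefficients: half the elements.
    # Assumes the last dimension is even.
    g = grad.reshape(*grad.shape[:-1], grad.shape[-1] // 2, 2)
    return (g[..., 0] + g[..., 1]) / 2 ** 0.5

def haar_unfold(coeffs, shape):
    # Inverse transform with the discarded detail coefficients set to zero.
    up = (coeffs / 2 ** 0.5).unsqueeze(-1).expand(*coeffs.shape, 2)
    return up.reshape(shape)

# Adam's moments would then live in the compressed domain, e.g.:
#   c = haar_fold(grad)                    # half-sized gradient
#   m = beta1 * m + (1 - beta1) * c        # half-sized moments
#   v = beta2 * v + (1 - beta2) * c * c
#   update = haar_unfold(m / (v.sqrt() + eps), grad.shape)
```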
arXiv Detail & Related papers (2025-01-13T11:35:09Z)
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. However, they face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
- COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [17.54863041098623]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance. For LLaMA-1B, it reduces memory by 61% with only a 2% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
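For context, GaLore's core idea can be sketched as follows: periodically take the leading singular vectors of the gradient as a projection, keep Adam's moments on the small projected gradient, and project the resulting update back to the full shape. The rank, refresh cadence, and projection side below are illustrative assumptions rather than the paper's exact settings.

```python
import torch

def galore_project(grad, P=None, rank=64, refresh=False):
    """Project a 2-D gradient into a rank-`rank` subspace (GaLore-style sketch).

    P holds the top-`rank` left singular vectors of a recent gradient and is
    refreshed only occasionally; Adam's moments are then kept on the small
    projected gradient (rank x cols instead of rows x cols).
    """
    if P is None or refresh:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                 # (rows, rank) projection basis
    low_rank_grad = P.T @ grad          # (rank, cols), fed to the optimizer
    return P, low_rank_grad

def galore_project_back(P, low_rank_update):
    # Map the optimizer's low-rank update back to the full parameter shape.
    return P @ low_rank_update          # (rows, cols)
```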
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- CAME: Confidence-guided Adaptive Memory Efficient Optimization [20.009302737137787]
Adaptive gradient methods have demonstrated excellent performance in the training of large language models.
However, maintaining second-moment estimates entails a high cost of extra memory overhead.
Several memory-efficient optimizers have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
arXiv Detail & Related papers (2023-07-05T06:05:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.