AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
- URL: http://arxiv.org/abs/2601.11568v1
- Date: Sat, 27 Dec 2025 14:11:08 GMT
- Title: AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
- Authors: Quang-Hung Bui, Anh Son Ta
- Abstract summary: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. AdaFRUGAL introduces two dynamic controls: (i) a linear decay for $ρ$ to progressively reduce memory, and (ii) a loss-aware schedule for $T$ to lower computational overhead. It maintains competitive performance against AdamW and static FRUGAL while significantly reducing both GPU memory and training time.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gradient splitting, but its static hyperparameters -- the subspace ratio ($ρ$) and update frequency ($T$) -- require costly manual tuning, limiting adaptability. We present AdaFRUGAL, which automates this process by introducing two dynamic controls: (i) a linear decay for $ρ$ to progressively reduce memory, and (ii) a loss-aware schedule for $T$ to lower computational overhead. Experiments across large-scale pre-training (English C4, Vietnamese VietVault) and fine-tuning (GLUE) demonstrate that AdaFRUGAL achieves a compelling trade-off. It maintains competitive performance against AdamW and static FRUGAL while significantly reducing both GPU memory and training time, offering a more practical, autonomous solution for resource-constrained LLM training.
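The two controls described in the abstract lend themselves to a compact illustration. The sketch below is a minimal Python rendering of what such schedules could look like, assuming a linear decay of $ρ$ between two endpoints and a patience-based plateau test for enlarging $T$; the function names, endpoints, and thresholds are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of AdaFRUGAL-style dynamic controls (illustrative only).
# Assumptions: the subspace ratio rho decays linearly from rho_0 to rho_min
# over training, and the subspace-update period T is enlarged when the loss
# plateaus. Endpoints, thresholds, and names are not taken from the paper.

def rho_schedule(step, total_steps, rho_0=0.5, rho_min=0.05):
    """Linear decay of rho: the state-full subspace (and its optimizer
    memory) shrinks as training progresses."""
    frac = min(step / max(total_steps, 1), 1.0)
    return rho_0 + (rho_min - rho_0) * frac

def next_update_period(T, recent_losses, patience=3, tol=1e-3, T_max=1000):
    """Loss-aware schedule for T: when recent losses stop improving,
    re-select the subspace less often to cut selection overhead."""
    if len(recent_losses) > patience:
        best_before = min(recent_losses[:-patience])
        best_recent = min(recent_losses[-patience:])
        if best_before - best_recent < tol:   # plateau: improvement below tol
            return min(2 * T, T_max)          # stretch the period, capped
    return T
```

In a FRUGAL-style training loop, `rho_schedule` would size the state-full subspace selected every `T` steps, and `next_update_period` would be consulted at each re-selection to decide how long to keep the current subspace.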
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. Existing methods suffer from computational inefficiency and objective mismatches between training and inference. We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- DAF: An Efficient End-to-End Dynamic Activation Framework for on-Device DNN Training [41.09085549544767]
We introduce a Dynamic Activation Framework (DAF) that enables scalable and efficient on-device training through system-level optimizations. DAF achieves both memory- and time-efficient dynamic quantization training by addressing key system bottlenecks. Evaluations on various deep learning models across embedded and mobile platforms demonstrate up to a $22.9\times$ reduction in memory usage and a $3.2\times$ speedup.
arXiv Detail & Related papers (2025-07-09T08:59:30Z)
- Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models [50.260693393896716]
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but constrained by high computational costs. We propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our work pioneers a resource-efficient paradigm for searching for high-speed DMs without sacrificing quality.
arXiv Detail & Related papers (2025-06-03T06:02:50Z)
- MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation [24.943207005554246]
We propose a memory-efficient training paradigm called Momentum Low-rank Compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption.
arXiv Detail & Related papers (2025-06-02T17:21:10Z)
- CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [19.447967755388092]
We propose CoLA and its memory-efficient implementation, CoLA-M, to replace full-size layers with compute-efficient auto-encoders. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency.
arXiv Detail & Related papers (2025-02-16T01:05:16Z)
- Forget Forgetting: Continual Learning in a World of Abundant Memory [55.64184779530581]
Continual learning has traditionally focused on minimizing exemplar memory. This paper challenges this paradigm by investigating a more realistic regime. We find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones.
arXiv Detail & Related papers (2025-02-11T05:40:52Z)
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training [51.39495282347475]
We introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. A minimal, assumed sketch of the underlying gradient-splitting idea is given after this list.
arXiv Detail & Related papers (2024-11-12T14:41:07Z)
- EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
EoRA is a fine-tuning-free method that augments compressed Large Language Models with low-rank matrices. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
- Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called $\textbf{ALaST}$ ($\textit{Adaptive Layer Selection Fine-Tuning for Vision Transformers}$).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z)
- CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization [9.826264204082095]
Training large AI models such as LLMs and DLRMs costs massive GPU resources and computing time. CoMERA achieves rank-adaptive tensor-compressed (pre-)training via a multi-objective optimization formulation. CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore.
arXiv Detail & Related papers (2024-05-23T09:52:15Z)
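For context on the base framework referenced throughout, the sketch below illustrates the gradient-splitting idea behind FRUGAL in its simplest form: a fraction $ρ$ of the parameter columns receives a state-full, Adam-like update, while the complement receives a stateless SGD update, so optimizer memory is kept only for the selected subspace. The column-wise split, the plain-SGD choice for the complement, the omitted bias correction, and all names are simplifying assumptions for illustration, not the reference implementation.

```python
import torch

# Toy illustration of FRUGAL-style gradient splitting (assumed form, not the
# reference code). Columns listed in `cols` form the state-full subspace and
# receive an Adam-like update; every other column receives a stateless SGD
# step. In the actual framework `cols` would be re-drawn every T steps and
# the optimizer state reset at that point.

def split_update(param, grad, cols, state, lr=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-like update with per-coordinate state kept only for `cols`.
    g = grad[:, cols]
    if "m" not in state:
        state["m"] = torch.zeros_like(g)
        state["v"] = torch.zeros_like(g)
    m, v = state["m"], state["v"]
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    param[:, cols] -= lr * m / (v.sqrt() + eps)   # bias correction omitted

    # Stateless SGD on the complement: no optimizer memory is kept here.
    mask = torch.ones(param.shape[1], dtype=torch.bool)
    mask[cols] = False
    param[:, mask] -= lr * grad[:, mask]


# Example: with rho = 0.25, a quarter of the 128 columns carry Adam state.
p, g = torch.randn(64, 128), torch.randn(64, 128)
cols = torch.randperm(128)[:32]
split_update(p, g, cols, state={})
```

Under this reading, AdaFRUGAL's schedules would drive how the selected subspace shrinks over training (via the decaying $ρ$) and how often it is re-drawn (via the loss-aware $T$).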