Related papers: On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

URL: http://arxiv.org/abs/2511.11362v1
Date: Fri, 14 Nov 2025 14:46:29 GMT
Title: On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
Authors: Prabodh Katti, Sangwoo Park, Bipin Rajendran, Osvaldo Simeone,
Abstract summary: Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck.<n>This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training.<n>We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints.
Score: 27.237134457089194
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.

Related papers

Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning [10.913120072779193]
On-device fine-tuning enables privacy-preserving personalization of large language models.<n>Mobile devices impose severe memory constraints, typically 6--12GB shared across all workloads.<n>We propose Memory-efficient Structured Backpropagation (MeSP)<n>MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible.
arXiv Detail & Related papers (2026-02-13T16:24:33Z)
FOAM: Blocked State Folding for Memory-Efficient LLM Training [41.8909496809588]
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data.<n>However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensives like Adam.<n>In this paper, we propose Folded with Approximate Moment (FOAM), a method that compresses states by computing block-wise means and incorporates a gradient correction to recover lost information.
arXiv Detail & Related papers (2025-12-08T02:48:27Z)
The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference.<n>Key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden.<n>Our approach, textscCompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
Low-rank Momentum Factorization for Memory Efficient Training [13.464518325870444]
Momentum Factorized (MoFaSGD) maintains a dynamically updated low-rank SVD representation of the first-order momentum.<n>We demonstrate MoFaSGD's effectiveness on large language model benchmarks, achieving a competitive trade-off between memory reduction (e.g. LoRA) and performance.
arXiv Detail & Related papers (2025-07-10T18:04:52Z)
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Statefuls (e.g., Adam) maintain auxiliary information even 2x the model size in order to achieve optimal convergence.<n>SOLO enables Adam-styles to maintain quantized states with precision as low as 3 bits, or even 2 bits.<n>SOLO can thus be seamlessly applied to Adam-styles, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.<n>High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.<n>We propose sparse compression gradient (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines [28.18421624702502]
We introduce MobiZO, a resource-efficient fine-tuning framework for Large Language Models (LLMs) specifically designed for edge devices.<n>We show that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.<n> Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.<n>Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph.<n>We propose the Adaptive Zeroth-order-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z)
Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model. Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices. Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data. We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.