Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices
- URL: http://arxiv.org/abs/2510.03425v1
- Date: Fri, 03 Oct 2025 18:36:21 GMT
- Title: Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices
- Authors: Congzheng Song, Xinyu Tang
- Abstract summary: Fine-tuning large language models (LLMs) with backpropagation can be much more memory-consuming than inference. We propose a memory-efficient implementation of backpropagation (MeBP) on mobile devices that provides a better trade-off between memory usage and compute time.
- Score: 5.747073544547447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning large language models (LLMs) with backpropagation, even for a subset of parameters such as LoRA, can be much more memory-consuming than inference and is often deemed impractical for resource-constrained mobile devices. Alternative methods, such as zeroth-order optimization (ZO), can greatly reduce the memory footprint but come at the cost of significantly slower model convergence (10$\times$ to 100$\times$ more steps than backpropagation). We propose a memory-efficient implementation of backpropagation (MeBP) on mobile devices that provides a better trade-off between memory usage and compute time, while converging faster and achieving better performance than the ZO baseline. We verify the effectiveness of MeBP on an iPhone 15 Pro Max and show that various LLMs, ranging from 0.5B to 4B parameters, can be fine-tuned using less than 1GB of memory. We release an example of the MeBP implementation at https://github.com/apple/ml-mebp.
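The abstract describes trading compute time for memory but does not spell out the mechanism. A standard way to achieve that trade-off is activation recomputation (gradient checkpointing): the forward pass discards intermediate activations, and the backward pass rebuilds them from a saved checkpoint so that only one layer's activations are live at a time. The toy ReLU stack below is a hedged sketch of that general idea, not the paper's MeBP implementation; all layer shapes and functions are illustrative.

```python
import numpy as np

def forward_layer(x, W):
    """Toy layer: linear transform followed by ReLU."""
    return np.maximum(x @ W, 0.0)

def checkpointed_backward(x0, Ws, grad_out):
    """Backprop through a layer stack, recomputing activations on demand.

    Only the checkpoint x0 is stored; the input to each layer is rebuilt
    during the backward pass, trading extra forward compute for memory.
    """
    grads = [None] * len(Ws)
    g = grad_out
    for i in reversed(range(len(Ws))):
        # Recompute the input to layer i from the saved checkpoint x0.
        a = x0
        for j in range(i):
            a = forward_layer(a, Ws[j])
        z = a @ Ws[i]       # pre-activation of layer i
        g = g * (z > 0)     # ReLU gradient
        grads[i] = a.T @ g  # weight gradient for layer i
        g = g @ Ws[i].T     # propagate to the previous layer
    return grads
```

The recomputation loop makes the backward pass O(L^2) in forward work for L layers; practical schemes checkpoint every few layers to balance the two costs.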
Related papers
- Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning [10.913120072779193]
On-device fine-tuning enables privacy-preserving personalization of large language models. Mobile devices impose severe memory constraints, typically 6-12GB shared across all workloads. We propose Memory-efficient Structured Backpropagation (MeSP). MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible.
arXiv Detail & Related papers (2026-02-13T16:24:33Z) - On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization [27.237134457089194]
Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints.
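The MeZO-style estimator referenced by this entry needs only forward passes: the gradient is estimated from two loss evaluations along a shared random direction, so no activations are stored for backpropagation. A hedged sketch of that SPSA-style update on a generic loss follows; the `loss_fn` interface, step sizes, and per-step seeding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mezo_step(params, loss_fn, lr=1e-3, eps=1e-3, seed=0):
    """One zeroth-order update: two forward passes, no backprop state."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=params.shape)   # shared perturbation direction
    loss_plus = loss_fn(params + eps * z)    # forward pass 1
    loss_minus = loss_fn(params - eps * z)   # forward pass 2
    # Finite-difference estimate of the gradient projected onto z.
    g_hat = (loss_plus - loss_minus) / (2 * eps)
    return params - lr * g_hat * z           # SPSA-style update
```

Because the perturbation can be regenerated from its seed, a real in-place implementation never materializes `z` alongside a second copy of the parameters; the memory footprint stays at inference level.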
arXiv Detail & Related papers (2025-11-14T14:46:29Z) - MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation [24.943207005554246]
We propose a memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption.
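The summary describes compressing and reconstructing momentum for matrix parameters. One way to realize that idea is to keep the momentum as a truncated-SVD factor pair, reconstructing it each step before the momentum update and re-compressing afterwards. The sketch below is a hedged illustration of that mechanism; the rank, update rule, and use of plain SVD are assumptions, not MLorc's actual algorithm.

```python
import numpy as np

def compress(M, rank):
    """Truncated SVD: store an m x n matrix as (m*r + r*n) floats."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def reconstruct(L, R):
    """Rebuild the full momentum matrix from its low-rank factors."""
    return L @ R

def momentum_step(L, R, grad, beta=0.9, rank=4):
    """Update momentum in full precision, then re-compress the result."""
    m = beta * reconstruct(L, R) + (1 - beta) * grad
    return compress(m, rank)
```

For an m x n parameter with rank r, the factors cost (m + n) * r floats instead of m * n, which is where the memory saving comes from.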
arXiv Detail & Related papers (2025-06-02T17:21:10Z) - MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models [72.61076288351201]
We propose Memory-efficient Offloaded Mini-sequence Inference (MOM). MOM partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU.
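The "mini-sequence" partitioning described above bounds peak activation memory by applying a memory-heavy layer to a long sequence chunk by chunk, so the live intermediate tensor scales with the chunk length rather than the full context. A hedged sketch of that idea on a toy MLP layer (the chunk size, layer, and shapes are illustrative, and the KV-cache offloading part is omitted):

```python
import numpy as np

def mlp_layer(x, W1, W2):
    """Toy position-wise MLP whose hidden activation dominates memory."""
    return np.maximum(x @ W1, 0.0) @ W2

def mini_sequence_apply(x, W1, W2, chunk=64):
    """Apply the layer to `chunk` positions at a time.

    Each position's output depends only on that position, so chunking is
    exact; the hidden activation is only ever (chunk, hidden)-sized.
    """
    out = np.empty((x.shape[0], W2.shape[1]))
    for start in range(0, x.shape[0], chunk):
        out[start:start + chunk] = mlp_layer(x[start:start + chunk], W1, W2)
    return out
```

Chunking is lossless here because the MLP is position-wise; layers that mix positions (attention) need a compatible decomposition instead.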
arXiv Detail & Related papers (2025-04-16T23:15:09Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization.
We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions.
In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
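The entry above observes that activation memory depends on how much state the backward pass must save. As a hedged illustration of that principle (not the paper's Approx-BP derivation or its memory-sharing strategy): ReLU's gradient can be recovered from its output alone, so the output, which the next layer needs anyway, doubles as the saved state and the input tensor can be freed.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward_from_output(y, grad_out):
    """Backward pass that needs only the forward output y.

    The mask (x > 0) equals (y > 0) everywhere the gradient is nonzero,
    so the input x does not have to be kept for backpropagation.
    """
    return grad_out * (y > 0)
```

Smooth activations such as GELU do not have this property exactly, which is why approximating their derivatives (as the summarized paper does) can unlock similar savings.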
arXiv Detail & Related papers (2024-06-24T03:09:15Z) - Scalable MatMul-free Language Modeling [9.048532540945086]
MatMul operations can be eliminated from large language models. MatMul-free models, tested at scales up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z) - Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning [67.44661423463927]
This paper introduces Sparse MeZO, a memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters.
We show that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead.
arXiv Detail & Related papers (2024-02-24T07:22:04Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zerothorder (MeZO) to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z) - SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers [29.721162097790646]
SPARTAN is a parameter efficient (PE) and computationally fast architecture for edge devices.
It adds hierarchically organized sparse memory after each Transformer layer.
It can be trained 34% faster in a few-shot setting, while performing within 0.9 points of adapters.
arXiv Detail & Related papers (2022-11-29T23:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.