ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
- URL: http://arxiv.org/abs/2503.12668v1
- Date: Sun, 16 Mar 2025 21:58:29 GMT
- Title: ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
- Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
- Abstract summary: We propose a novel framework, ZO2, for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU.
- Score: 29.245719403159615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.
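At its core, the method pairs MeZO-style zeroth-order (SPSA) gradient estimation, i.e. two forward passes under antithetic parameter perturbations with the random direction regenerated from a saved seed instead of being stored, with block-wise streaming of parameters between CPU and GPU. The sketch below illustrates that combination; it is a minimal serial version, not the released ZO2 code, and the names (`streamed_forward`, `zo2_step`, `walk`) are hypothetical. The real framework additionally overlaps uploads/offloads with compute and supports low-bit AMP transfers, which are omitted here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def streamed_forward(blocks, x, y, device="cuda"):
    """Run the model block by block, keeping only one block's parameters on
    the GPU at a time; everything returns to the CPU afterwards."""
    h = x.to(device)
    for block in blocks:
        block.to(device)      # upload this block's parameters
        h = block(h)
        block.to("cpu")       # offload again to free GPU memory
    # Toy loss for illustration; a causal LM loss would go here.
    return F.mse_loss(h, y.to(device)).item()

@torch.no_grad()
def zo2_step(blocks, x, y, lr=1e-6, eps=1e-3):
    """One SPSA-style zeroth-order step: two offloaded forward passes under
    antithetic perturbations, then an in-place update along the same random
    direction, regenerated from a shared seed (the MeZO trick)."""
    seed = int(torch.randint(0, 2**31 - 1, (1,)))

    def walk(scale):
        # Re-create the identical Gaussian direction z from the seed, so z
        # never has to be stored alongside the model parameters.
        gen = torch.Generator().manual_seed(seed)
        for block in blocks:
            for p in block.parameters():
                # Sample in fp32 for portability, cast to the parameter dtype.
                z = torch.randn(p.shape, generator=gen).to(p.dtype)
                p.add_(scale * z)

    walk(+eps)
    loss_pos = streamed_forward(blocks, x, y)
    walk(-2 * eps)
    loss_neg = streamed_forward(blocks, x, y)
    walk(+eps)                                   # restore original parameters
    projected_grad = (loss_pos - loss_neg) / (2 * eps)
    walk(-lr * projected_grad)                   # SGD step along direction z
    return loss_pos
```

Because only one block's parameters (plus that block's working activations) occupy the GPU at any moment, and no gradients or optimizer states are stored at all, peak GPU memory is governed by the largest block; this is the property behind the abstract's 18GB OPT-175B claim, with the released code hiding the transfer cost by overlapping it with compute.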
Related papers
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [71.43026659686679]
Large Language Models (LLMs) have grown rapidly in size, creating challenges for efficient deployment on resource-constrained hardware.
We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
arXiv Detail & Related papers (2025-04-15T22:38:38Z) - Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck.
arXiv Detail & Related papers (2025-03-11T11:21:35Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. However, they face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models(LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
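Splitting the LLM into subgroups whose update phase runs on either device can be pictured as giving each subgroup its own optimizer whose master weights and state live where that subgroup's update executes. Below is a hedged, serial sketch of a CPU-scheduled subgroup only (the class name and structure are illustrative, not the paper's API); the paper's actual contribution, interleaving these transfers and updates with GPU work, is not reproduced.

```python
import torch

class CPUUpdatedSubgroup:
    """Keep one subgroup's master weights and AdamW state on the CPU: after
    the backward pass, copy that subgroup's gradients to the CPU, run the
    update there, and copy the refreshed weights back to the GPU."""

    def __init__(self, gpu_params, lr=1e-5):
        self.gpu_params = list(gpu_params)
        # CPU-resident master copies; the AdamW state tensors live here too.
        self.cpu_params = [torch.nn.Parameter(p.detach().cpu().clone())
                           for p in self.gpu_params]
        self.opt = torch.optim.AdamW(self.cpu_params, lr=lr)

    @torch.no_grad()
    def step(self):
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            cpu_p.grad = gpu_p.grad.to("cpu", non_blocking=True)    # offload grads
        self.opt.step()                                             # update on CPU
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            gpu_p.copy_(cpu_p.to(gpu_p.device, non_blocking=True))  # refresh GPU weights
            gpu_p.grad = None
```

A GPU-scheduled subgroup would simply keep an ordinary GPU-resident optimizer; deciding which subgroups run where, and when, is the scheduling problem the paper addresses.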
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
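The tile-based strategy can be read as an online log-sum-exp: the full N x N similarity matrix is never materialized, only one N x tile block at a time, with a running maximum and running sum carried across tiles. The sketch below is a single-GPU illustration under that reading (the function name and the InfoNCE formulation are my assumptions); the multi-level tiling across a distributed system is not shown.

```python
import torch
import torch.nn.functional as F

def tiled_infonce_loss(q, k, tau=0.07, tile=1024):
    """InfoNCE over paired embeddings (q_i, k_i) without materializing the
    full N x N similarity matrix: keys are processed in column tiles and a
    running log-sum-exp is maintained per query row."""
    n = q.shape[0]
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)

    running_max = torch.full((n,), float("-inf"), device=q.device)
    running_sum = torch.zeros(n, device=q.device)
    pos_logit = (q * k).sum(dim=-1) / tau            # diagonal (positive) logits

    for start in range(0, n, tile):
        logits = q @ k[start:start + tile].T / tau   # only an (n, tile) block
        block_max = logits.max(dim=-1).values
        new_max = torch.maximum(running_max, block_max)
        # Online log-sum-exp: rescale the old sum to the new max, add the block.
        running_sum = (running_sum * torch.exp(running_max - new_max)
                       + torch.exp(logits - new_max[:, None]).sum(dim=-1))
        running_max = new_max

    log_denominator = running_max + torch.log(running_sum)
    return (log_denominator - pos_logit).mean()
```

Memory for the similarity logits then scales with N x tile rather than N x N, which is consistent with the large memory reductions reported above as the batch grows.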
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors [11.127604539303373]
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. We present an offloading framework, LSP-Offload, that enables near-native speed LLM fine-tuning on commodity hardware.
arXiv Detail & Related papers (2024-06-14T16:59:11Z) - Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z) - HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy [55.17502828915191]
We propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step.
Our results demonstrate that HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning.
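Updating only a subset of parameters at each training step can be sketched as: freeze every layer group except one, run a full forward and backward pass, apply the update to that group alone, and cycle through the groups across steps, so per-step gradient (and optimizer-state) memory is bounded by the largest group. The sketch below uses plain SGD to sidestep how stateful optimizers carry their state between visits to a group, which HiFT handles but this toy version does not; all names are hypothetical.

```python
import torch

def hift_style_step(layer_groups, active_idx, loss_fn, batch, lr=1e-5):
    """One step in the spirit of HiFT: only the active group of layers gets
    gradients and a parameter update, so per-step gradient memory is bounded
    by the largest group. Plain SGD keeps the sketch stateless."""
    # Freeze everything, then unfreeze only the active group.
    for group in layer_groups:
        for layer in group:
            layer.requires_grad_(False)
    active_params = []
    for layer in layer_groups[active_idx]:
        layer.requires_grad_(True)
        active_params.extend(layer.parameters())

    loss = loss_fn(batch)              # full forward through all layers
    loss.backward()                    # gradients only for the active group
    with torch.no_grad():
        for p in active_params:
            p.add_(p.grad, alpha=-lr)  # SGD update on the active subset
            p.grad = None
    return loss.item()
```

A round-robin schedule would simply call this with active_idx = step % len(layer_groups).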
arXiv Detail & Related papers (2024-01-26T21:14:32Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
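Idea (ii), the Dense-and-Sparse decomposition, can be illustrated directly: pull a small fraction of entries out into a full-precision sparse tensor and quantize the dense remainder to a few bits. The sketch below substitutes magnitude-based outlier selection and uniform quantization for the paper's sensitivity-based, non-uniform scheme, so it only illustrates the storage layout; function names are mine.

```python
import torch

def dense_and_sparse_decompose(w, outlier_frac=0.005, bits=3):
    """Split a weight matrix into (i) a small set of outlier entries kept in
    full precision as a sparse tensor and (ii) a dense remainder quantized to
    a low bit width. Toy version: magnitude outliers and uniform symmetric
    quantization stand in for the paper's sensitivity-based, non-uniform
    quantization."""
    k = max(1, int(outlier_frac * w.numel()))
    threshold = w.abs().flatten().topk(k).values.min()
    outlier_mask = w.abs() >= threshold

    sparse_part = (w * outlier_mask).to_sparse()       # full-precision outliers
    dense_part = w * ~outlier_mask                     # remainder to quantize

    levels = 2 ** bits - 1                             # e.g. 7 levels at 3 bits
    scale = dense_part.abs().max() / max(levels // 2, 1) + 1e-12
    q = torch.clamp((dense_part / scale).round(), -(levels // 2), levels // 2)
    dense_dequant = q * scale                          # what inference would use
    return dense_dequant, sparse_part

def reconstruct(dense_dequant, sparse_part):
    # Inference applies the low-bit dense matrix plus a sparse correction.
    return dense_dequant + sparse_part.to_dense()
```

reconstruct() shows how a layer would combine the two parts at load time; per the summary above, the paper stores the outliers in an efficient sparse format and keeps the dense part in its non-uniform low-bit representation.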
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)