LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
- URL: http://arxiv.org/abs/2404.10933v1
- Date: Tue, 16 Apr 2024 22:11:35 GMT
- Title: LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
- Authors: Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha,
- Abstract summary: Fine-tuning pre-trained large language models with limited hardware presents challenges due to GPU memory constraints.
We introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods.
We show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%.
- Score: 4.536118764799076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units ( GPU) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z) - Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients [24.58231358634904]
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory.
We propose Grass (GRAdient Stuctured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates.
arXiv Detail & Related papers (2024-06-25T15:50:32Z) - Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors [11.938205508966808]
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU.
We present an offloading framework, LSP_Offload, that enables near-native speed LLM fine-tuning on commodity hardware.
arXiv Detail & Related papers (2024-06-14T16:59:11Z) - Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - GPU-Accelerated Primal Learning for Extremely Fast Large-Scale
Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON.
We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Out-of-Core GPU Gradient Boosting [0.0]
We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time.
This is the first out-of-core GPU implementation of gradient boosting.
arXiv Detail & Related papers (2020-05-19T00:41:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.