Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
- URL: http://arxiv.org/abs/2410.08791v1
- Date: Fri, 11 Oct 2024 13:17:05 GMT
- Title: Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
- Authors: Reza Abbasi, Sernam Lim,
- Abstract summary: This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
- Score: 40.41898661688188
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline's performance across various models and hardware setups. The method includes two key parameters that allow fine-tuning the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, ensuring that the original model's output remains unchanged. Superpipeline's simplicity and flexibility make it useful for researchers and professionals working with advanced AI models on limited hardware. It enables the use of larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications. This work marks an important step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at https://github.com/abbasiReza/super-pipeline.
Related papers
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - Merak: An Efficient Distributed DNN Training Framework with Automated 3D
Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speedup the training performance over the state-of-the-art 3D parallelism frameworks of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Harmony: Overcoming the hurdles of GPU memory capacity to train massive
DNN models on commodity servers [13.620650014358413]
Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers who might have access to only limited resources is limited memory capacity compared to model size.
arXiv Detail & Related papers (2022-02-02T22:16:27Z) - Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present'model spilling', a technique aimed at models such as Transformers and CNNs to move groups of layers between DRAM and GPU memory.
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
arXiv Detail & Related papers (2021-10-16T18:13:57Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of computes and memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z) - Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.