xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
- URL: http://arxiv.org/abs/2510.21048v1
- Date: Thu, 23 Oct 2025 23:16:27 GMT
- Title: xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
- Authors: Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib
- Abstract summary: Estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing. We propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits.
- Score: 2.2991119948183525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis, or on machine learning over historical data, often fail to capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. The key challenge, then, is to precisely estimate dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements a priori. We conducted a thorough evaluation of xMem against state-of-the-art solutions using workloads from 25 different models, including architectures such as Convolutional Neural Networks and Transformers. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits: it decreases the median relative error by 91% and significantly reduces the probability that an estimate fails when used as a safe OOM threshold, by 75%, meaning that the estimated value can often be used directly without causing OOM. Ultimately, these improvements lead to a 368% increase in memory conservation potential over current solutions.
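The paper itself details xMem's mechanism; as a rough illustration of the general approach the abstract names (CPU-only dynamic analysis that accounts for allocator nuances), here is a minimal sketch that traces a forward pass on PyTorch's meta device and rounds each allocation to the 512-byte block granularity of PyTorch's CUDA caching allocator. The function names are hypothetical and this is not xMem's implementation:

```python
import torch
import torch.nn as nn

BLOCK = 512  # PyTorch's CUDA caching allocator rounds allocations to 512-byte blocks

def rounded_bytes(t: torch.Tensor) -> int:
    """Size of a tensor after allocator block rounding."""
    raw = t.numel() * t.element_size()
    return ((raw + BLOCK - 1) // BLOCK) * BLOCK

def estimate_peak_bytes(model: nn.Module, input_shape) -> int:
    """Very rough peak estimate (parameters + gradients + live activations),
    traced entirely on the CPU-only 'meta' device -- no GPU touched."""
    model = model.to("meta")
    static = sum(rounded_bytes(p) * 2 for p in model.parameters())  # params + grads
    live = peak = 0
    hooks = []

    def hook(_mod, _inp, out):
        nonlocal live, peak
        if isinstance(out, torch.Tensor):
            live += rounded_bytes(out)  # activation retained for backward
            peak = max(peak, static + live)

    for m in model.modules():
        hooks.append(m.register_forward_hook(hook))
    model(torch.empty(*input_shape, device="meta"))  # shapes propagate, no data
    for h in hooks:
        h.remove()
    return peak

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
    print(f"estimated peak: {estimate_peak_bytes(net, (64, 1024)) / 2**20:.1f} MiB")
```

A real estimator must also model backward-pass temporaries, optimizer states, workspace allocations, and fragmentation, all of which this sketch ignores.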
Related papers
- MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning [78.46301394559903]
Large Language Models (LLMs) are increasingly used for long-duration tasks.
Current methods face a trade-off between cost and accuracy.
MemSifter is a novel framework that offloads the memory retrieval process to a small-scale proxy model.
arXiv Detail & Related papers (2026-03-03T02:57:38Z)
- GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms.
We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units.
Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z)
- GPU Memory Prediction for Multimodal Model Training [12.707615972878472]
We propose a framework that predicts the peak GPU memory usage by analyzing the model architecture and training behavior of multimodal models.
Our framework achieves high prediction accuracy, with an average MAPE of 8.7%.
arXiv Detail & Related papers (2025-11-26T06:24:58Z)
- Accurate GPU Memory Prediction for Deep Learning Jobs through Dynamic Analysis [0.3867363075280544]
Out-of-Memory errors present a primary impediment to model training and efficient resource utilization.
VeritasEst is an entirely CPU-based analysis tool capable of accurately predicting the peak GPU memory required for Deep Learning training tasks.
Its performance was validated through thousands of experimental runs across convolutional neural network (CNN) models.
arXiv Detail & Related papers (2025-04-04T19:20:03Z)
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive nature often leads to inefficient resource utilization during inference.
In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized.
arXiv Detail & Related papers (2025-03-11T11:21:35Z)
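As a back-of-the-envelope check of the memory-bound claim above, one can compare the arithmetic intensity of a batched decode-step matrix-vector product against a GPU's compute-to-bandwidth balance. The hardware figures below are illustrative A100-class numbers, not taken from the paper:

```python
# One decode step through a single weight matrix: a batched GEMV does
# ~2 * batch * d_in * d_out FLOPs while reading d_in * d_out weights once.
d_in = d_out = 8192
batch = 64
bytes_per_weight = 2  # fp16

flops = 2 * batch * d_in * d_out               # batched GEMV work
bytes_moved = d_in * d_out * bytes_per_weight  # weight traffic dominates
intensity = flops / bytes_moved                # = batch = 64 FLOPs/byte

# Illustrative A100-class balance: ~312 TFLOP/s over ~2 TB/s ~= 156 FLOPs/byte.
machine_balance = 312e12 / 2e12
print(f"arithmetic intensity: {intensity:.0f} FLOPs/byte "
      f"(memory-bound below ~{machine_balance:.0f})")
```

Even at batch 64 the intensity sits below the balance point, which is consistent with the paper's finding that large-batch decoding stays memory-bound.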
- HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU.
We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z)
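This is not HEADINFER's implementation, but a toy sketch of the head-wise idea the entry describes: the full KV cache resides in pinned CPU RAM, and only one head's K/V is staged on the GPU at a time. The shapes and the single-query decode case are illustrative:

```python
import torch

n_heads, seq_len, head_dim = 32, 4096, 128

# Full KV cache lives in pinned CPU RAM; it is never GPU-resident all at once.
k_cpu = torch.zeros(n_heads, seq_len, head_dim, pin_memory=True)
v_cpu = torch.zeros(n_heads, seq_len, head_dim, pin_memory=True)

def attend(q: torch.Tensor) -> torch.Tensor:
    """q: (n_heads, 1, head_dim) on the GPU. Streams K/V one head at a time."""
    outs = []
    for h in range(n_heads):
        # Stage only this head's K/V (async H2D copy, enabled by pinned memory).
        k = k_cpu[h].to("cuda", non_blocking=True)
        v = v_cpu[h].to("cuda", non_blocking=True)
        scores = (q[h] @ k.T) / head_dim ** 0.5
        outs.append(torch.softmax(scores, dim=-1) @ v)
    return torch.stack(outs)  # peak GPU KV footprint: one head, not n_heads
```

A real system would overlap the per-head copies with attention compute; this sketch serializes them for clarity.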
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
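The paper designs its own reversible architectures; the sketch below shows only the standard additive-coupling pattern that makes such networks invertible, so intermediate activations can be recomputed in the backward pass instead of stored:

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive coupling: the outputs determine the inputs exactly, so the
    block's intermediate activations need not be kept in memory."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs -- no stored activations needed.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```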
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z)
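A toy rendition of the subgroup idea above, assuming one subgroup's Adam state is kept on the CPU alongside an fp32 mirror of its weights; the paper's actual contribution, interleaving these transfers with GPU compute, is omitted here:

```python
import torch

class CPUOffloadedAdam:
    """Toy subgroup update: gradients are copied to a CPU mirror, the Adam
    step (and its m/v state) runs on the CPU, and updated weights are copied
    back. Real systems overlap these copies with GPU work; this sketch doesn't."""
    def __init__(self, gpu_params, lr=1e-4):
        self.gpu_params = list(gpu_params)
        # fp32 CPU mirrors hold the master weights and all optimizer state.
        self.cpu_params = [p.detach().to("cpu", torch.float32)
                           for p in self.gpu_params]
        self.opt = torch.optim.Adam(self.cpu_params, lr=lr)

    def step(self):
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            cpu_p.grad = gpu_p.grad.to("cpu", torch.float32)  # D2H gradient copy
        self.opt.step()        # update runs entirely on the CPU
        self.opt.zero_grad()
        for gpu_p, cpu_p in zip(self.gpu_params, self.cpu_params):
            gpu_p.data.copy_(cpu_p.data)  # H2D copy of the updated weights
```

A second, ordinary GPU-resident optimizer would handle the remaining subgroup; the split ratio trades GPU memory for PCIe traffic.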
- LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs [4.536118764799076]
Fine-tuning pre-trained large language models with limited hardware presents challenges due to GPU memory constraints.
We introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods.
We show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%.
arXiv Detail & Related papers (2024-04-16T22:11:35Z)
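LLMem models distributed fine-tuning in detail; for intuition only, the usual single-GPU static accounting (weights + gradients + Adam moments, activations excluded) already shows why such estimators matter. The byte counts below are generic conventions, not LLMem's model:

```python
def finetune_memory_gib(n_params: float,
                        weight_bytes: int = 2,    # fp16/bf16 weights
                        grad_bytes: int = 2,      # fp16 gradients
                        adam_state_bytes: int = 8 # two fp32 moments per param
                        ) -> float:
    """Static lower bound on full fine-tuning memory; activations,
    fragmentation, and framework overhead come on top -- exactly the parts
    memory estimators have to model."""
    total = n_params * (weight_bytes + grad_bytes + adam_state_bytes)
    return total / 2**30

# A 7B-parameter model needs ~78 GiB before any activations, which is why
# full fine-tuning rarely fits on a single 80 GB GPU.
print(f"{finetune_memory_gib(7e9):.0f} GiB")
```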
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
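The QLoRA recipe is directly expressible with the Hugging Face transformers and peft libraries; a minimal sketch, where the checkpoint name and LoRA hyperparameters are example choices rather than the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen 4-bit NF4 base model; only the LoRA adapters receive gradients.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4, introduced by QLoRA
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example checkpoint, not the paper's
    quantization_config=bnb,
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()       # a tiny fraction of the base weights
```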
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
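EVEREST's actual token-scoring rule is the paper's own; as a toy stand-in for the idea of keeping motion-rich tokens, the sketch below ranks video patch tokens by their change between consecutive frames and keeps the top k:

```python
import torch

def keep_motion_tokens(frames: torch.Tensor, keep_ratio: float = 0.25):
    """frames: (T, N, D) patch tokens per frame. Scores each of the N token
    positions by mean temporal change and keeps the top-k positions -- a toy
    stand-in for motion-aware token selection."""
    T, N, D = frames.shape
    motion = (frames[1:] - frames[:-1]).abs().mean(dim=(0, 2))  # (N,)
    k = max(1, int(N * keep_ratio))
    idx = motion.topk(k).indices
    return frames[:, idx], idx

tokens = torch.randn(16, 196, 768)   # 16 frames, 14x14 patches, dim 768
kept, idx = keep_motion_tokens(tokens)
print(kept.shape)                    # torch.Size([16, 49, 768])
```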