10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training
- URL: http://arxiv.org/abs/2511.14124v1
- Date: Tue, 18 Nov 2025 04:17:44 GMT
- Title: 10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training
- Authors: Sabiha Afroz, Redwan Ibne Seraj Khan, Hadeel Albahar, Jingoo Han, Ali R. Butt,
- Abstract summary: Training large language models (LLMs) in the cloud faces growing memory bottlenecks due to the limited capacity and high cost of GPUs. We present 10Cache, a resource-aware tensor caching and migration system that accelerates training by intelligently coordinating memory usage across GPU, CPU, and NVMe tiers. It achieves up to 2x speedup in training time, improves GPU cache hit rate by up to 86.6x, and increases CPU/GPU memory utilization by up to 2.15x and 1.33x, respectively.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Training large language models (LLMs) in the cloud faces growing memory bottlenecks due to the limited capacity and high cost of GPUs. While GPU memory offloading to CPU and NVMe has made large-scale training more feasible, existing approaches suffer from high tensor migration latency and suboptimal device memory utilization, ultimately increasing training time and cloud costs. To address these challenges, we present 10Cache, a resource-aware tensor caching and migration system that accelerates LLM training by intelligently coordinating memory usage across GPU, CPU, and NVMe tiers. 10Cache profiles tensor execution order to construct prefetch policies, allocates memory buffers in pinned memory based on tensor size distributions, and reuses memory buffers to minimize allocation overhead. Designed for cloud-scale deployments, 10Cache improves memory efficiency and reduces reliance on high-end GPUs. Across diverse LLM workloads, it achieves up to 2x speedup in training time, improves GPU cache hit rate by up to 86.6x, and increases CPU/GPU memory utilization by up to 2.15x and 1.33x, respectively, compared to state-of-the-art offloading methods. These results demonstrate that 10Cache is a practical and scalable solution for optimizing LLM training throughput and resource efficiency in cloud environments.
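The abstract names three mechanisms: profiling tensor execution order to build a prefetch policy, allocating pinned-memory buffers sized to the observed tensor-size distribution, and reusing those buffers to avoid allocation overhead. Below is a minimal Python/PyTorch sketch of how such pieces could fit together; `PinnedBufferPool`, `PrefetchCache`, `fetch_fn`, and all other names are hypothetical illustrations, not 10Cache's actual API.

```python
import collections
import torch

class PinnedBufferPool:
    """Reuses page-locked (pinned) host buffers, bucketed by byte size."""
    def __init__(self):
        self.free = collections.defaultdict(list)    # nbytes -> [buffer, ...]

    def acquire(self, nbytes: int) -> torch.Tensor:
        bucket = self.free[nbytes]
        if bucket:
            return bucket.pop()                      # reuse: no re-allocation
        return torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)

    def release(self, buf: torch.Tensor) -> None:
        self.free[buf.numel()].append(buf)

class PrefetchCache:
    """Stages tensors onto the GPU ahead of a profiled execution order."""
    def __init__(self, order, fetch_fn, depth=2):
        self.order = order      # tensor ids recorded by a profiling pass
        self.fetch = fetch_fn   # hypothetical: loads tensor `tid` from
                                # CPU/NVMe into a pinned buffer from the pool
        self.depth = depth      # how many steps ahead to prefetch
        self.pool = PinnedBufferPool()
        self.stream = torch.cuda.Stream()            # side stream for copies
        self.staged = {}                             # tid -> (gpu_tensor, event)

    def _stage(self, tid):
        host = self.fetch(tid, self.pool)
        with torch.cuda.stream(self.stream):
            dev = host.to("cuda", non_blocking=True) # overlaps with compute
            evt = torch.cuda.Event()
            evt.record()                             # records on self.stream
        self.staged[tid] = (dev, evt)
        # A real system would also recycle `host` back into the pool once
        # the copy event fires; omitted here for brevity.

    def get(self, step):
        tid = self.order[step]
        if tid not in self.staged:                   # cold miss on first use
            self._stage(tid)
        for nxt in self.order[step + 1 : step + 1 + self.depth]:
            if nxt not in self.staged:               # stage upcoming tensors
                self._stage(nxt)
        dev, evt = self.staged.pop(tid)
        torch.cuda.current_stream().wait_event(evt)  # block only on this copy
        return dev
```

Waiting on a per-tensor event rather than on the whole copy stream is what lets the prefetches launched for future steps keep overlapping with the current step's compute.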
Related papers
- Horizon-LM: A RAM-Centric Architecture for LLM Training [26.927410607740025]
Horizon-LM is a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. On a single H200 GPU with 1.5 TB of host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2x higher training throughput than DeepSpeed ZeRO-3 with CPU offloading.
arXiv Detail & Related papers (2026-02-04T18:04:46Z)
- Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z)
- Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference [0.0]
Large Language Model (LLM) inference is increasingly constrained by GPU memory capacity rather than compute throughput. We present Harvest, an opportunistic GPU cache management framework that exploits high-bandwidth peer-to-peer GPU interconnects. We demonstrate a throughput speedup of more than 2x by using Harvest to accelerate the retrieval of two widely-used inference components.
arXiv Detail & Related papers (2026-01-30T21:29:04Z)
- Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training [9.775731832789116]
We introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting temporal regularity in memory allocation behaviors. Built as a pluggable PyTorch allocator, STWeaver reduces the fragmentation ratio on average by 79.2% (up to 100%) across both dense and sparse models, with negligible overhead.
arXiv Detail & Related papers (2025-07-22T06:39:07Z)
- Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage [9.106167012987747]
TERAIO is a framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Its design is driven by our observation that active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each iteration of the LLM training process. We show that TERAIO improves the training performance of various LLMs by 1.47x on average, and achieves 80.7% of the ideal performance.
arXiv Detail & Related papers (2025-06-06T18:57:20Z)
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive nature often leads to inefficient resource utilization during inference. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized.
arXiv Detail & Related papers (2025-03-11T11:21:35Z)
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios. For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations. For optimizer states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation [7.204881999658682]
The Key-Value (KV) cache is used to store intermediate activations for large language models. The memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution (a generic sketch of this overlap pattern appears after this list). We introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations. KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T04:03:14Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference [47.043257902725294]
We propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with a high compression ratio and low decompression overhead.
Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x.
arXiv Detail & Related papers (2024-06-17T15:55:08Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards [1.7733623930581417]
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers.
A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size.
In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design to let the large CPU memory store the memory-hungry embedding layers.
arXiv Detail & Related papers (2022-05-10T07:05:20Z)
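A mechanism that recurs across several of the systems above (10Cache, TERAIO, KVPR, FlexGen) is overlapping host-to-device transfers with GPU compute, using pinned memory and a dedicated copy stream. The following is a generic, minimal sketch of that double-buffered pattern under illustrative assumptions; `run_offloaded` and `compute` are hypothetical names, not any one paper's implementation.

```python
import torch

def run_offloaded(chunks_cpu, compute):
    """chunks_cpu: same-shaped CPU tensors; compute: fn taking a GPU tensor."""
    main = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()
    # Two device slots: one is computed on while the other is being filled.
    slots = [torch.empty_like(chunks_cpu[0], device="cuda") for _ in range(2)]
    # Page-locked host copies enable truly asynchronous DMA; a real system
    # would recycle a small pinned pool instead of pinning every chunk.
    pinned = [c.pin_memory() for c in chunks_cpu]
    ready = [torch.cuda.Event() for _ in range(2)]

    def prefetch(i):
        s = i % 2
        copy_stream.wait_stream(main)   # don't overwrite a slot still in use
        with torch.cuda.stream(copy_stream):
            slots[s].copy_(pinned[i], non_blocking=True)
            ready[s].record()           # signals that this slot is filled

    prefetch(0)
    for i in range(len(pinned)):
        if i + 1 < len(pinned):
            prefetch(i + 1)             # copy for i+1 overlaps compute for i
        main.wait_event(ready[i % 2])
        compute(slots[i % 2])           # runs on the main stream
```

Because `prefetch(i + 1)` is issued before `compute` for step i is launched, the copy stream only has to wait for work through step i-1, so the transfer for the next chunk proceeds in parallel with the current chunk's kernels.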
This list is automatically generated from the titles and abstracts of the papers on this site.