Horizon-LM: A RAM-Centric Architecture for LLM Training
- URL: http://arxiv.org/abs/2602.04816v2
- Date: Thu, 05 Feb 2026 02:14:18 GMT
- Title: Horizon-LM: A RAM-Centric Architecture for LLM Training
- Authors: Zhengqing Yuan, Lichao Sun, Yanfang Ye,
- Abstract summary: Horizon-LM is a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization.<n>On a single H200 GPU with 1.5,TB host RAM, Horizon-LM reliably trains models up to 120B parameters.<n>On a standard single A100 machine, Horizon-LM achieves up to 12.2$times$ higher training throughput than DeepSpeed ZeRO-3 with CPU offloading.
- Score: 26.927410607740025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5\,TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2$\times$ higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
Related papers
- Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.<n>We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier.<n>Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z) - 10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training [0.22913283036871865]
Training large language models (LLMs) in the cloud faces growing memory bottlenecks due to the limited capacity and high cost of GPU.<n>We present 10Cache, a resource-aware tensor caching and migration system that accelerates training by intelligently coordinating memory usage across GPU, CPU, and tiers.<n>It achieves up to 2x speedup in training time, improves GPU cache hit rate by up to 86.6x, and increases CPU/GPU memory utilization by up to 2.15x and 1.33x, respectively.
arXiv Detail & Related papers (2025-11-18T04:17:44Z) - Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos.<n>We propose a novel distributed inference strategy, termed DualParal.<n>Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z) - Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive nature often leads to inefficient resource utilization during inference.<n>In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized.
arXiv Detail & Related papers (2025-03-11T11:21:35Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models(LLMs) have seen rapid adoption in all domains.
Training of transformers is very expensive and often hits a memory wall''
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z) - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep
Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z) - Large Graph Convolutional Network Training with GPU-Oriented Data
Communication Architecture [19.2129567657739]
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems.
Current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features.
This approach, however, puts tremendous pressure on host memory bandwidth and the CPU.
We propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory.
arXiv Detail & Related papers (2021-03-04T21:00:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.