A Universal Load Balancing Principle and Its Application to Large Language Model Serving
- URL: http://arxiv.org/abs/2601.17855v2
- Date: Sun, 01 Feb 2026 05:12:19 GMT
- Title: A Universal Load Balancing Principle and Its Application to Large Language Model Serving
- Authors: Zixi Chen, Tianci Bu, Chendong Song, Xin Lu, Yinyu Ye, Zijie Zhou,
- Abstract summary: In large language model inference alone, this translates to gigawatt-hours of wasted electricity daily. We develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. The resulting energy savings can exceed 52% for modern hardware at fleet scale.
- Score: 12.668439908706604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over 40% of computational power in Large Language Model (LLM) serving systems can be systematically wasted - not from hardware limits, but from load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In large language model inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.
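The waste mechanism described in the abstract can be illustrated with a small sketch. It is not the paper's method: it only assumes that each worker's step time is proportional to its load and that a step ends when the slowest worker finishes. The function name `barrier_idle_fraction` and the load values are hypothetical.

```python
def barrier_idle_fraction(loads):
    """Fraction of worker-time spent idle in one barrier-synchronized step.

    Assumption (not from the paper): step time is proportional to load,
    and the barrier releases only when the most loaded worker is done.
    """
    if not loads or max(loads) <= 0:
        raise ValueError("need at least one positive load")
    step_time = max(loads)              # the straggler gates the step
    busy = sum(loads)                   # useful worker-time
    capacity = step_time * len(loads)   # worker-time consumed by the step
    return 1.0 - busy / capacity


# Hypothetical per-worker loads for a single step.
loads = [10, 6, 6, 2]
idle = barrier_idle_fraction(loads)
print(f"idle fraction: {idle:.0%}")
```

With perfectly balanced loads the idle fraction is zero, so any rebalancing that lowers the maximum load relative to the mean directly shrinks the wasted worker-time.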
Related papers
- HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that can boost distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z)
- RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. We present RollArt, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z)
- GANGR: GAN-Assisted Scalable and Efficient Global Routing Parallelization [0.6117371161379208]
Global routing is a critical stage in electronic design automation (EDA). This paper introduces Wasserstein generative adversarial networks (WGANs) to enable more effective parallelization. The proposed algorithm is tested on the latest ISPD'24 contest benchmarks, demonstrating up to 40% reduction with only 0.002% degradation in routing quality as compared to state-of-the-art routers.
arXiv Detail & Related papers (2025-11-21T00:32:33Z)
- Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration [4.46185759083096]
This paper applies a Heterogeneous-Agent Reinforcement Learning framework to enable coordinated restoration across interconnected microgrids. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.
arXiv Detail & Related papers (2025-11-18T18:23:35Z)
- Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning [6.742598086990326]
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding.
arXiv Detail & Related papers (2025-11-18T16:12:21Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT). We propose FairBatching, a novel system that enforces fair resource allocation between prefill and decode tasks.
arXiv Detail & Related papers (2025-10-16T07:43:56Z)
- Laminar: A Scalable Asynchronous RL Post-Training Framework [20.127034898123508]
Long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current RL systems rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture.
arXiv Detail & Related papers (2025-10-14T15:29:14Z)
- PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis [75.14189839277928]
We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. Experiments across benchmark settings show that PowerGrow outperforms prior diffusion models in fidelity and diversity. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
arXiv Detail & Related papers (2025-08-29T01:47:27Z)
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models (DLLMs) offer efficient parallel generation and capable global modeling. The dominant application of DLLMs is hindered by the need for a statically predefined generation length. We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z)
- AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training [24.60677187852425]
Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks. Task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. We propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training.
arXiv Detail & Related papers (2025-07-02T12:45:34Z)
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Learning Mean-Field Control for Delayed Information Load Balancing in Large Queuing Systems [26.405495663998828]
In this work, we consider a multi-agent load balancing system, with delayed information, consisting of many clients (load balancers) and many parallel queues.
We apply policy gradient reinforcement learning algorithms to find an optimal load balancing solution.
Our approach is scalable but also shows good performance when compared to the state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ)
arXiv Detail & Related papers (2022-08-09T13:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.