Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
- URL: http://arxiv.org/abs/2508.19559v1
- Date: Wed, 27 Aug 2025 04:22:02 GMT
- Title: Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
- Authors: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu,
- Abstract summary: Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving.
- Score: 5.786961198115219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
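The abstract describes jointly scaling the prefill and decode pools from a single robust metric so the two stages stay in balance. A minimal sketch of such a coordinated policy is below; the function name, the utilization signal, and the target/ratio parameters are all hypothetical illustrations, not the paper's actual policy, which the abstract does not specify.

```python
def scale_pools(metric: float, prefill: int, decode: int,
                target: float = 0.6, pd_ratio: float = 1.0,
                min_replicas: int = 1) -> tuple[int, int]:
    """Jointly scale prefill and decode pools from one utilization signal.

    Driving both pools from the same metric keeps the prefill/decode
    ratio fixed, which is the architectural-balance idea the abstract
    describes. All parameter values here are illustrative assumptions.
    """
    # Proportional scaling: grow or shrink decode capacity toward the
    # target utilization.
    factor = metric / target
    new_decode = max(min_replicas, round(decode * factor))
    # Size the prefill pool relative to decode by a configured ratio,
    # so the two stages scale in lockstep.
    new_prefill = max(min_replicas, round(new_decode * pd_ratio))
    return new_prefill, new_decode
```

For example, if the shared metric reads 0.9 against a 0.6 target, both pools grow by 1.5x together, rather than one stage scaling independently and starving the other.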
Related papers
- ECHO: Encoding Communities via High-order Operators [8.970269049715933]
Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. We introduce ECHO, a scalable, self-supervised architecture that reframes community detection as an adaptive, multi-scale diffusion process. ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients.
arXiv Detail & Related papers (2026-02-25T22:14:29Z) - Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap [0.8763937152756086]
We argue for finer-grain compute-communication overlap, which we term FiCCO. We show that FiCCO opens up a wider design space of execution schedules than is possible at shard level alone. We present a detailed characterization of these inefficiency losses, lay out a design space of FiCCO schedules, and overlay the schedules with their concomitant inefficiency signatures.
arXiv Detail & Related papers (2025-12-11T02:43:27Z) - MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention [8.605284164957984]
We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. We validate our method on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets.
arXiv Detail & Related papers (2025-12-01T14:43:46Z) - A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs [7.577235739757108]
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations.
arXiv Detail & Related papers (2025-11-21T10:55:44Z) - Edge Collaborative Gaussian Splatting with Integrated Rendering and Communication [69.23838350582764]
We present edge collaborative Gaussian splatting (ECO-GS), where each user can switch between a local small GS model for real-time rendering and a remote large GS model to guarantee fidelity. We propose integrated rendering and communication (IRAC), which jointly optimizes the low-cost rendering status and edge power allocation.
arXiv Detail & Related papers (2025-10-26T15:33:29Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs [36.158374493924455]
Graph Neural Networks (GNNs) have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). We propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions. This is the first interaction-based GNN to achieve less than 60 ns latency, and it currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system.
arXiv Detail & Related papers (2025-08-21T11:40:49Z) - STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic collaborative perception pipeline for heterogeneous agents. It minimizes computational overhead, enhances scalability, and preserves model security. As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems toward Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Multi-Level GNN Preconditioner for Solving Large Scale Problems [0.0]
Graph Neural Networks (GNNs) are great for learning from unstructured data like meshes but are often limited to small-scale problems.
This paper introduces a novel preconditioner integrating a GNN model within a multi-level Domain Decomposition framework.
The proposed GNN-based preconditioner is used to enhance the efficiency of a Krylov method, resulting in a hybrid solver that can converge with any desired level of accuracy.
arXiv Detail & Related papers (2024-02-13T08:50:14Z) - Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an Allreduce-compatible gradient quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
Adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of both the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks [10.278350434623107]
Quantized neural networks typically require smaller memory footprints and lower computation complexity, which is crucial for efficient deployment.
We present an adaptive-mapping quantization method to learn an optimal latent sub-distribution that is inherent within models.
Experiments on image classification and object detection over various modern architectures demonstrate the effectiveness, generalization property, and transferability of the proposed method.
arXiv Detail & Related papers (2021-12-30T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.