Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
- URL: http://arxiv.org/abs/2508.19559v1
- Date: Wed, 27 Aug 2025 04:22:02 GMT
- Title: Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
- Authors: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu,
- Abstract summary: Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving.
- Score: 5.786961198115219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
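The abstract describes jointly scaling the prefill and decode pools from a single robust metric so the two stages stay in balance. A minimal sketch of such a coordinated policy is below; the function name, the utilization signal, and the target/ratio parameters are all hypothetical illustrations, not the paper's actual policy, which the abstract does not specify.

```python
def scale_pools(metric: float, prefill: int, decode: int,
                target: float = 0.6, pd_ratio: float = 1.0,
                min_replicas: int = 1) -> tuple[int, int]:
    """Jointly scale prefill and decode pools from one utilization signal.

    Driving both pools from the same metric keeps the prefill/decode
    ratio fixed, which is the architectural-balance idea the abstract
    describes. All parameter values here are illustrative assumptions.
    """
    # Proportional scaling: grow or shrink decode capacity toward the
    # target utilization.
    factor = metric / target
    new_decode = max(min_replicas, round(decode * factor))
    # Size the prefill pool relative to decode by a configured ratio,
    # so the two stages scale in lockstep.
    new_prefill = max(min_replicas, round(new_decode * pd_ratio))
    return new_prefill, new_decode
```

For example, if the shared metric reads 0.9 against a 0.6 target, both pools grow by 1.5x together, rather than one stage scaling independently and starving the other.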
Related papers
- ECHO: Encoding Communities via High-order Operators [8.970269049715933]
Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. We introduce ECHO, a scalable, self-supervised architecture that reframes community detection as an adaptive, multi-scale diffusion process. ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients.
arXiv Detail & Related papers (2026-02-25T22:14:29Z) - Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap [0.8763937152756086]
We argue for finer-grain compute-communication overlap, which we term FiCCO. We show that FiCCO opens up a wider design space of execution schedules than is possible at shard level alone. We present a detailed characterization of these inefficiency losses, lay out a design space of FiCCO schedules, and overlay the schedules with their concomitant inefficiency signatures.
arXiv Detail & Related papers (2025-12-11T02:43:27Z) - MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention [8.605284164957984]
We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. We validate our method on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets.
arXiv Detail & Related papers (2025-12-01T14:43:46Z) - A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs [7.577235739757108]
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations.
arXiv Detail & Related papers (2025-11-21T10:55:44Z) - Edge Collaborative Gaussian Splatting with Integrated Rendering and Communication [69.23838350582764]
We present edge collaborative Gaussian splatting (ECO-GS), where each user can switch between a local small GS model for real-time rendering and a remote large GS model to guarantee fidelity. We propose integrated rendering and communication (IRAC), which jointly optimizes the low-cost rendering status and edge power allocation.
arXiv Detail & Related papers (2025-10-26T15:33:29Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs [36.158374493924455]
Graph Neural Networks (GNNs) have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). We propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions. This is the first interaction-based GNN to achieve less than 60 ns latency, and it currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system.
arXiv Detail & Related papers (2025-08-21T11:40:49Z) - STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic collaborative perception pipeline for heterogeneous agents. It minimizes computational overhead, enhances scalability, and preserves model security. As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems toward Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Multi-Level GNN Preconditioner for Solving Large Scale Problems [0.0]
Graph Neural Networks (GNNs) are great for learning from unstructured data like meshes but are often limited to small-scale problems.
This paper introduces a novel preconditioner integrating a GNN model within a multi-level Domain Decomposition framework.
The proposed GNN-based preconditioner is used to enhance the efficiency of a Krylov method, resulting in a hybrid solver that can converge with any desired level of accuracy.
arXiv Detail & Related papers (2024-02-13T08:50:14Z) - Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an Allreduce-compatible gradient quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
Adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of both the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks [10.278350434623107]
Quantized neural networks typically require smaller memory footprints and lower computation complexity, which is crucial for efficient deployment.
We present an adaptive-mapping quantization method to learn an optimal latent sub-distribution that is inherent within models.
Experiments on image classification and object detection over various modern architectures demonstrate the effectiveness, generalization property, and transferability of the proposed method.
arXiv Detail & Related papers (2021-12-30T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.