Related papers: HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location

Related papers

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models offer efficient parallel generation and capable global modeling.<n>The dominant application ofDLLMs is hindered by the need for a statically predefined generation length.<n>We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z)
PolyServe: Efficient Multi-SLO Serving at Scale [6.147741784378271]
PolyServe is a novel multi-SLO scheduling policy at scale that maintains high SLO attainment while maximizing throughput.<n> PolyServe achieves 1.23x goodput gain compared to existing policies, achieving up to 92.5% of optimal goodput.
arXiv Detail & Related papers (2025-07-17T05:54:42Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads. Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3$times$ achieves and up to 10.3$times$ SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z)
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs) StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving [22.66354939370058]
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems. A new hybrid cache scheme combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request. We show that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to the state-of-the-art inference serving systems.
arXiv Detail & Related papers (2025-04-10T06:51:23Z)
Autellix: An Efficient Serving Engine for LLM Agents as General Programs [59.673243129044465]
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs. Existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. We introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies.
arXiv Detail & Related papers (2025-02-19T18:59:30Z)
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding [12.106234303559571]
We present AdaServe, the first serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding.<n>AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm.<n>It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput.
arXiv Detail & Related papers (2025-01-21T14:15:01Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs) We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving [15.01982917560918]
This paper proposes to harvest stranded GPU resources for offline LLM inference tasks. We built ConServe, an LLM serving system that contains an execution engine that preempts running offline tasks. Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks.
arXiv Detail & Related papers (2024-10-02T04:12:13Z)
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner. Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction [8.705908108054878]
Large models (LLMs) have been driving a new wave of AI applications across numerous domains. We present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths.
arXiv Detail & Related papers (2024-04-12T14:46:15Z)
SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances. We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs) FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of edge computing paradigm and parallel split learning (PSL) We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
A Graph Neural Networks based Framework for Topology-Aware Proactive SLA Management in a Latency Critical NFV Application Use-case [0.34376560669160383]
Recent advancements in 5G and 6G have led to the emergence of latency-critical applications delivered via a Network-series (NFV) enabled paradigm. We propose a proactive SLA management framework leveraging Graph Neural Networks (GNN) and Deep Reinforcement Learning (DRL) to balance the trade-off between efficiency and reliability.
arXiv Detail & Related papers (2022-11-10T23:22:05Z)
An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things [19.277349546331557]
Time Sensitive Network (TSN) is recently researched to realize low latency communication via deterministic scheduling. Non-collision theory based deterministic scheduling (NDS) method is proposed to achieve ultra-low latency communication for the time-sensitive flows. Experiment results demonstrate that NDS/DQS can well support deterministic ultra-low latency services and guarantee efficient bandwidth utilization.
arXiv Detail & Related papers (2022-07-17T16:52:51Z)
Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. One of the main challenges comes from the real-time implementation of these algorithms. This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices. We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time. Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z)
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. We present EdgeBERT, an in-depth algorithm- hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.