Related papers: ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving

Related papers

PLA-Serve: A Prefill-Length-Aware LLM Serving System [33.313531352453346]
PLA-Serve identifies and disaggregates requests with different prompt lengths to reduce TTFT latency.<n>We observe that prompt-length variations lead to distinct bottlenecks, motivating an adaptive scheduling strategy.<n> PLA-Serve reduces prefill latency by over 30% compared to vanilla SG under prefill**-Lang**decode disaggregation.
arXiv Detail & Related papers (2026-01-04T18:14:24Z)
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models [49.08289742711585]
We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet.<n>We show that InfiniteVL achieves over 3.6times inference speedup while maintaining constant latency and memory footprint.<n>In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache.
arXiv Detail & Related papers (2025-12-09T17:18:32Z)
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving [6.505016440664893]
AugServe is an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented large language models (LLMs)<n> Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing fluctuating time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
arXiv Detail & Related papers (2025-12-03T17:49:38Z)
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered.<n>Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation.<n>Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing [15.376910065679994]
DuetServe is a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU.<n>DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
arXiv Detail & Related papers (2025-11-06T20:18:34Z)
Dynamic Speculative Agent Planning [57.630218933994534]
Large language-model-based agents face critical deployment challenges due to prohibitive latency and inference costs.<n>We introduce Dynamic Speculative Planning (DSP), an online reinforcement learning framework that provides lossless acceleration with substantially reduced costs.<n>Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest acceleration method while reducing total cost by 30% and unnecessary cost up to 60%.
arXiv Detail & Related papers (2025-09-02T03:34:36Z)
CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency.<n>We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving [4.309392302169281]
Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead.<n>PD achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SG by up to 2x; and matches or exceeds disaggregated vLLM.
arXiv Detail & Related papers (2025-07-09T07:27:18Z)
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving [22.66354939370058]
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems.<n>A new hybrid cache scheme combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request.<n>We show that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to the state-of-the-art inference serving systems.
arXiv Detail & Related papers (2025-04-10T06:51:23Z)
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location [3.348953136575379]
HyGen is an interference-aware LLM serving system that enables efficient co-location of online and offline workloads. Our evaluation on production workloads shows that HyGen achieves up to 3.87x overall throughput and 5.84x offline throughput gains.
arXiv Detail & Related papers (2025-01-15T16:32:27Z)
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs) We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly. We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
Exploiting Student Parallelism for Efficient GPU Inference of BERT-like Models in Online Services [27.998951498347626]
We present sys for the real-world setting of GPU inference on online workloads. sys adopts stacking distillation and boosting ensemble, distilling the original deep model into a group of shallow but virtually stacked student models. Results show sys outperforms the baselines by $4.1timessim 1.6times$ in latency while maintaining accuracy and achieves up to $22.27times$ higher for workload bursts.
arXiv Detail & Related papers (2024-08-22T16:31:32Z)
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels. It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup. MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search [3.116913746878115]
We introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS) Our architecture achieves its objectives through three key advancements. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily.
arXiv Detail & Related papers (2024-08-06T03:44:06Z)
RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts. handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms. We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances. We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z)
GPU-Accelerated Machine Learning in Non-Orthogonal Multiple Access [71.58925117604039]
Non-orthogonal multiple access (NOMA) is an interesting technology that enables massive connectivity as required in future 5G and 6G networks. We propose a neural network architecture that combines the advantages of both linear and non-linear processing.
arXiv Detail & Related papers (2022-06-13T09:38:23Z)
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications. These servers must provide a bounded latency for each request to support consistent service-level objective (SLO) This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.