SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
- URL: http://arxiv.org/abs/2601.20309v1
- Date: Wed, 28 Jan 2026 07:01:46 GMT
- Title: SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
- Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang
- Abstract summary: We present SuperInfer, a high-performance Large Language Model (LLM) inference system designed for emerging Superchips (e.g., NVIDIA GH200). SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips. We show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems.
- Score: 13.816966749411037
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with a tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlock the full potential of Superchips for responsive LLM serving.
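To make the rotation idea concrete, the following is a minimal Python sketch of an SLO-aware rotary admission loop: when the GPU KV-cache budget is exhausted, the running request with the latest deadline is rotated out to CPU memory so a more urgent waiting request can be admitted. All names (`RotarySchedulerSketch`, `kv_bytes`, the latest-deadline victim rule) are illustrative assumptions, not SuperInfer's actual RotaSched algorithm or DuplexKV transfer engine.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                       # absolute TTFT deadline from the SLO
    req_id: int = field(compare=False)
    kv_bytes: int = field(compare=False)  # KV-cache footprint when admitted
    on_gpu: bool = field(compare=False, default=False)

class RotarySchedulerSketch:
    """Toy SLO-aware rotation policy (hypothetical, not RotaSched itself)."""

    def __init__(self, gpu_kv_budget: int):
        self.budget = gpu_kv_budget
        self.used = 0
        self.running: list[Request] = []  # requests resident on the GPU
        self.waiting: list[Request] = []  # min-heap ordered by deadline

    def submit(self, req: Request) -> None:
        heapq.heappush(self.waiting, req)

    def step(self) -> None:
        # Admit the most urgent waiting request; if the KV budget would be
        # exceeded, rotate out running requests with later deadlines (their
        # KV caches would move to CPU memory over NVLink-C2C on a Superchip).
        while self.waiting:
            urgent = self.waiting[0]
            while self.used + urgent.kv_bytes > self.budget and self.running:
                victim = max(self.running, key=lambda r: r.deadline)
                if victim.deadline <= urgent.deadline:
                    return                # everyone on the GPU is more urgent
                self.running.remove(victim)
                self.used -= victim.kv_bytes
                victim.on_gpu = False     # rotated out: KV now lives on CPU
                heapq.heappush(self.waiting, victim)
            if self.used + urgent.kv_bytes > self.budget:
                return                    # too large even for an empty GPU
            heapq.heappop(self.waiting)
            urgent.on_gpu = True
            self.used += urgent.kv_bytes
            self.running.append(urgent)
```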
Related papers
- ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration [1.2879848319971192]
Offloading KV caches to host memory limits effective memory usage. We introduce ORBITFLOW, a fine-grained and adaptive KV cache management system. Our experiments demonstrate that ORBITFLOW improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively.
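As a rough illustration of fine-grained KV cache placement, the sketch below scores fixed-size KV blocks by how soon they will be reused and keeps only the soonest-reused blocks under the GPU budget. The block granularity and reuse-distance policy are assumptions for illustration, not ORBITFLOW's actual reconfiguration mechanism.

```python
# Hypothetical block-granularity placement: keep the KV blocks that will be
# touched soonest on the GPU, spill the rest to host memory.
def place_kv_blocks(blocks: list[tuple[int, int]], gpu_budget: int):
    """blocks: (block_id, next_use_step) pairs; returns (gpu_ids, host_ids)."""
    ranked = sorted(blocks, key=lambda b: b[1])          # soonest reuse first
    gpu = [bid for bid, _ in ranked[:gpu_budget]]
    host = [bid for bid, _ in ranked[gpu_budget:]]
    return gpu, host

# Blocks 1 and 3 are reused soonest, so they stay on the GPU.
print(place_kv_blocks([(0, 3), (1, 1), (2, 7), (3, 2)], gpu_budget=2))
```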
arXiv Detail & Related papers (2026-01-05T04:02:34Z)
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models [49.08289742711585]
We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. We show that InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache.
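For reference, sliding window attention restricts each token to the most recent `window` keys, which is what yields linear (O(n·w)) rather than quadratic attention cost. Below is a minimal PyTorch mask sketch of generic SWA, independent of InfiniteVL's actual Gated DeltaNet integration.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query i may attend to key j iff i - window < j <= i."""
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)  # rel[i, j] = j - i
    return (rel <= 0) & (rel > -window)        # causal and within the window

print(sliding_window_mask(6, 3).int())
```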
arXiv Detail & Related papers (2025-12-09T17:18:32Z)
- Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony [78.70328630805041]
ROLL Flash is a system that extends ROLL with native support for asynchronous RL post-training. We show that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training.
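The core asynchrony idea can be sketched as a producer/consumer pair: rollout workers keep generating trajectories while the trainer consumes them, rather than alternating in lockstep. This toy queue is an assumption-level illustration, not ROLL Flash's architecture.

```python
import queue
import threading
import time

trajectories: "queue.Queue[str]" = queue.Queue(maxsize=8)

def rollout_worker(n: int) -> None:
    for i in range(n):
        time.sleep(0.01)                        # stand-in for generation latency
        trajectories.put(f"traj-{i}")           # hand off as soon as it is ready

def trainer(n: int) -> None:
    for _ in range(n):
        print("update on", trajectories.get())  # no waiting for a full batch

w = threading.Thread(target=rollout_worker, args=(6,))
t = threading.Thread(target=trainer, args=(6,))
w.start(); t.start(); w.join(); t.join()
```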
arXiv Detail & Related papers (2025-10-13T12:41:27Z)
- VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving [13.494819588196371]
VoltanaLLM is a system for energy-efficient Large Language Model (LLM) serving. It co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures. It achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate.
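A feedback-driven frequency controller can be approximated with a simple proportional rule: lower the decode GPU clock while measured time-between-tokens stays safely under the SLO, and raise it otherwise. The gain, clock bounds, and 10% headroom below are illustrative assumptions, not VoltanaLLM's controller.

```python
FREQ_MIN, FREQ_MAX = 600.0, 1980.0        # MHz; assumed clock range

def next_frequency(freq: float, tbt_ms: float, slo_ms: float,
                   gain: float = 20.0) -> float:
    headroom = 0.9 * slo_ms - tbt_ms      # keep a 10% safety margin
    freq -= gain * headroom               # slack -> slow down and save energy
    return min(FREQ_MAX, max(FREQ_MIN, freq))

freq = 1980.0
for tbt in (18.0, 22.0, 31.0, 26.0):      # measured TBT samples, SLO = 30 ms
    freq = next_frequency(freq, tbt, slo_ms=30.0)
    print(f"TBT {tbt:.0f} ms -> {freq:.0f} MHz")
```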
arXiv Detail & Related papers (2025-09-05T05:58:16Z)
- HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling [19.154782641360253]
Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs.
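One common heuristic for scheduling under heterogeneous SLOs is least-slack-first: order requests by deadline minus estimated remaining service time. The sketch below shows that generic heuristic with made-up values, not HyperFlexis's actual joint scheduling-and-scaling algorithm.

```python
def slack(now: float, deadline: float, est_service: float) -> float:
    """Remaining slack; smaller means more urgent."""
    return deadline - now - est_service

# (name, SLO deadline in s, estimated service time in s) -- illustrative values
reqs = [("chat", 0.5, 0.2), ("batch", 5.0, 1.0), ("rag", 1.0, 0.6)]
order = sorted(reqs, key=lambda r: slack(0.0, r[1], r[2]))
print([name for name, _, _ in order])   # least slack scheduled first
```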
arXiv Detail & Related papers (2025-08-21T18:40:20Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
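A dual-system reasoner can be caricatured as a metacognitive dispatch: answer directly when a confidence estimate is high, and fall back to a deliberate reasoning path otherwise. The threshold and the confidence oracle below are purely hypothetical, not Pangu Embedded's mechanism.

```python
from typing import Callable

def answer(query: str, confidence: Callable[[str], float],
           threshold: float = 0.8) -> str:
    if confidence(query) >= threshold:
        return f"[fast path] {query}"          # System-1-style direct response
    return f"[slow path] deliberate({query})"  # System-2-style reasoning

print(answer("2 + 2 = ?", lambda q: 0.95))
print(answer("multi-step proof", lambda q: 0.10))
```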
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT)-assisted federated learning (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z)
- ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving [61.35068981176018]
ConServe is a large language model (LLM) serving system that achieves high throughput and strong online latency guarantees. We show that ConServe delivers an average of 2.2× higher throughput and reduces online serving tail latency by 2.9× on average compared to state-of-the-art systems.
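GPU harvesting can be pictured as a strict-priority loop: offline batch work runs only in the slack left by online requests and is preempted as soon as an online request arrives. This toy loop is a mental model under that assumption, not ConServe's actual mechanism.

```python
from collections import deque

def co_serve(online: "deque[str]", offline: "deque[str]", steps: int) -> None:
    for t in range(steps):
        if online:                        # online traffic always preempts
            print(f"t={t}: serve online {online.popleft()}")
        elif offline:                     # harvest idle cycles for batch work
            print(f"t={t}: run offline {offline.popleft()}")
        else:
            print(f"t={t}: idle")

co_serve(deque(["q1"]), deque(["batch-a", "batch-b"]), steps=4)
```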
arXiv Detail & Related papers (2024-10-02T04:12:13Z)
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [23.431794605498084]
We propose LayerKV, a simple yet effective plug-in method that reduces TTFT without requiring additional hardware or compromising output performance.
LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory.
Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations demonstrate that LayerKV reduces TTFT latency by up to 69× and lowers SLO violation rates by 28.7%.
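Layer-wise KV management can be sketched as a pipelined offload plan: while one layer computes during prefill, the KV blocks of an earlier layer are already being moved off the GPU, bounding GPU-resident KV to a few layers' worth. The two-layer residency window below is an illustrative assumption, not LayerKV's tuned policy.

```python
def layerwise_offload_plan(num_layers: int, resident: int = 2):
    """Yield (compute_layer, offload_layer) pairs; None means nothing to move."""
    for layer in range(num_layers):
        done = layer - resident           # this layer's KV can leave the GPU
        yield layer, (done if done >= 0 else None)

for compute, offload in layerwise_offload_plan(6):
    print(f"compute layer {compute}, offload KV of layer {offload}")
```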
arXiv Detail & Related papers (2024-10-01T06:23:17Z)
- Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy.
To address this issue, designing new and efficient edge accelerators for LLM inference is crucial.
We propose SECDA-LLM, which utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv Detail & Related papers (2024-08-01T11:06:05Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)