Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving
- URL: http://arxiv.org/abs/2601.00397v1
- Date: Thu, 01 Jan 2026 17:19:58 GMT
- Title: Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving
- Authors: Amey Agrawal, Mayank Yadav, Sukrit Kumar, Anirudha Agrawal, Garv Ghai, Souradeep Bera, Elton Pinto, Sirish Gambhira, Mohammad Adain, Kasra Sohrab, Chus Antonanzas, Alexey Tumanov
- Abstract summary: Revati is a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. Revati achieves less than 5% prediction error across multiple models and configurations, while running 5-17x faster than real GPU execution.
- Score: 1.4573878379102423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system's control logic -- a burden that compounds as frameworks evolve. We present Revati, a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs time jumps -- fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, Revati achieves less than 5% prediction error across multiple models and parallelism configurations, while running 5-17x faster than real GPU execution.
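As a rough illustration of the mechanism the abstract describes -- skip the kernel body, fast-forward a virtual clock by a predicted duration, and coordinate those jumps across ranks at collective points -- here is a minimal Python sketch. It assumes a toy latency predictor; all names (`Worker`, `predict_kernel_ms`, `synchronize`) and constants are hypothetical and do not correspond to Revati's actual interfaces, which intercept the CUDA API underneath an unmodified serving framework.

```python
from dataclasses import dataclass


def predict_kernel_ms(kernel_name: str, batch_tokens: int) -> float:
    """Stand-in for a profiled/learned kernel-latency predictor (made-up numbers)."""
    base_ms = {"attention": 0.50, "mlp": 0.35, "allreduce": 0.20}[kernel_name]
    return base_ms * (1.0 + batch_tokens / 4096.0)


@dataclass
class Worker:
    """One emulated serving process, e.g. a tensor-parallel rank."""
    rank: int
    virtual_ms: float = 0.0  # virtual clock instead of wall-clock GPU time

    def launch(self, kernel_name: str, batch_tokens: int) -> None:
        # Where a real run would launch a CUDA kernel, the emulator skips the
        # kernel body and fast-forwards virtual time by the predicted duration.
        self.virtual_ms += predict_kernel_ms(kernel_name, batch_tokens)


def synchronize(workers: list[Worker]) -> float:
    """Collective point (e.g. an all-reduce): no rank may observe the result
    before the slowest producer, so every virtual clock jumps to the maximum."""
    t = max(w.virtual_ms for w in workers)
    for w in workers:
        w.virtual_ms = t
    return t


if __name__ == "__main__":
    ranks = [Worker(rank=r) for r in range(4)]      # pretend TP=4
    for _ in range(100):                            # 100 decode iterations
        for w in ranks:
            w.launch("attention", batch_tokens=8192)
            w.launch("mlp", batch_tokens=8192)
            w.launch("allreduce", batch_tokens=8192)
        synchronize(ranks)                          # coordinated time jump
    print(f"predicted end-to-end latency: {ranks[0].virtual_ms:.1f} ms")
```

The max-of-clocks rule at collective points is just one simple causality-preserving policy; the paper's coordination protocol handles this across real distributed processes rather than in-process objects.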
Related papers
- Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
- An LLVM-Based Optimization Pipeline for SPDZ [0.0]
We implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol. Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR. Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler.
arXiv Detail & Related papers (2025-12-11T20:53:35Z)
- Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution [1.3356260369011272]
We propose a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies, achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers.
arXiv Detail & Related papers (2025-10-24T01:41:43Z)
- Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
- KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation [7.204881999658682]
Key-Value cache is used to store intermediate activations for large language models. The memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution. We introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations, from which the GPU recomputes the corresponding KV cache entries. KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. (A rough sketch of this transfer-versus-recompute trade-off appears after this list.)
arXiv Detail & Related papers (2024-11-26T04:03:14Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification [2.9271819018953162]
In cloud computing clusters, sharing a GPU's compute capacity across multiple tasks is in high demand.
Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU.
We present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-Kernel Idle Time.
arXiv Detail & Related papers (2023-11-17T07:25:18Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast, untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Optimizing Data Collection in Deep Reinforcement Learning [4.9709347068704455]
GPU vectorization can achieve up to 1024x speedup over commonly used CPU simulators.
We show that simulator kernel fusion speedups with a simple simulator are 11.3x and increase by up to 1024x as simulator complexity increases in terms of memory bandwidth requirements.
arXiv Detail & Related papers (2022-07-15T20:22:31Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization [5.02836935036198]
We propose RTGPU, which can schedule the execution of multiple GPU applications in real-time to meet hard deadlines.
Our approach provides superior schedulability compared with previous work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications.
arXiv Detail & Related papers (2021-01-25T22:34:06Z)
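For the KVPR entry above, the following back-of-envelope sketch illustrates the kind of transfer-versus-recompute balance its abstract alludes to. It is not KVPR's actual scheduler: the PCIe bandwidth, GPU throughput, FLOP model, and all function names are assumptions made for illustration only.

```python
# Toy model: should an offloaded KV-cache segment be copied over PCIe, or should
# the GPU rebuild it from (smaller) transferred activations? Constants are assumed.

PCIE_GBPS = 25.0          # assumed effective host-to-device bandwidth
GPU_TFLOPS = 150.0        # assumed usable throughput for the K/V projections
DTYPE_BYTES = 2           # fp16


def kv_bytes(tokens: int, hidden: int, layers: int) -> float:
    """Bytes of cached keys+values for `tokens` tokens."""
    return 2 * tokens * hidden * layers * DTYPE_BYTES


def transfer_ms(nbytes: float) -> float:
    return nbytes / (PCIE_GBPS * 1e9) * 1e3


def recompute_ms(tokens: int, hidden: int, layers: int) -> float:
    # Rebuilding K and V needs two projections of ~2*hidden^2 FLOPs per token per layer.
    flops = tokens * layers * 2 * (2 * hidden * hidden)
    return flops / (GPU_TFLOPS * 1e12) * 1e3


def best_split(total_tokens: int, hidden: int, layers: int) -> int:
    """Pick how many tokens' K/V to recompute on the GPU versus transfer directly,
    balancing the two (roughly overlapping) paths."""
    best_tokens, best_cost = 0, float("inf")
    for recomp in range(0, total_tokens + 1, 256):
        # Path 1: copy the remaining K/V entries over PCIe.
        xfer = transfer_ms(kv_bytes(total_tokens - recomp, hidden, layers))
        # Path 2: copy the smaller per-layer activations, then recompute K/V.
        acts = recomp * hidden * layers * DTYPE_BYTES
        comp = transfer_ms(acts) + recompute_ms(recomp, hidden, layers)
        cost = max(xfer, comp)  # the paths overlap, so the slower one dominates
        if cost < best_cost:
            best_tokens, best_cost = recomp, cost
    return best_tokens


if __name__ == "__main__":
    t = best_split(total_tokens=32_768, hidden=4096, layers=32)
    print(f"recompute K/V for {t} tokens on the GPU; transfer the rest over PCIe")
```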
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or their information and is not responsible for any consequences of their use.