LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
- URL: http://arxiv.org/abs/2601.09258v2
- Date: Tue, 20 Jan 2026 07:29:49 GMT
- Title: LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
- Authors: Yin Du, Jiayi Ren, Xiayu Sun, Tianyao Zhou, Haizhu Zhou, Ruiyan Ma, Danyang Zhang,
- Abstract summary: We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down inference latency across the pipeline, proactively alert on inference anomalies, and guarantee adherence to SLOs without requiring code modifications or service restarts. We conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism's capability.
- Score: 1.280379756275477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures combined with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down the inference latency across pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism's capability. Furthermore, we introduce the first LLM anomaly simulation toolkit to facilitate future research in robust and predictable inference systems.
Related papers
- Contextual and Seasonal LSTMs for Time Series Anomaly Detection [49.50689313712684]
We propose a novel prediction-based framework named Contextual and Seasonal LSTMs (CS-LSTMs). CS-LSTMs are built upon a noise decomposition strategy and jointly leverage contextual dependencies and seasonal patterns. They consistently outperform state-of-the-art methods, highlighting their effectiveness and practical value in robust time series anomaly detection.
arXiv Detail & Related papers (2026-02-10T11:46:15Z)
- Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z)
- HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that can boost distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z)
- Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a hypothesize-then-verify paradigm. Preliminary experiments on AIOps 2022 demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-06T05:58:25Z)
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- Learning Unified System Representations for Microservice Tail Latency Prediction [8.532290784939967]
Microservice architectures have become the de facto standard for building scalable cloud-native applications. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise. We propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features.
arXiv Detail & Related papers (2025-08-03T07:46:23Z)
- Towards Latency-Aware 3D Streaming Perception for Autonomous Driving [25.879279738510398]
We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a latency-aware 3D Streaming Perception framework. Our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80% of its offline evaluation.
arXiv Detail & Related papers (2025-04-27T05:49:52Z)
- SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model [52.47816604709358]
Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. Vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for anomaly detection. We propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector.
arXiv Detail & Related papers (2025-04-14T15:30:03Z)
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location [6.727166537196941]
Large language models (LLMs) have facilitated a wide range of applications with distinct service-level objectives (SLOs). The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads.
arXiv Detail & Related papers (2025-01-15T16:32:27Z)
- Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z)
- RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models [12.947537874888717]
Varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency.
We present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs.
We show that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
arXiv Detail & Related papers (2023-09-12T22:22:10Z)
- Neural Laplace Control for Continuous-time Delayed Systems [76.81202657759222]
We propose a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner.
We show experimentally that, on continuous-time delayed environments, it achieves near-expert policy performance.
arXiv Detail & Related papers (2023-02-24T12:40:28Z)
- An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things [19.277349546331557]
Time Sensitive Network (TSN) has recently been studied as a way to realize low-latency communication via deterministic scheduling.
A non-collision theory based deterministic scheduling (NDS) method is proposed to achieve ultra-low latency communication for time-sensitive flows.
Experimental results demonstrate that NDS/DQS can well support deterministic ultra-low latency services and guarantee efficient bandwidth utilization.
arXiv Detail & Related papers (2022-07-17T16:52:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.