Related papers: Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

URL: http://arxiv.org/abs/2511.02062v1
Date: Mon, 03 Nov 2025 20:59:27 GMT
Title: Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
Authors: Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman,
Abstract summary: Growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents.<n>Existing ML serving platforms use to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach.<n>For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate.
Score: 5.853608336265818
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into a end-user applications and deployed as agents. Our central premise is that these latter cases will bring service level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.

Related papers

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency.<n>Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z)
IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning [54.21689544323704]
Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge.<n>Unlike real-time conversational assistants, DR is computationally expensive and time-consuming.<n>We propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research.
arXiv Detail & Related papers (2026-02-03T12:43:09Z)
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving [6.505016440664893]
AugServe is an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented large language models (LLMs)<n> Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing fluctuating time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
arXiv Detail & Related papers (2025-12-03T17:49:38Z)
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training.<n>We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z)
Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks.<n>Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors.<n>We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z)
PolyServe: Efficient Multi-SLO Serving at Scale [6.147741784378271]
PolyServe is a novel multi-SLO scheduling policy at scale that maintains high SLO attainment while maximizing throughput.<n> PolyServe achieves 1.23x goodput gain compared to existing policies, achieving up to 92.5% of optimal goodput.
arXiv Detail & Related papers (2025-07-17T05:54:42Z)
Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads.<n>Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3$times$ achieves and up to 10.3$times$ SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z)
AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications [8.964981700274059]
We propose AccelGen, a high- throughput inference serving system with heterogeneous SLO guarantees for diverse applications.<n>Trace real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-17T21:47:43Z)
Hierarchical Autoscaling for Large Language Model Serving with Chiron [2.767894999702707]
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers.<n>Previous autoscalers for LLM serving do not consider request SLOs leading to unnecessary scaling and resource under-utilization.<n>We introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs.
arXiv Detail & Related papers (2025-01-14T12:57:40Z)
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving [61.35068981176018]
ConServe is a large language model (LLM) serving system that achieves high throughput and strong online latency guarantees.<n>We show that ConServe delivers an average of 2.2$times$ higher throughput and reduces online serving tail latency by 2.9$times$ on average compared to state-of-the-art systems.
arXiv Detail & Related papers (2024-10-02T04:12:13Z)
Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER)<n>LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.<n>LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference [23.49242865222089]
This paper introduces DeepSpeed-FastGen, a system that delivers up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for large language models.
arXiv Detail & Related papers (2024-01-09T06:49:40Z)
On the Role of Server Momentum in Federated Learning [85.54616432098706]
We propose a general framework for server momentum, that (a) covers a large class of momentum schemes that are unexplored in federated learning (FL) We provide rigorous convergence analysis for the proposed framework.
arXiv Detail & Related papers (2023-12-19T23:56:49Z)
Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation. We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.