Related papers: Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

URL: http://arxiv.org/abs/2509.02718v2
Date: Mon, 20 Oct 2025 22:47:09 GMT
Title: Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
Authors: Fangzhou Wu, Sandeep Silwal,
Abstract summary: LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. Existing works primarily focus on offline scenarios and struggle to adapt to online settings. We introduce the first training-free algorithm for online routing scenarios.
Score: 10.746325451673274
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Increasing demand for Large Language Model (LLM) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in overall performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.
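Illustrative sketch (not the authors' PORT implementation): the abstract describes two ingredients, nearest-neighbor estimation of query features and a routing rule learned once and then reused under token budgets. The Python below shows one way such a pipeline could be wired together. Everything in it is an assumption made for illustration: the function names `knn_estimate` and `route`, the use of exact search in place of approximate nearest neighbor search, and the per-model `prices` vector, which is simply passed in and stands in for whatever the one-time optimization would produce.

```python
import numpy as np


def knn_estimate(query, hist_emb, hist_values, k=3):
    """Estimate per-model values for `query` as the mean over its k nearest
    historical queries. Exact search is used here as a stand-in for an
    approximate nearest neighbor index (assumption for illustration)."""
    dists = np.linalg.norm(hist_emb - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return hist_values[nearest].mean(axis=0)


def route(est_quality, est_cost, prices, remaining):
    """Pick the model with the highest price-adjusted utility among models
    whose remaining budget covers the estimated cost. Returns -1 when no
    model can afford the query. `prices` is a hypothetical per-model weight
    on cost, assumed to be fixed before serving starts."""
    utility = est_quality - prices * est_cost
    utility = np.where(est_cost <= remaining, utility, -np.inf)
    best = int(np.argmax(utility))
    return best if np.isfinite(utility[best]) else -1


# Toy history: 4 past queries, 2 candidate models (all numbers invented).
hist_emb = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
hist_quality = np.array([[0.9, 0.5], [0.8, 0.4], [0.2, 0.9], [0.3, 1.0]])
hist_cost = np.array([[5.0, 1.0], [5.0, 1.0], [5.0, 1.0], [5.0, 1.0]])
prices = np.array([0.01, 0.01])

query = np.array([0.0, 0.2])
est_q = knn_estimate(query, hist_emb, hist_quality, k=2)
est_c = knn_estimate(query, hist_emb, hist_cost, k=2)
choice = route(est_q, est_c, prices, remaining=np.array([10.0, 10.0]))
```

In this toy setup the query sits next to the first two historical queries, so the first model is estimated to be stronger and is chosen while its budget lasts. Once that budget falls below the estimated cost, the rule falls back to the second model.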

Related papers

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning [104.63494870852894]
We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs.
arXiv Detail & Related papers (2025-10-09T16:52:01Z)
One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection [3.872690949369412]
Large language models (LLMs) with varying computational costs and performance profiles present a critical challenge for scalable, cost-effective deployment in real-world applications. We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings. By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers.
arXiv Detail & Related papers (2025-09-11T18:29:09Z)
How to Train Your LLM Web Agent: A Statistical Diagnosis [102.04125085041473]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context [39.19789380714972]
Large language models excel at many tasks but often incur high inference costs during deployment. We propose an extremely simple yet effective routing framework for KG-RAG that efficiently balances performance and cost in a plug-and-play manner.
arXiv Detail & Related papers (2025-05-28T14:45:56Z)
Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency. We introduce OmniRouter, a controllable routing framework for multi-LLM serving. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z)
Universal Model Routing for Efficient LLM Inference [69.86195589350264]
Model routing is a technique for reducing the inference cost of large language models (LLMs). We propose UniRoute, a new approach to the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound.
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems [1.430963201405577]
Large Language Model (LLM)-based systems are usually designed with a single, general-purpose LLM to handle all user queries. These systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models.
arXiv Detail & Related papers (2025-02-01T12:08:38Z)
PickLLM: Context-Aware RL-Assisted Large Language Model Routing [0.5325390073522079]
PickLLM is a lightweight framework that relies on Reinforcement Learning (RL) to route on-the-fly queries to available models. We demonstrate the speed of convergence for different learning rates and improvement in hard metrics such as cost per querying session and overall response latency.
arXiv Detail & Related papers (2024-12-12T06:27:12Z)
Optimizing LLM Queries in Relational Data Analytics Workloads [50.95919232839785]
Batch data analytics is a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. We propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads.
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
Reinforcement Learning from Human Feedback with Active Queries [59.855433734053555]
Current reinforcement learning approaches often require a large amount of human-labelled preference data. We propose query-efficient RLHF methods inspired by the success of active learning. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
arXiv Detail & Related papers (2024-02-14T18:58:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.