Universal Model Routing for Efficient LLM Inference
- URL: http://arxiv.org/abs/2502.08773v2
- Date: Tue, 22 Jul 2025 15:27:33 GMT
- Title: Universal Model Routing for Efficient LLM Inference
- Authors: Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar,
- Abstract summary: Model routing is a technique for reducing the inference cost of large language models (LLMs)<n>We propose UniRoute, a new approach to the problem of dynamic routing, where new, previously unobserved LLMs are available at test time.<n>We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound.
- Score: 69.86195589350264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
Related papers
- Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning [12.878608250420832]
We present textbf generalization-R1, a reinforcement learning framework that formulates multi-LLM routing and aggregation as a sequential decision process.<n>To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost.
arXiv Detail & Related papers (2025-06-10T17:56:45Z) - RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing [31.446419903916425]
Radial is a novel framework for large language models routing.<n>It uses a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship.<n>It significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios.
arXiv Detail & Related papers (2025-06-04T12:16:41Z) - Query Routing for Retrieval-Augmented Language Models [38.05904245087491]
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks.<n>We observe that external documents dynamically affect LLM's ability to answer queries, while existing routing methods exhibit suboptimal performance in RAG scenarios.<n>We propose RAG, a parametric RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts.
arXiv Detail & Related papers (2025-05-29T03:44:56Z) - RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs [44.273794030829556]
This paper introduces RouterEval, a benchmark for router research that includes over 200,000,000 performance records for 12 popular LLM evaluations.
Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement.
arXiv Detail & Related papers (2025-03-08T04:07:07Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.<n>LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.<n>Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.
One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM.
We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z) - PickLLM: Context-Aware RL-Assisted Large Language Model Routing [0.5325390073522079]
PickLLM is a lightweight framework that relies on Reinforcement Learning (RL) to route on-the-fly queries to available models.
We demonstrate the speed of convergence for different learning rates and improvement in hard metrics such as cost per querying session and overall response latency.
arXiv Detail & Related papers (2024-12-12T06:27:12Z) - Strada-LLM: Graph LLM for traffic prediction [62.2015839597764]
A considerable challenge in traffic prediction lies in handling the diverse data distributions caused by vastly different traffic conditions.<n>We propose a graph-aware LLM for traffic prediction that considers proximal traffic information.<n>We adopt a lightweight approach for efficient domain adaptation when facing new data distributions in few-shot fashion.
arXiv Detail & Related papers (2024-10-28T09:19:29Z) - GraphRouter: A Graph-based Router for LLM Selections [13.463815950807874]
Graph is a graph-based approach for the contextual and adaptive selection of Large Language Models.
We show that Graph substantially surpasses existing routers, delivering a minimum performance improvement of 12.3%.
This work achieves a graph-based approach for the contextual and adaptive selection of LLMs, offering insights for real-world applications.
arXiv Detail & Related papers (2024-10-04T18:02:48Z) - RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models [24.113223576205932]
We show that query-based Router by Dual Contrastive learning (DC) is effective in assembling large language models (LLMs)
DC is effective in assembling LLMs and largely outperforms individual top-performing LLMs as well as existing routing methods on both in-distribution and out-of-distribution tasks.
arXiv Detail & Related papers (2024-09-30T02:31:40Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z) - Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z) - Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
Decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs.
We propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z) - RouterBench: A Benchmark for Multi-LLM Routing System [25.515453832224804]
No single model can optimally address all tasks and applications, particularly when balancing performance with cost.
This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs.
We present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems.
arXiv Detail & Related papers (2024-03-18T17:59:04Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM
Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.