Leveraging Uncertainty Estimation for Efficient LLM Routing
- URL: http://arxiv.org/abs/2502.11021v1
- Date: Sun, 16 Feb 2025 07:08:47 GMT
- Title: Leveraging Uncertainty Estimation for Efficient LLM Routing
- Authors: Tuo Zhang, Asal Mehradfar, Dimitrios Dimitriadis, Salman Avestimehr
- Abstract summary: Deploying large language models (LLMs) in edge-cloud environments requires an efficient routing strategy to balance cost and response quality. Traditional approaches prioritize either human-preference data or accuracy metrics from benchmark datasets as routing criteria. We propose the Confidence-Driven LLM Router, a novel framework that leverages uncertainty estimation to optimize routing decisions.
- Score: 20.67188754368684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LLMs) in edge-cloud environments requires an efficient routing strategy to balance cost and response quality. Traditional approaches prioritize either human-preference data or accuracy metrics from benchmark datasets as routing criteria, but these methods suffer from rigidity and subjectivity. Moreover, existing routing frameworks primarily focus on accuracy and cost, neglecting response quality from a human preference perspective. In this work, we propose the Confidence-Driven LLM Router, a novel framework that leverages uncertainty estimation to optimize routing decisions. To comprehensively assess routing performance, we evaluate both system cost efficiency and response quality. In particular, we introduce the novel use of LLM-as-a-Judge to simulate human rating preferences, providing the first systematic assessment of response quality across different routing strategies. Extensive experiments on MT-Bench, GSM8K, and MMLU demonstrate that our approach outperforms state-of-the-art routing methods, achieving superior response quality while maintaining cost efficiency.
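The abstract does not say which uncertainty estimator the Confidence-Driven LLM Router uses, so the following is only a minimal sketch of the general idea of confidence-based edge-cloud routing, not the authors' method. It assumes the edge model exposes per-token log-probabilities and uses their geometric mean as a stand-in confidence score; the names `sequence_confidence` and `route`, the labels `"edge"` and `"cloud"`, and the threshold of 0.6 are all hypothetical.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, in (0, 1]; higher means more confident.

    Assumption: mean token log-probability is used here only as an
    illustrative uncertainty proxy, not as the estimator from the paper.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(token_logprobs: Sequence[float], threshold: float = 0.6) -> str:
    """Keep the query on the small edge model when it looks confident,
    otherwise escalate to the larger cloud model (hypothetical labels)."""
    if sequence_confidence(token_logprobs) >= threshold:
        return "edge"
    return "cloud"


if __name__ == "__main__":
    confident_answer = [-0.05, -0.10, -0.02]   # high token probabilities
    uncertain_answer = [-1.20, -0.90, -2.00]   # low token probabilities
    print(route(confident_answer))
    print(route(uncertain_answer))
```

Raising the threshold sends more queries to the cloud model, trading cost for response quality; in practice the threshold would be tuned on held-out data.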
Related papers
- Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications. We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z)
- Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing [15.724480880994259]
A large language model (LLM) router selects the most appropriate model from a pool of candidates for each query. Preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We develop an integrative causal router training framework that corrects preference-data bias, addresses imbalances between the two data sources, and improves routing robustness and efficiency.
arXiv Detail & Related papers (2025-09-29T21:44:00Z)
- One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection [3.872690949369412]
Large language models (LLMs) with varying computational costs and performance profiles present a critical challenge for scalable, cost-effective deployment in real-world applications. We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings. By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers.
arXiv Detail & Related papers (2025-09-11T18:29:09Z)
- Federated In-Context Learning: Iterative Refinement for Improved Answer Quality [62.72381208029899]
In-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input. We propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters.
arXiv Detail & Related papers (2025-06-09T05:33:28Z)
- Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively [13.40488551654639]
We introduce the 3E Criteria to assess the cost-effectiveness of search strategies. We propose the Speculative Reward Model (SRM), a plug-and-play framework that integrates seamlessly with existing search strategies. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.
arXiv Detail & Related papers (2025-05-31T05:32:12Z)
- Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models to learn to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
- How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.
We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z)
- MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and incurs high response latency.
We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z)
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based small language model (SLM) routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
- CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
CITER enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
- Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems [1.430963201405577]
Large Language Model (LLM)-based systems are usually designed with a single, general-purpose LLM to handle all user queries. These systems may be inefficient as different queries may require different levels of reasoning, domain knowledge or pre-processing. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models.
arXiv Detail & Related papers (2025-02-01T12:08:38Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains against decoding with the target model only, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
- A Unified Approach to Routing and Cascading for LLMs [5.653106385738822]
Large language models (LLMs) embedded in various agentic systems have increased the potential of model selection strategies to improve the cost-performance tradeoff.
Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found.
We derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy.
We propose cascade routing, a unified framework that integrates routing and cascading into a theoretically optimal strategy.
arXiv Detail & Related papers (2024-10-14T10:00:49Z)
- Optimizing Inventory Routing: A Decision-Focused Learning Approach using Neural Networks [0.0]
We formulate and propose a decision-focused learning-based approach to solving real-world IRPs.
This approach directly integrates inventory prediction and routing optimization within an end-to-end system potentially ensuring a robust supply chain strategy.
arXiv Detail & Related papers (2023-11-02T04:05:28Z)
- Routing Arena: A Benchmark Suite for Neural Routing Solvers [8.158770689562672]
We propose a benchmark suite for Routing Problems that provides a seamless integration of consistent evaluation and the provision of baselines and benchmarks prevalent in the Machine Learning- and Operations Research field.
A comprehensive first experimental evaluation demonstrates that the most recent Operations Research solvers generate state-of-the-art results in terms of solution quality and runtime efficiency when it comes to the vehicle routing problem.
arXiv Detail & Related papers (2023-10-06T10:24:33Z)
- Fidelity-Guarantee Entanglement Routing in Quantum Networks [64.49733801962198]
Entanglement routing establishes remote entanglement connection between two arbitrary nodes.
We propose purification-enabled entanglement routing designs to provide fidelity guarantee for multiple Source-Destination (SD) pairs in quantum networks.
arXiv Detail & Related papers (2021-11-15T14:07:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.