Routing, Cascades, and User Choice for LLMs
- URL: http://arxiv.org/abs/2602.09902v1
- Date: Tue, 10 Feb 2026 15:39:31 GMT
- Title: Routing, Cascades, and User Choice for LLMs
- Authors: Rafid Mahmood
- Abstract summary: We study the effect of LLM routing with respect to user behavior. We propose a game between an LLM provider with two models and a user who can re-prompt or abandon tasks. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To mitigate the trade-offs between performance and costs, LLM providers route user tasks to different models based on task difficulty and latency. We study the effect of LLM routing with respect to user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon tasks if the routed model cannot solve them. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user best response and simplifying the provider problem. We observe that in nearly all cases, the optimal routing policy involves a static policy with no cascading that depends on the expected utility of the models to the user. Furthermore, we reveal a misalignment gap between the provider-optimal and user-preferred routes when the user's and provider's rankings of the models with respect to utility and cost differ. Finally, we demonstrate conditions for extreme misalignment where providers are incentivized to throttle the latency of the models to minimize their costs, consequently depressing user utility. The results yield simple threshold rules for single-provider, single-user interactions and clarify when routing, cascading, and throttling help or harm.
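The abstract's central finding, that the optimal policy is typically static and thresholds on the models' expected utility with no cascading, can be illustrated with a toy sketch. All function names, probabilities, delays, and the threshold below are illustrative assumptions, not the paper's actual formulation:

```python
# Toy sketch of a static threshold routing rule (illustrative only;
# the paper's actual utility and delay model is not reproduced here).

def route(task_difficulty: float, threshold: float = 0.6) -> str:
    """Send hard tasks to the 'reasoning' model, easy ones to 'standard'.

    A static rule: the decision depends only on the task, never on
    first trying one model and cascading to the other.
    """
    return "reasoning" if task_difficulty > threshold else "standard"

def user_utility(model: str, task_difficulty: float) -> float:
    """Assumed utility = solve probability minus delay (made-up numbers)."""
    if model == "reasoning":
        solve_prob, delay = 0.95, 0.4  # slower, but nearly always solves
    else:
        solve_prob = max(0.0, 1.0 - task_difficulty)  # fails on hard tasks
        delay = 0.1                                   # fast
    return solve_prob - delay

for difficulty in (0.2, 0.5, 0.8):
    model = route(difficulty)
    print(f"difficulty={difficulty}: {model}, "
          f"utility={user_utility(model, difficulty):.2f}")
```

Under these assumed numbers, the cutoff sits where the standard model's falling solve probability makes the reasoning model's extra delay worthwhile, mirroring the simple threshold rules the abstract describes.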
Related papers
- When Routing Collapses: On the Degenerate Convergence of LLM Routers [46.01380774114097]
As the user's cost budget increases, routers systematically default to the most capable and most expensive model. We propose Equi, a decision-aware router that directly learns model rankings. On RouterBench, Equi reduces cost by about 17% at GPT-4-level performance compared to the strongest prior router.
arXiv Detail & Related papers (2026-02-03T12:51:55Z)
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query. It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z)
- Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs [51.79252689855809]
Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. We propose the Prompt-level User Migration Adapter (PUMA), a framework to efficiently migrate personalized prompts across incompatible models. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%.
arXiv Detail & Related papers (2026-01-17T12:30:31Z)
- Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs [69.2486294522259]
BaRP is a Bandit Routing-feedback with Preferences approach that trains under the same partial-feedback restriction as deployment. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt.
arXiv Detail & Related papers (2025-10-08T18:24:59Z)
- Adaptive LLM Routing under Budget Constraints [12.432635540782874]
Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. We propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback.
arXiv Detail & Related papers (2025-08-28T18:18:19Z)
- Cost-Aware Contrastive Routing for LLMs [57.30288453580456]
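The contextual-bandit framing used by the routing papers above can be sketched with a minimal epsilon-greedy router. This is a generic stand-in under assumed names, not the BaRP method or the budget-constrained algorithm from these papers:

```python
import random

# Minimal epsilon-greedy bandit router over two hypothetical models.
# A generic illustration of learning to route from bandit feedback;
# it is not any specific algorithm from the papers above.

class EpsilonGreedyRouter:
    def __init__(self, models, epsilon=0.1):
        self.models = list(models)
        self.epsilon = epsilon
        self.counts = {m: 0 for m in self.models}
        self.values = {m: 0.0 for m in self.models}  # running mean reward

    def select(self):
        """Explore a random model with probability epsilon, else exploit."""
        if random.random() < self.epsilon:
            return random.choice(self.models)
        return max(self.models, key=self.values.get)

    def update(self, model, reward):
        """Incorporate the observed reward for the routed model only
        (partial feedback: unrouted models reveal nothing)."""
        self.counts[model] += 1
        self.values[model] += (reward - self.values[model]) / self.counts[model]

router = EpsilonGreedyRouter(["standard", "reasoning"], epsilon=0.1)
choice = router.select()
router.update(choice, reward=1.0)  # e.g. the user accepted the answer
```

The key property this shares with the bandit-feedback setting is that only the chosen model's outcome is ever observed, so the router must balance exploring cheaper models against exploiting the current best estimate.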
We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space. CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%.
arXiv Detail & Related papers (2025-08-17T20:16:44Z)
- OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency. We introduce Omni, a controllable routing framework for multi-LLM serving. Experiments show that Omni achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z)
- Universal Model Routing for Efficient LLM Inference [69.86195589350264]
Model routing is a technique for reducing the inference cost of large language models (LLMs). We propose UniRoute, a new approach to the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound.
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
- MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) have recently shown potential for artificial general intelligence, but their usage is costly, with high response latency. We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z)
- iServe: An Intent-based Serving System for LLMs [0.34998703934432684]
iServe is an intent-based system for distributed Large Language Model (LLM) inference. Instead of manually selecting deployment configurations, developers simply specify their intent. iServe best meets user intents compared to state-of-the-art systems.
arXiv Detail & Related papers (2025-01-08T14:38:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.