IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
- URL: http://arxiv.org/abs/2509.06274v4
- Date: Thu, 09 Oct 2025 05:51:50 GMT
- Title: IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
- Authors: Aosong Feng, Balasubramaniam Srinivasan, Yun Zhou, Zhichao Xu, Kang Zhou, Sheng Guan, Yueyan Chen, Xian Wu, Ninad Kulkarni, Yi Zhang, Zhengyuan Shen, Dmitriy Bespalov, Soumya Smruti Mishra, Yifei Teng, Darren Yow-Bang Wang, Haibo Ding, Lin Lee Cheong,
- Abstract summary: textbfIngent textbfPrompt textbfRouting framework dynamically selects optimal models based on predicted response quality and user-specified tolerance levels.<n>IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family.
- Score: 19.658944117970137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR\, -- \,a quality-constrained \textbf{I}ntelligent \textbf{P}rompt \textbf{R}outing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/
Related papers
- EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments.<n> Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to $80%$ and latency by over $70%$.
arXiv Detail & Related papers (2026-01-06T04:06:46Z) - Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications.<n>We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z) - Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference [4.254924788681319]
This paper introduces a cost-aware Proof of Quality (PoQ) framework for decentralized large language model (LLM) inference.<n>The design combines ground truth token level F1, lightweight learned evaluators, and GPT based judgments within a unified evaluation pipeline.<n> Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high quality low cost inference models.
arXiv Detail & Related papers (2025-12-18T08:57:17Z) - MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs [17.827406818899536]
MoE-Prism is a model-system co-design that transforms rigid MoE models into elastic services.<n>Our evaluation shows that MoE-Prismprovides over 4 times more distinct, stable operating points than the baseline.<n>This allows an AI service to dynamically improve throughput by up to 19.9% under a strict budget or reduce latency by up to 10.36% under limited resources.
arXiv Detail & Related papers (2025-10-22T08:40:01Z) - Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs [69.2486294522259]
BaRP is a Bandit Routing-feedback with Preferences approach that trains under the same partial-feedback restriction as deployment.<n> Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt.
arXiv Detail & Related papers (2025-10-08T18:24:59Z) - Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing [15.724480880994259]
A large language model (LLM) router selects the most appropriate model from a pool of candidates for each query.<n> preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses.<n>We develop an integrative causal router training framework that corrects preference-data bias, address imbalances between two data sources, and improve routing robustness and efficiency.
arXiv Detail & Related papers (2025-09-29T21:44:00Z) - Cost-Aware Contrastive Routing for LLMs [57.30288453580456]
We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space.<n>CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%.
arXiv Detail & Related papers (2025-08-17T20:16:44Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces.<n>ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.<n>Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task.<n>We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1.<n>Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency.<n>We introduce Omni, a controllable routing framework for multi-LLM serving.<n>Experiments show that Omni achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z) - MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency.<n>We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)<n>RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.<n>RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.