Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning
- URL: http://arxiv.org/abs/2601.22323v1
- Date: Thu, 29 Jan 2026 21:09:36 GMT
- Title: Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning
- Authors: Qi Cao, Shuhao Zhang, Ruizhe Zhou, Ruiyi Zhang, Peijia Qin, Pengtao Xie
- Abstract summary: We propose SCOPE, a routing framework that goes beyond model selection by predicting each candidate model's cost and performance. SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names. It can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
- Score: 28.165465162107253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting each candidate model's cost and performance. Trained with reinforcement learning, SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
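Once a router predicts each model's accuracy and cost, the user-controlled trade-off described above reduces to maximizing a simple utility. The sketch below is a hypothetical illustration of that decision step, not SCOPE's actual implementation; the model names, predicted accuracies, and costs are invented for the example.

```python
# Hypothetical sketch of a predict-then-decide routing step: pick the
# model that maximizes predicted accuracy minus a user-weighted cost.
# All model names and numbers below are illustrative, not from the paper.

def route(predictions, alpha):
    """predictions: list of (model_name, predicted_accuracy, predicted_cost).
    alpha: user-set weight; alpha near 0 favors accuracy, large alpha
    favors cheap models. Returns the chosen model's name."""
    return max(predictions, key=lambda p: p[1] - alpha * p[2])[0]

candidates = [
    ("small-model", 0.62, 0.1),   # cheap but weaker
    ("medium-model", 0.78, 1.0),
    ("large-model", 0.85, 5.0),   # expensive but stronger
]

print(route(candidates, alpha=0.0))  # accuracy is the priority -> large-model
print(route(candidates, alpha=0.1))  # balanced -> medium-model
print(route(candidates, alpha=1.0))  # efficiency matters most -> small-model
```

Because the predictions, not the decision rule, carry the model-specific knowledge, adding a new model only requires new predictions; the same utility maximization applies unchanged, which mirrors the adaptability claim in the abstract.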
Related papers
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query. It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z) - Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning [20.41220110321494]
We propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning. STEER is a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger language models. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing.
arXiv Detail & Related papers (2025-11-09T02:33:08Z) - Optimizing Reasoning Efficiency through Prompt Difficulty Prediction [14.470330195517903]
Reasoning language models perform well on complex tasks but are costly to deploy due to their size and long reasoning traces. We propose a routing approach that assigns each problem to the smallest model likely to solve it, reducing compute without sacrificing accuracy.
arXiv Detail & Related papers (2025-11-05T19:14:53Z) - e1: Learning Adaptive Control of Reasoning Effort [88.51897900019485]
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. We propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens.
arXiv Detail & Related papers (2025-10-30T23:12:21Z) - RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs [51.88834210085435]
We present RADAR (Reasoning-Ability and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries. We conduct extensive experiments on 8 widely used reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art routing methods.
arXiv Detail & Related papers (2025-09-29T19:33:44Z) - SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model [12.929385845055137]
We show that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone. We propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes.
arXiv Detail & Related papers (2025-07-03T17:33:58Z) - BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute [25.740809143951815]
BEST-Route is a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
arXiv Detail & Related papers (2025-06-28T01:52:50Z) - SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models [63.63254955809224]
We propose a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance.
arXiv Detail & Related papers (2025-02-18T02:51:17Z) - Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing [20.445496441396028]
We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff. Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
arXiv Detail & Related papers (2024-05-24T16:20:04Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning, without sacrificing performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational costs, and our further exploration of extremely large-scale models shows that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.