Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
- URL: http://arxiv.org/abs/2511.06190v1
- Date: Sun, 09 Nov 2025 02:33:08 GMT
- Title: Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
- Authors: Sangmook Lee, Dohyung Kim, Hyukhun Koh, Nakyeong Yang, Kyomin Jung,
- Abstract summary: We propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning.<n>STEER is a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger language models.<n>Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing.
- Score: 20.41220110321494
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model's logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.
Related papers
- Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning [28.165465162107253]
We propose SCOPE, a routing framework that goes beyond model selection by predicting their cost and performance.<n>SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names.<n>It can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
arXiv Detail & Related papers (2026-01-29T21:09:36Z) - MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing [41.77627136743721]
In practical deployments, workloads span lightweight OCR to complex multimodal reasoning.<n>routing is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation.<n>We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models.
arXiv Detail & Related papers (2026-01-25T12:44:14Z) - One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking [7.856998585396422]
Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines.<n>We propose textbfmulti-task learning (MTL) as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly.
arXiv Detail & Related papers (2026-01-16T13:44:25Z) - xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning [104.63494870852894]
We present x, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models.<n>Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting.<n>Across diverse benchmarks, x achieves strong cost-performance trade-offs.
arXiv Detail & Related papers (2025-10-09T16:52:01Z) - Efficient LLM Collaboration via Planning [56.081879390960204]
Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks.<n>We demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost.
arXiv Detail & Related papers (2025-06-13T08:35:50Z) - OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model.<n>We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation.<n>We find that model merging offers a promising way for building improved MLLMs without requiring training data.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead [19.573553157421774]
Light is a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool.<n>Experiments demonstrate that Light matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy.<n>This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
arXiv Detail & Related papers (2025-05-22T04:46:04Z) - Cost-Optimal Grouped-Query Attention for Long-Context Modeling [45.981681856747365]
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models.<n>We analyze the relationship among context length, model size, GQA configuration, and model loss.<n>We propose a recipe for deriving cost-optimal GQA configurations.
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)<n>RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.<n>RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks.<n>This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.<n>We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.<n>We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.