Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
- URL: http://arxiv.org/abs/2602.22090v1
- Date: Wed, 25 Feb 2026 16:38:03 GMT
- Title: Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
- Authors: Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen
- Abstract summary: Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
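The routing strategy in the abstract can be sketched as a simple two-stage cascade. This is a minimal illustration, not the paper's exact method: the `small_model`/`large_model` callables, the mean-log-probability confidence proxy, and the 0.8 threshold are all assumptions made for the example.

```python
import math

def sequence_confidence(token_logprobs):
    # Geometric-mean token probability in (0, 1]: the exponential of
    # the mean log-probability of the generated tokens.
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route(query, small_model, large_model, threshold=0.8):
    # Keep the small model's answer when its confidence clears the
    # threshold; otherwise delegate the query to the large model.
    answer, logprobs = small_model(query)
    if sequence_confidence(logprobs) >= threshold:
        return answer, "small"
    answer, _ = large_model(query)
    return answer, "large"

# Stub models for illustration: each returns (answer, token_logprobs).
confident_small = lambda q: ("Paris", [-0.05, -0.02])
fallback_large = lambda q: ("Paris", [-0.01])

print(route("What is the capital of France?", confident_small, fallback_large))
# ('Paris', 'small'): exp(-0.035) ≈ 0.97 clears the 0.8 threshold
```

In a real deployment the confidence estimate would come from the serving API's per-token log-probabilities, and the threshold would be tuned on a held-out set to trade accuracy against the fraction of queries escalated to the larger model.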
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z) - Model Whisper: Steering Vectors Unlock Large Language Models' Potential in Test-time [6.741914038966904]
We introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the model's parameters entirely frozen. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
arXiv Detail & Related papers (2025-12-04T12:36:16Z) - Semantic Agreement Enables Efficient Open-Ended LLM Cascades [18.119677655287607]
Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary. We propose semantic agreement as a training-free signal for reliable deferral. We find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%.
arXiv Detail & Related papers (2025-09-26T03:51:28Z) - EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Our approach only leverages a small number of samples to search for the desired pruning policy. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [75.1351701045874]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs). We propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing [20.445496441396028]
We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff. Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
arXiv Detail & Related papers (2024-05-24T16:20:04Z) - BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks. Current large language models (LLMs) are insufficient for accurate estimations. We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z) - Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z) - AutoMix: Automatically Mixing Language Models [62.51238143437967]
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. We present AutoMix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM.
arXiv Detail & Related papers (2023-10-19T17:57:39Z) - Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is constructed, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.