Related papers: Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection

Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection

URL: http://arxiv.org/abs/2405.02134v1
Date: Fri, 3 May 2024 14:38:59 GMT
Title: Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection
Authors: Guillem Ramírez, Alexandra Birch, Ivan Titov,
Abstract summary: Decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs. We propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
Score: 80.63946798650653
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Researchers and practitioners operating on a limited budget face the cost-performance trade-off dilemma. The challenging decision often centers on whether to use a large LLM with better performance or a smaller one with reduced costs. This has motivated recent research in the optimisation of LLM calls. Either a cascading strategy is used, where a smaller LLM or both are called sequentially, or a routing strategy is used, where only one model is ever called. Both scenarios are dependent on a decision criterion which is typically implemented by an extra neural model. In this work, we propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. We compare our approach with both cascading and routing strategies using three different pairs of pre-trained small and large LLMs, on nine different tasks and against approaches that require an additional neural model. Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.

Related papers

Collaborative LLM Inference via Planning for Efficient Reasoning [50.04696654679751]
We propose a test-time collaboration framework in which a planner model first generates a plan, defined as a distilled and high-level abstraction of the problem.<n>Small and large models take turns acting as planner and reasoner, exchanging plans in a multi-round cascade to collaboratively solve complex tasks.<n>Our method achieves accuracy comparable to strong proprietary models alone, while significantly reducing reliance on paid inference.
arXiv Detail & Related papers (2025-06-13T08:35:50Z)
Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection [7.045509749924679]
Route-To-Reason (RTR) is a novel unified routing framework that dynamically allocates both LMs and reasoning strategies according to task difficulty under budget constraints.<n>RTR learns compressed representations of both expert models and reasoning strategies, enabling their joint and adaptive selection at inference time.
arXiv Detail & Related papers (2025-05-26T02:53:17Z)
LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead [19.573553157421774]
Light is a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool.<n>Experiments demonstrate that Light matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy.<n>This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
arXiv Detail & Related papers (2025-05-22T04:46:04Z)
Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing [3.090041654375235]
We present a novel framework that formulates the LLM selection process as a multi-armed bandit problem. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time. Our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms.
arXiv Detail & Related papers (2025-02-04T22:09:43Z)
PickLLM: Context-Aware RL-Assisted Large Language Model Routing [0.5325390073522079]
PickLLM is a lightweight framework that relies on Reinforcement Learning (RL) to route on-the-fly queries to available models. We demonstrate the speed of convergence for different learning rates and improvement in hard metrics such as cost per querying session and overall response latency.
arXiv Detail & Related papers (2024-12-12T06:27:12Z)
SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning [49.29200323760457]
Large Language Models (LLMs) can transfer their reasoning skills to smaller models. Smaller models are not expressive enough to fit the LLMs distribution on all strategies when distilled. This reliance on one strategy poses a challenge for smaller models when attempting to solve reasoning tasks that may be difficult with their preferred strategy.
arXiv Detail & Related papers (2024-10-24T09:29:18Z)
Efficient Reinforcement Learning with Large Language Model Priors [18.72288751305885]
Large language models (LLMs) have recently emerged as powerful general-purpose tools. We propose treating LLMs as prior action distributions and integrating them into RL frameworks. We show that incorporating LLM-based action priors significantly reduces exploration and complexity optimization.
arXiv Detail & Related papers (2024-10-10T13:54:11Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have gained increased popularity due to their remarkable success across various tasks. However, individual LLMs have limitations when applied to complex tasks because of such factors as training biases, model sizes, and the datasets used. We introduce SelectLLM, a novel algorithm that directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z)
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [46.959380978972206]
We study inference scaling laws and compute-optimal inference for large language models (LLMs) training. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets.
arXiv Detail & Related papers (2024-08-01T17:16:04Z)
Efficient Sequential Decision Making with Large Language Models [19.083642464977224]
This paper focuses on extending the success of large language models (LLMs) to sequential decision making. Existing efforts either (i) re-train or finetune LLMs for decision making, or (ii) design prompts for pretrained LLMs. We propose a new approach that leverages online model selection algorithms to efficiently incorporate LLMs agents into sequential decision making.
arXiv Detail & Related papers (2024-06-17T22:13:22Z)
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach [7.996010840316654]
We propose an uncertainty reduction framework using Large Language Models (LLMs) to improve entity resolution results. LLMs capitalize on their advanced linguistic capabilities and a pay-as-you-go'' model that provides significant advantages to those without extensive data science expertise. We show that our method is efficient and effective, offering promising applications in real-world tasks.
arXiv Detail & Related papers (2024-01-07T09:06:58Z)
Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize [57.22851616806617]
We show that our method achieves state-of-the-art results in four domains from the literature. Our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.
arXiv Detail & Related papers (2023-05-26T11:17:45Z)
Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning. The proposed algorithm employs an em early-settled reference update rule, with the aid of two Q-learning sequences. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.