Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- URL: http://arxiv.org/abs/2404.14618v1
- Date: Mon, 22 Apr 2024 23:06:42 GMT
- Title: Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- Authors: Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah
- Abstract summary: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach that combines the respective strengths of large and small models to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
- Score: 53.748685766139715
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. Therefore, in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
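To make the routing scheme concrete, here is a minimal Python sketch of a quality-aware router in the spirit of the abstract. The difficulty predictor, model handles, and the `quality_level` knob are illustrative stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HybridRouter:
    """Route each query to a small or large LLM based on predicted difficulty."""
    predict_difficulty: Callable[[str], float]  # query -> score in [0, 1]
    small_model: Callable[[str], str]           # cheap, edge-deployable model
    large_model: Callable[[str], str]           # expensive cloud model

    def answer(self, query: str, quality_level: float = 0.5) -> str:
        # quality_level in [0, 1] is the test-time knob: raising it lowers
        # the difficulty threshold, so more queries go to the large model.
        threshold = 1.0 - quality_level
        if self.predict_difficulty(query) <= threshold:
            return self.small_model(query)
        return self.large_model(query)
```

Sweeping `quality_level` at test time traces out the cost-quality tradeoff; the paper's reported operating point makes up to 40% fewer calls to the large model with no drop in response quality.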
Related papers
- Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance [0.0]
Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks.
Smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts.
This paper presents a novel hybrid inference approach that leverages the strengths of both model types.
arXiv Detail & Related papers (2024-09-15T15:12:45Z)
- RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.
We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.
We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
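As a rough illustration of the preference-trained router idea above (not RouteLLM's actual code), one could fit a simple classifier on labels indicating when the stronger model's response was preferred, assuming query embeddings from any encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_preference_router(embeddings: np.ndarray, strong_wins: np.ndarray) -> LogisticRegression:
    """Fit a router on preference data.

    embeddings:  (n, d) query embeddings.
    strong_wins: (n,) binary labels, 1 when the stronger model's
                 response was preferred for that query.
    """
    router = LogisticRegression(max_iter=1000)
    router.fit(embeddings, strong_wins)
    return router

def route(router: LogisticRegression, embedding: np.ndarray, threshold: float = 0.5) -> str:
    # Send the query to the stronger model only when the predicted win
    # probability clears the threshold; lowering it spends more on quality.
    p_strong = router.predict_proba(embedding.reshape(1, -1))[0, 1]
    return "strong" if p_strong > threshold else "weak"
```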
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt$+$ (DS$+$), a general paradigm for the collaboration of small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS$+$ achieves 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
- Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced cost.
We propose a simpler solution: we use only the uncertainty of the small LLM's generations as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z)
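A minimal sketch of the uncertainty criterion from the entry above, assuming the small model exposes per-token log-probabilities (the interfaces here are illustrative, not the paper's API):

```python
def mean_token_logprob(logprobs: list[float]) -> float:
    """Average per-token log-probability: one simple uncertainty proxy."""
    return sum(logprobs) / max(len(logprobs), 1)

def two_tier_answer(query: str, small_llm, large_llm, tau: float = -1.0) -> str:
    """Answer with the small LLM; escalate to the large LLM when uncertain.

    small_llm(query) is assumed to return (text, per_token_logprobs).
    tau is the uncertainty threshold: generations whose mean token
    log-probability falls below tau are handed to the large model.
    """
    text, logprobs = small_llm(query)
    if mean_token_logprob(logprobs) >= tau:
        return text           # confident: keep the cheap answer
    return large_llm(query)   # uncertain: pay for the large model
```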
- Routoo: Learning to Route to Large Language Models Effectively [6.322844087292882]
Routoo is an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency.
Routoo comprises two key components: a performance predictor and a cost-aware selector.
Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third.
arXiv Detail & Related papers (2024-01-25T06:45:32Z)
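One plausible reading of the predictor-plus-selector design in the Routoo entry above, sketched with a hypothetical per-model score function and cost table (not Routoo's actual components):

```python
from typing import Callable

def cost_aware_select(query: str,
                      predict_score: Callable[[str, str], float],
                      costs: dict[str, float],
                      min_score: float = 0.8) -> str:
    """Pick the cheapest model whose predicted quality clears min_score;
    fall back to the highest-scoring model if none does."""
    cheapest_first = sorted(costs, key=costs.get)
    for model in cheapest_first:
        if predict_score(query, model) >= min_score:
            return model
    return max(cheapest_first, key=lambda m: predict_score(query, m))
```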
- A bi-objective $\epsilon$-constrained framework for quality-cost optimization in language model ensembles [1.5039745292757671]
We propose an ensembling framework that uses diverse open-source Large Language Models (LLMs) to achieve high response quality while maintaining cost efficiency.
We formulate a bi-objective optimization problem to represent the quality-cost tradeoff and then introduce an additional budget constraint that reduces the problem to a straightforward 0/1 knapsack problem.
arXiv Detail & Related papers (2023-12-26T16:56:22Z)
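The knapsack reduction in the entry above admits a textbook dynamic program. The sketch below assumes integer per-model costs and additive quality scores, a simplification of the paper's formulation:

```python
def select_models(quality: list[float], cost: list[int], budget: int) -> list[int]:
    """0/1 knapsack: choose model indices maximizing total quality
    subject to total cost <= budget."""
    best = [(0.0, [])] * (budget + 1)  # best[b] = (quality, indices) within budget b
    for i in range(len(quality)):
        for b in range(budget, cost[i] - 1, -1):
            q, picked = best[b - cost[i]]
            if q + quality[i] > best[b][0]:
                best[b] = (q + quality[i], picked + [i])
    return best[budget][1]
```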
- AutoMix: Automatically Mixing Language Models [62.51238143437967]
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations.
We present AutoMix, an approach that strategically routes queries to larger LMs based on the approximate correctness of outputs from a smaller LM.
arXiv Detail & Related papers (2023-10-19T17:57:39Z)
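A hedged sketch of the route-on-approximate-correctness pattern from the AutoMix entry; `verify` is a placeholder for the correctness estimator (AutoMix itself uses few-shot self-verification, which this stub does not reproduce):

```python
def automix_style_answer(query: str, small_lm, large_lm, verify,
                         accept: float = 0.7) -> str:
    """Draft with the small LM, score the draft's approximate correctness,
    and escalate to the large LM only when the score is low.

    verify(query, answer) -> float in [0, 1], e.g. a model-judged
    correctness estimate.
    """
    draft = small_lm(query)
    if verify(query, draft) >= accept:
        return draft          # the draft looks correct enough to keep
    return large_lm(query)    # escalate the hard case
```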
- Elastic Entangled Pair and Qubit Resource Management in Quantum Cloud Computing [73.7522199491117]
Quantum cloud computing (QCC) offers a promising approach to efficiently provide quantum computing resources.
Fluctuations in user demand and quantum circuit requirements make efficient resource provisioning challenging.
We propose a resource allocation model to provision quantum computing and networking resources.
arXiv Detail & Related papers (2023-07-25T00:38:46Z)
- Entangled Pair Resource Allocation under Uncertain Fidelity Requirements [59.83361663430336]
In quantum networks, effective entanglement routing facilitates communication between quantum source and quantum destination nodes.
We propose a resource allocation model for entangled pairs and an entanglement routing model with a fidelity guarantee.
Our proposed model can reduce the total cost by at least 20% compared to the baseline model.
arXiv Detail & Related papers (2023-04-10T07:16:51Z)
- Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.