Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- URL: http://arxiv.org/abs/2404.14618v1
- Date: Mon, 22 Apr 2024 23:06:42 GMT
- Title: Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- Authors: Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah
- Abstract summary: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach that combines the respective strengths of large and small models to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
- Score: 53.748685766139715
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. Therefore, in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
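To make the routing scheme concrete, here is a minimal Python sketch of a quality-aware router in the spirit of the abstract. The difficulty predictor, model handles, and the `quality_level` knob are illustrative stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HybridRouter:
    """Route each query to a small or large LLM based on predicted difficulty."""
    predict_difficulty: Callable[[str], float]  # query -> score in [0, 1]
    small_model: Callable[[str], str]           # cheap, edge-deployable model
    large_model: Callable[[str], str]           # expensive cloud model

    def answer(self, query: str, quality_level: float = 0.5) -> str:
        # quality_level in [0, 1] is the test-time knob: raising it lowers
        # the difficulty threshold, so more queries go to the large model.
        threshold = 1.0 - quality_level
        if self.predict_difficulty(query) <= threshold:
            return self.small_model(query)
        return self.large_model(query)
```

Sweeping `quality_level` at test time traces out the cost-quality tradeoff; the paper's reported operating point makes up to 40% fewer calls to the large model with no drop in response quality.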
Related papers
- Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance [0.0]
Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks.
Smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts.
This paper presents a novel hybrid inference approach that leverages the strengths of both model types.
arXiv Detail & Related papers (2024-09-15T15:12:45Z)
- RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.
We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.
We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
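As a rough illustration of the preference-trained router idea above (not RouteLLM's actual code), one could fit a simple classifier on labels indicating when the stronger model's response was preferred, assuming query embeddings from any encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_preference_router(embeddings: np.ndarray, strong_wins: np.ndarray) -> LogisticRegression:
    """Fit a router on preference data.

    embeddings:  (n, d) query embeddings.
    strong_wins: (n,) binary labels, 1 when the stronger model's
                 response was preferred for that query.
    """
    router = LogisticRegression(max_iter=1000)
    router.fit(embeddings, strong_wins)
    return router

def route(router: LogisticRegression, embedding: np.ndarray, threshold: float = 0.5) -> str:
    # Send the query to the stronger model only when the predicted win
    # probability clears the threshold; lowering it spends more on quality.
    p_strong = router.predict_proba(embedding.reshape(1, -1))[0, 1]
    return "strong" if p_strong > threshold else "weak"
```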
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt$+$ (DS$+$), a general paradigm for the collaboration of small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS$+$ achieves 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
- Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced cost.
We propose a simpler solution: we use only the uncertainty of the small LLM's generations as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z)
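A minimal sketch of the uncertainty criterion from the entry above, assuming the small model exposes per-token log-probabilities (the interfaces here are illustrative, not the paper's API):

```python
def mean_token_logprob(logprobs: list[float]) -> float:
    """Average per-token log-probability: one simple uncertainty proxy."""
    return sum(logprobs) / max(len(logprobs), 1)

def two_tier_answer(query: str, small_llm, large_llm, tau: float = -1.0) -> str:
    """Answer with the small LLM; escalate to the large LLM when uncertain.

    small_llm(query) is assumed to return (text, per_token_logprobs).
    tau is the uncertainty threshold: generations whose mean token
    log-probability falls below tau are handed to the large model.
    """
    text, logprobs = small_llm(query)
    if mean_token_logprob(logprobs) >= tau:
        return text           # confident: keep the cheap answer
    return large_llm(query)   # uncertain: pay for the large model
```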
- Routoo: Learning to Route to Large Language Models Effectively [6.322844087292882]
Routoo is an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency.
Routoo comprises two key components: a performance predictor and a cost-aware selector.
Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third.
arXiv Detail & Related papers (2024-01-25T06:45:32Z)
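One plausible reading of the predictor-plus-selector design in the Routoo entry above, sketched with a hypothetical per-model score function and cost table (not Routoo's actual components):

```python
from typing import Callable

def cost_aware_select(query: str,
                      predict_score: Callable[[str, str], float],
                      costs: dict[str, float],
                      min_score: float = 0.8) -> str:
    """Pick the cheapest model whose predicted quality clears min_score;
    fall back to the highest-scoring model if none does."""
    cheapest_first = sorted(costs, key=costs.get)
    for model in cheapest_first:
        if predict_score(query, model) >= min_score:
            return model
    return max(cheapest_first, key=lambda m: predict_score(query, m))
```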
- A bi-objective $\epsilon$-constrained framework for quality-cost optimization in language model ensembles [1.5039745292757671]
We propose an ensembling framework that uses diverse open-source Large Language Models (LLMs) to achieve high response quality while maintaining cost efficiency.
We formulate a bi-objective optimization problem to represent the quality-cost tradeoff and then introduce an additional budget constraint that reduces the problem to a straightforward 0/1 knapsack problem.
arXiv Detail & Related papers (2023-12-26T16:56:22Z)
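The knapsack reduction in the entry above admits a textbook dynamic program. The sketch below assumes integer per-model costs and additive quality scores, a simplification of the paper's formulation:

```python
def select_models(quality: list[float], cost: list[int], budget: int) -> list[int]:
    """0/1 knapsack: choose model indices maximizing total quality
    subject to total cost <= budget."""
    best = [(0.0, [])] * (budget + 1)  # best[b] = (quality, indices) within budget b
    for i in range(len(quality)):
        for b in range(budget, cost[i] - 1, -1):
            q, picked = best[b - cost[i]]
            if q + quality[i] > best[b][0]:
                best[b] = (q + quality[i], picked + [i])
    return best[budget][1]
```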
- AutoMix: Automatically Mixing Language Models [62.51238143437967]
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations.
We present AutoMix, an approach that strategically routes queries to larger LMs based on the approximate correctness of outputs from a smaller LM.
arXiv Detail & Related papers (2023-10-19T17:57:39Z)
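A hedged sketch of the route-on-approximate-correctness pattern from the AutoMix entry; `verify` is a placeholder for the correctness estimator (AutoMix itself uses few-shot self-verification, which this stub does not reproduce):

```python
def automix_style_answer(query: str, small_lm, large_lm, verify,
                         accept: float = 0.7) -> str:
    """Draft with the small LM, score the draft's approximate correctness,
    and escalate to the large LM only when the score is low.

    verify(query, answer) -> float in [0, 1], e.g. a model-judged
    correctness estimate.
    """
    draft = small_lm(query)
    if verify(query, draft) >= accept:
        return draft          # the draft looks correct enough to keep
    return large_lm(query)    # escalate the hard case
```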
- Elastic Entangled Pair and Qubit Resource Management in Quantum Cloud Computing [73.7522199491117]
Quantum cloud computing (QCC) offers a promising approach to efficiently provide quantum computing resources.
Fluctuations in user demand and quantum circuit requirements make efficient resource provisioning challenging.
We propose a resource allocation model to provision quantum computing and networking resources.
arXiv Detail & Related papers (2023-07-25T00:38:46Z)
- Entangled Pair Resource Allocation under Uncertain Fidelity Requirements [59.83361663430336]
In quantum networks, effective entanglement routing facilitates communication between quantum source and quantum destination nodes.
We propose a resource allocation model for entangled pairs and an entanglement routing model with a fidelity guarantee.
Our proposed model can reduce the total cost by at least 20% compared to the baseline model.
arXiv Detail & Related papers (2023-04-10T07:16:51Z)
- Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.