Related papers: Routoo: Learning to Route to Large Language Models Effectively

URL: http://arxiv.org/abs/2401.13979v3
Date: Wed, 02 Oct 2024 08:51:45 GMT
Title: Routoo: Learning to Route to Large Language Models Effectively
Authors: Alireza Mohammadshahi, Arshad Rafiq Shaikh, Majid Yazdani
Abstract summary: Routoo is an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency. Routoo comprises two key components: a performance predictor and a cost-aware selector. Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third.
Score: 6.322844087292882
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs with superior response quality--particularly larger or closed-source models--often come with higher inference costs, making their deployment inefficient and costly. Meanwhile, developing foundational LLMs from scratch is becoming increasingly resource-intensive and impractical for many applications. To address the challenge of balancing quality and cost, we introduce Routoo, an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency. Routoo provides controllability over the trade-off between inference cost and quality, enabling significant reductions in inference costs for a given quality requirement. Routoo comprises two key components: a performance predictor and cost-aware selector. The performance predictor is a lightweight LLM that estimates the expected performance of various underlying LLMs on a given prompt without executing them. The cost-aware selector module then selects the most suitable model based on these predictions and constraints such as cost and latency, significantly reducing inference costs for the same quality. We evaluated Routoo using the MMLU benchmark across 57 domains employing open-source models. Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third. Additionally, by allowing increased costs, Routoo surpasses Mixtral's accuracy by over 5% at equivalent costs, achieving an accuracy of 75.9%. When integrating GPT4 into our model pool, Routoo nearly matches GPT4's performance at half the cost and exceeds it with a 25% cost reduction. These outcomes highlight Routoo's potential to significantly reduce inference costs without compromising quality, and even to establish new state-of-the-art results by leveraging the collective capabilities of multiple LLMs.
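The two-stage design described in the abstract (predict per-model performance, then pick a model under cost constraints) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` fields, the linear score-minus-cost objective, the `cost_weight` and `max_cost` parameters, and the model names and numbers are all assumptions made for illustration, and the predicted scores are hard-coded where Routoo would obtain them from its performance predictor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str               # hypothetical model identifier
    predicted_score: float  # assumed output of a performance predictor, in [0, 1]
    cost: float             # assumed cost per query, arbitrary units


def select_model(
    candidates: Sequence[Candidate],
    cost_weight: float = 0.0,
    max_cost: Optional[float] = None,
) -> Candidate:
    """Return the candidate maximizing predicted_score - cost_weight * cost.

    Candidates whose cost exceeds max_cost (when given) are excluded first.
    The linear objective is an illustrative choice, not taken from the paper.
    """
    feasible = [c for c in candidates if max_cost is None or c.cost <= max_cost]
    if not feasible:
        raise ValueError("no candidate satisfies the cost constraint")
    return max(feasible, key=lambda c: c.predicted_score - cost_weight * c.cost)


# Made-up pool: scores and costs are placeholders, not measured values.
pool = [
    Candidate("small-model", predicted_score=0.62, cost=1.0),
    Candidate("medium-model", predicted_score=0.71, cost=3.0),
    Candidate("large-model", predicted_score=0.80, cost=10.0),
]

quality_first = select_model(pool)                 # ignores cost entirely
balanced = select_model(pool, cost_weight=0.02)    # penalizes expensive models
capped = select_model(pool, max_cost=5.0)          # hard budget per query
```

Raising `cost_weight` or lowering `max_cost` shifts selections toward cheaper models, which is one simple way to expose the kind of cost-quality control the abstract describes.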

Related papers

Smart Routing: Cost-Effective Multi-LLM Serving for Multi-Core AIOS [31.60019342381251]
Existing scheduling frameworks mainly target latency optimization. This paper proposes an efficient capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving.
arXiv Detail & Related papers (2025-02-27T22:35:31Z)
Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and comes with high response latency. We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model only, while achieving significantly better accuracy than the parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
OptLLM: Optimal Assignment of Queries to Large Language Models [12.07164196530872]
We propose a framework for addressing the cost-effective query allocation problem for large language models (LLMs). Our framework, named OptLLM, provides users with a range of optimal solutions to choose from, aligning with their budget constraints and performance preferences. To evaluate the effectiveness of OptLLM, we conduct extensive experiments on various types of tasks, including text classification, question answering, sentiment analysis, reasoning, and log parsing.
arXiv Detail & Related papers (2024-05-24T01:05:37Z)
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs. We propose a simpler solution: we use only the uncertainty of the small LLM's generations as the decision criterion. Our experiments reveal that this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z)
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size. We propose a hybrid inference approach that combines the respective strengths of a small and a large model to save cost and maintain quality. In experiments, our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees [21.801053526411415]
Large Language Models (LLMs) have significantly boosted performance in natural language processing (NLP) tasks. The deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance. We introduce SMART, a novel framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality.
arXiv Detail & Related papers (2024-03-11T17:45:47Z)
Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning [57.154674117714265]
We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy. We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
arXiv Detail & Related papers (2024-03-08T15:30:58Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
Towards Optimizing the Costs of LLM Usage [4.032848774697859]
We study optimization problems trading off quality and cost, both theoretically and empirically. We propose several deterministic heuristics for reducing tokens in a quality-aware manner. Our methods reduce costs by 40%-90% while improving quality by 4%-7%.
arXiv Detail & Related papers (2024-01-29T16:36:31Z)
Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models [53.859446823312126]
SoTA open-source models of varying sizes, from 7B to 65B, improve on average by 8.2% over their baseline performance. Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show an 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open-ended tasks.
arXiv Detail & Related papers (2023-10-11T15:56:00Z)
On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.