RouteLLM: Learning to Route LLMs with Preference Data
- URL: http://arxiv.org/abs/2406.18665v3
- Date: Sun, 21 Jul 2024 10:33:08 GMT
- Title: RouteLLM: Learning to Route LLMs with Preference Data
- Authors: Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica,
- Abstract summary: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.
We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.
We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
- Score: 41.687640419561504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time.
We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts.
We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
arXiv Detail & Related papers (2025-02-12T20:30:28Z) - Training Language Models to Reason Efficiently [14.390800014819439]
We use reinforcement learning to train large reasoning models to reason efficiently.
Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy.
Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
arXiv Detail & Related papers (2025-02-06T19:18:16Z) - CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
CITER enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy.
We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation.
Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z) - FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z) - Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models [5.716829002003189]
Existing routing models rely on learning the optimal routing decision from carefully curated data.
We propose Real-time Adaptive Routing (RAR), an approach to continuously adapt FM routing decisions.
RAR routes 50.2% fewer requests to computationally expensive models while maintaining around 90.5% of the general response quality.
arXiv Detail & Related papers (2024-11-14T23:02:30Z) - Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z) - Optimization-driven Machine Learning for Intelligent Reflecting Surfaces
Assisted Wireless Networks [82.33619654835348]
Intelligent surface (IRS) has been employed to reshape the wireless channels by controlling individual scattering elements' phase shifts.
Due to the large size of scattering elements, the passive beamforming is typically challenged by the high computational complexity.
In this article, we focus on machine learning (ML) approaches for performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.