Related papers: RouteLLM: Learning to Route LLMs with Preference Data

RouteLLM: Learning to Route LLMs with Preference Data

URL: http://arxiv.org/abs/2406.18665v3
Date: Sun, 21 Jul 2024 10:33:08 GMT
Title: RouteLLM: Learning to Route LLMs with Preference Data
Authors: Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica,
Abstract summary: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
Score: 41.687640419561504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.

Related papers

Leveraging In-Context Learning for Language Model Agents [51.2996117207114]
In-context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance.<n>We show that set-selection of trajectories of similar tasks as demonstrations significantly improves performance, reliability, robustness, and efficiency of LLM agents.<n>We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents.
arXiv Detail & Related papers (2025-06-16T05:37:49Z)
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks [6.621120466118939]
Model routing allocates queries to the suitable model, improving system performance while reducing costs.<n>We propose Tag, a training-free model routing method designed to optimize the synergy among multiple large language models (LLM)<n> Experimental results demonstrate that Tag outperforms 13 baseline methods, increasing the accept rate of system by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency.
arXiv Detail & Related papers (2025-06-14T12:17:47Z)
LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead [19.573553157421774]
Light is a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool.<n>Experiments demonstrate that Light matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy.<n>This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
arXiv Detail & Related papers (2025-05-22T04:46:04Z)
Cost-Optimal Grouped-Query Attention for Long-Context LLMs [64.90662568387683]
Building effective Transformer-based large language models (LLMs) has recently become a research focus. We compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Our studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.
arXiv Detail & Related papers (2025-03-12T17:50:42Z)
OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency.<n>We introduce Omni, a controllable routing framework for multi-LLM serving.<n>Experiments show that Omni achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z)
Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency. We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z)
Training Language Models to Reason Efficiently [14.390800014819439]
We use reinforcement learning to train large reasoning models to reason efficiently. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
arXiv Detail & Related papers (2025-02-06T19:18:16Z)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws. We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models [5.716829002003189]
Existing routing models rely on learning the optimal routing decision from carefully curated data. We propose Real-time Adaptive Routing (RAR), an approach to continuously adapt FM routing decisions. RAR routes 50.2% fewer requests to computationally expensive models while maintaining around 90.5% of the general response quality.
arXiv Detail & Related papers (2024-11-14T23:02:30Z)
TensorOpera Router: A Multi-Model Router for Efficient LLM Inference [27.2803289964386]
TO-lemma is a non-monolithic LLM querying system. It seamlessly integrates various LLM experts into a single query interface. It dynamically routes incoming queries to the most high-performant expert based on query's requirements.
arXiv Detail & Related papers (2024-08-22T11:57:07Z)
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
Decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs. We propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z)
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size. We propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits [43.65904435249823]
We propose a time-increasing bandit algorithm TI-UCB, which effectively predicts the increase of model performances. Our results highlight the importance of utilizing increasing-then-converging pattern for more efficient and economic model selection.
arXiv Detail & Related papers (2024-03-11T23:52:46Z)
Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing. LLMs are extremely computationally expensive, even at inference time. We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
Online and Scalable Model Selection with Multi-Armed Bandits [0.0]
We present Automatic Model Selector (AMS), a system for scalable online selection of bidding strategies based on real-world performance metrics. AMS allocates the most traffic to the best-performing models while decreasing traffic to those with poorer online performance. In live-traffic tests on multiple ad campaigns, the AMS system proved highly effective at improving ad campaign performance.
arXiv Detail & Related papers (2021-01-25T20:12:52Z)
Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent surface (IRS) has been employed to reshape the wireless channels by controlling individual scattering elements' phase shifts. Due to the large size of scattering elements, the passive beamforming is typically challenged by the high computational complexity. In this article, we focus on machine learning (ML) approaches for performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.