Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees
- URL: http://arxiv.org/abs/2505.19947v1
- Date: Mon, 26 May 2025 13:11:08 GMT
- Title: Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees
- Authors: Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen
- Abstract summary: Open-weight LLM zoos provide access to numerous high-quality models. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities. We introduce MESS+, a stochastic optimization algorithm for cost-optimal request routing.
- Score: 21.2175476090125
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open-weight LLM zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.
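As a rough illustration of the routing idea in the abstract, the sketch below combines a virtual queue (tracking the SLA deficit) with running per-model satisfaction estimates and, for each request, picks the model minimizing a cost-versus-constraint trade-off. The model names, costs, SLA target, trade-off weight V, and the simulated feedback are illustrative assumptions, not details from the paper.

```python
import random

# Illustrative model zoo: name -> cost per request (assumed values).
MODELS = {"small-llm": 0.2, "medium-llm": 1.0, "large-llm": 4.0}

ALPHA = 0.9   # SLA target: minimum fraction of satisfied requests
V = 5.0       # trade-off weight between cost and constraint pressure

queue = 0.0                           # virtual queue tracking the SLA deficit
p_hat = {m: 0.5 for m in MODELS}      # running satisfaction-probability estimates
counts = {m: 1 for m in MODELS}

def route(request):
    """Pick the model minimizing V * cost - queue * predicted satisfaction.

    In this toy version the satisfaction estimate is per-model only; the paper
    predicts satisfaction per request.
    """
    return min(MODELS, key=lambda m: V * MODELS[m] - queue * p_hat[m])

def feedback(model, satisfied):
    """Update the virtual queue and the chosen model's satisfaction estimate."""
    global queue
    queue = max(queue + ALPHA - float(satisfied), 0.0)
    counts[model] += 1
    p_hat[model] += (float(satisfied) - p_hat[model]) / counts[model]

# Toy loop: satisfaction is simulated; in practice it comes from user feedback.
for t in range(1000):
    m = route(f"request-{t}")
    satisfied = random.random() < {"small-llm": 0.6, "medium-llm": 0.85, "large-llm": 0.97}[m]
    feedback(m, satisfied)
```

The more constrained the system becomes (the virtual queue grows whenever the satisfaction rate falls below the SLA target), the more the router is pushed toward higher-quality, more expensive models.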
Related papers
- AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length [5.856039862078523]
We introduce AdaptiveLLM, a framework that dynamically selects the optimal Large Language Model (LLM) for a given coding task by automatically assessing task difficulty. Our framework first estimates task difficulty using Chain-of-Thought lengths generated by a reasoning model, clusters these into three difficulty levels via k-means, and fine-tunes CodeBERT to embed difficulty-aware features. Our framework achieves a 7.86% improvement in pass@1 score while reducing resource consumption by 88.9% compared to the baseline method ComplexityNet.
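The first stage described above can be sketched in a few lines: cluster Chain-of-Thought lengths with k-means into three difficulty levels and map each level to a model tier. The lengths and model tiers below are made up for illustration, and the CodeBERT fine-tuning step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical CoT token lengths produced by a reasoning model for coding tasks.
cot_lengths = np.array([[120], [150], [900], [1100], [400], [380], [1050], [130], [420]])

# Cluster into three difficulty levels, as in the paper's first stage.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(cot_lengths)

# Order cluster ids by centroid so that 0 = easy, 1 = medium, 2 = hard.
order = np.argsort(kmeans.cluster_centers_.ravel())
level_of_cluster = {cluster: level for level, cluster in enumerate(order)}

# Illustrative mapping from difficulty level to a model tier (assumed, not from the paper).
tier = {0: "small-code-llm", 1: "mid-code-llm", 2: "large-code-llm"}

for length, cluster in zip(cot_lengths.ravel(), kmeans.labels_):
    level = level_of_cluster[cluster]
    print(f"CoT length {length:>5} -> difficulty {level} -> {tier[level]}")
```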
arXiv Detail & Related papers (2025-06-12T09:43:48Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language Models. Controlled Decoding provides a mechanism for aligning a model at inference time without retraining. We propose a mixture of agent-based decoding strategies leveraging existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Smart Routing: Cost-Effective Multi-LLM Serving for Multi-Core AIOS [31.60019342381251]
Existing scheduling frameworks mainly target latency optimization. This paper proposes ECCOS, an efficient capability-cost coordinated scheduling framework for multi-LLM serving.
arXiv Detail & Related papers (2025-02-27T22:35:31Z) - Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics [0.6999740786886538]
We introduce OptiRoute, an advanced model routing engine designed to dynamically select and route tasks to the optimal large language model (LLM). OptiRoute captures both functional (e.g., accuracy, speed, cost) and non-functional (e.g., helpfulness, harmlessness, honesty) criteria to efficiently match tasks with the best-fit models. This makes it ideal for real-time applications in cloud-based ML platforms, personalized AI services, and regulated industries.
arXiv Detail & Related papers (2025-02-23T19:23:22Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
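A feature-weighted Lasso of this kind can be emulated with a standard solver via a reparameterization trick: dividing column j by its penalty factor makes that feature's effective L1 penalty proportional to the factor. The sketch below uses placeholder penalty factors standing in for what an LLM would produce; it is not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hypothetical LLM-derived penalty factors: lower = judged more relevant by the LLM.
penalty = np.array([0.2, 0.3, 1.0, 1.0, 1.0])

# Emulate a weighted Lasso: column j scaled by 1/penalty[j] gives it an
# effective penalty proportional to penalty[j].
X_scaled = X / penalty
model = Lasso(alpha=0.1).fit(X_scaled, y)

# Map coefficients back to the original feature scale.
coefs = model.coef_ / penalty
print(coefs)
```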
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) have recently shown potential for artificial general intelligence, but their usage is costly and their response latency is high. We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
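For intuition, a toy epsilon-greedy contextual bandit for query-LLM assignment might look like the sketch below. The context feature (a query-length bucket), the arm names, and the reward definition are simplified assumptions, not MixLLM's actual design.

```python
import random
from collections import defaultdict

ARMS = ["cheap-llm", "strong-llm"]   # illustrative model zoo
EPSILON = 0.1

# Per (context, arm) running reward estimates; the context here is a coarse
# query-length bucket, a stand-in for a learned query representation.
values = defaultdict(float)
counts = defaultdict(int)

def context_of(query: str) -> str:
    return "long" if len(query.split()) > 20 else "short"

def choose(query: str) -> str:
    ctx = context_of(query)
    if random.random() < EPSILON:
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: values[(ctx, a)])

def update(query: str, arm: str, reward: float) -> None:
    """Reward could combine answer quality, latency, and cost."""
    ctx = context_of(query)
    counts[(ctx, arm)] += 1
    values[(ctx, arm)] += (reward - values[(ctx, arm)]) / counts[(ctx, arm)]

# Toy usage: the reward is stubbed; in practice it comes from a judge or user feedback.
q = "Summarize the attention mechanism in transformers for a graduate-level reader."
arm = choose(q)
update(q, arm, reward=0.8)
```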
arXiv Detail & Related papers (2025-02-09T02:26:15Z) - LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing [3.090041654375235]
We present a novel framework that formulates the LLM selection process as a multi-armed bandit problem. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time. Our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms.
arXiv Detail & Related papers (2025-02-04T22:09:43Z) - Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems [1.430963201405577]
Large Language Model (LLM)-based systems are usually designed with a single, general-purpose LLM to handle all user queries. These systems may be inefficient, as different queries may require different levels of reasoning, domain knowledge, or pre-processing. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models.
arXiv Detail & Related papers (2025-02-01T12:08:38Z) - EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z) - Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs.
We propose a simpler solution: we use only the uncertainty of the small LLM's generations as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
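A minimal sketch of such an uncertainty-gated two-tier cascade is shown below: score the small model's generation by its mean negative token log-probability and escalate to the large model when it exceeds a threshold. The scoring function, threshold value, and model interfaces are placeholders, not the paper's exact criterion.

```python
import math

UNCERTAINTY_THRESHOLD = 0.35   # would be tuned on a validation set; value is illustrative

def mean_token_uncertainty(token_logprobs):
    """Average negative log-probability of generated tokens (higher = less certain)."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)

def two_tier_answer(query, small_llm, large_llm):
    """Call the small model first; escalate only when its generation looks uncertain.

    `small_llm` / `large_llm` are assumed to return (text, token_logprobs).
    """
    text, logprobs = small_llm(query)
    if mean_token_uncertainty(logprobs) <= UNCERTAINTY_THRESHOLD:
        return text, "small"
    text, _ = large_llm(query)
    return text, "large"

# Toy usage with stubbed models.
small = lambda q: ("42", [math.log(0.6), math.log(0.7)])
large = lambda q: ("The answer is 42.", [math.log(0.95)])
print(two_tier_answer("What is 6 x 7?", small, large))
```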
arXiv Detail & Related papers (2024-05-03T14:38:59Z) - SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees [21.801053526411415]
Large Language Models (LLMs) have significantly boosted performance in natural language processing (NLP) tasks.
The deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance.
We introduce SMART, a novel framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality.
arXiv Detail & Related papers (2024-03-11T17:45:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.