Related papers: Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

URL: http://arxiv.org/abs/2404.13082v2
Date: Tue, 19 Nov 2024 22:02:49 GMT
Title: Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning
Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen
Abstract summary: TREACLE is a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user's monetary cost and latency constraints. Our evaluations show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy.
Score: 31.972053219549757
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question (i.e., the specific prompt). At the same time, users often have a limit on monetary budget and latency to answer all their questions, and they do not know which LLMs to choose for each question to meet their accuracy and long term budget requirements. To navigate this rich design space, we propose TREACLE ($\underline{T}$hrifty $\underline{Rea}$soning via $\underline{C}$ontext-Aware $\underline{L}$LM and Prompt S$\underline{e}$lection), a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user's monetary cost and latency constraints. TREACLE uses the problem context, including question text embeddings (reflecting the type or difficulty of a query) and the response history (reflecting the consistency of previous responses) to make smart decisions. Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy. Importantly, it provides the user with the ability to gracefully trade off accuracy for cost.

Related papers

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents [39.79150560622891]
We show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
arXiv Detail & Related papers (2026-02-18T18:46:14Z)
Learning Steerable Clarification Policies with Collaborative Self-play [67.67872810596839]
To handle ambiguous queries, AI assistants need a policy for managing their uncertainty. We propose to train steerable policies for managing this uncertainty using self-play. We show this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs.
arXiv Detail & Related papers (2025-12-03T18:49:54Z)
Adaptive LLM Routing under Budget Constraints [12.432635540782874]
Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. We propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback.
arXiv Detail & Related papers (2025-08-28T18:18:19Z)
MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and has high response latency. We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z)
PickLLM: Context-Aware RL-Assisted Large Language Model Routing [0.5325390073522079]
PickLLM is a lightweight framework that relies on Reinforcement Learning (RL) to route on-the-fly queries to available models. We demonstrate the speed of convergence for different learning rates and improvement in hard metrics such as cost per querying session and overall response latency.
arXiv Detail & Related papers (2024-12-12T06:27:12Z)
Reasoning Robustness of LLMs to Adversarial Typographical Errors [49.99118660264703]
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting. We study the reasoning robustness of LLMs to typographical errors, which can naturally occur in users' queries. We design an Adversarial Typo Attack ($\texttt{ATA}$) algorithm that iteratively samples typos for words that are important to the query and selects the edit that is most likely to succeed in attacking.
arXiv Detail & Related papers (2024-11-08T05:54:05Z)
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval [55.63711219190506]
Large language models (LLMs) often struggle with posing the right search queries. We introduce $\underline{Le}$arning to $\underline{Re}$trieve by $\underline{T}$rying (LeReT). LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%.
arXiv Detail & Related papers (2024-10-30T17:02:54Z)
MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs [20.793892860721712]
We introduce MetaLLM, a framework that dynamically and intelligently routes each query to the optimal large language models (LLMs). By framing the selection problem as a multi-armed bandit, MetaLLM balances prediction accuracy and cost efficiency under uncertainty. Our experiments, conducted on popular LLM platforms such as OpenAI and Together AI, showcase MetaLLM's efficacy in real-world scenarios.
arXiv Detail & Related papers (2024-07-15T15:45:07Z)
Cost-efficient Knowledge-based Question Answering with Large Language Models [28.816821631082856]
Knowledge-based question answering (KBQA) is widely used in many scenarios that necessitate domain knowledge. Large language models (LLMs) bring opportunities to KBQA, but they cost significantly more and lack domain-specific knowledge from pre-training. We propose Coke, a novel cost-efficient strategy for KBQA with LLMs, modeled as a tailored multi-armed bandit problem.
arXiv Detail & Related papers (2024-05-27T16:37:34Z)
Cost-Effective Online Multi-LLM Selection with Versatile Reward Models [30.892090566736652]
We introduce C2MAB-V, an online model for selecting and using large language models (LLMs). C2MAB-V is specifically tailored for various collaborative task types with different reward models. We show that C2MAB-V effectively balances performance and cost-efficiency with nine LLMs for three application scenarios.
arXiv Detail & Related papers (2024-05-26T14:38:24Z)
OptLLM: Optimal Assignment of Queries to Large Language Models [12.07164196530872]
We propose a framework for addressing the cost-effective query allocation problem for large language models (LLMs). Our framework, named OptLLM, provides users with a range of optimal solutions to choose from, aligning with their budget constraints and performance preferences. To evaluate the effectiveness of OptLLM, we conduct extensive experiments on various types of tasks, including text classification, question answering, sentiment analysis, reasoning, and log parsing.
arXiv Detail & Related papers (2024-05-24T01:05:37Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs). Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts. LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Question Answering as Programming for Solving Time-Sensitive Questions [84.07553016489769]
Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world. Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering. This can be attributed to the LLMs' inability to perform rigorous reasoning based on surface-level text semantics. We propose a novel approach where we reframe the $\textbf{Q}$uestion $\textbf{A}$nswering task as programming.
arXiv Detail & Related papers (2023-05-23T16:35:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.