Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching
- URL: http://arxiv.org/abs/2506.14852v1
- Date: Tue, 17 Jun 2025 04:42:30 GMT
- Title: Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching
- Authors: Qizheng Zhang, Michael Wornow, Kunle Olukotun
- Abstract summary: LLM-based agentic applications incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates.
- Score: 2.382770686742571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
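The serving pipeline the abstract describes follows an extract → match → adapt loop, which a short sketch can make concrete. Everything below is a hedged illustration, not the paper's code: the keyword extractor, the Jaccard-threshold matching, and the `run_large_planner` / `adapt_with_small_model` callables are assumptions standing in for components the listing does not detail.

```python
# Hedged sketch of agentic plan caching (illustrative; not the paper's system).
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

STOPWORDS = {"the", "a", "an", "to", "of", "for", "and", "in", "on", "with"}

def extract_keywords(task: str) -> frozenset[str]:
    """Crude keyword extraction: lowercase, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,?!:;").lower() for t in task.split())
    return frozenset(t for t in tokens if len(t) > 2 and t not in STOPWORDS)

@dataclass
class PlanCache:
    """Maps keyword sets from completed executions to reusable plan templates."""
    entries: list[tuple[frozenset[str], str]] = field(default_factory=list)
    min_similarity: float = 0.6  # assumed Jaccard threshold for a cache hit

    def lookup(self, keywords: frozenset[str]) -> str | None:
        best_score, best_template = 0.0, None
        for cached, template in self.entries:
            union = keywords | cached
            score = len(keywords & cached) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_template = score, template
        return best_template if best_score >= self.min_similarity else None

    def store(self, keywords: frozenset[str], template: str) -> None:
        self.entries.append((keywords, template))

def serve(
    task: str,
    context: str,
    cache: PlanCache,
    run_large_planner: Callable[[str, str], str],
    adapt_with_small_model: Callable[[str, str, str], str],
) -> str:
    keywords = extract_keywords(task)
    template = cache.lookup(keywords)
    if template is not None:
        # Cache hit: a lightweight model specializes the stored template to
        # this task's context, avoiding the expensive planner call.
        return adapt_with_small_model(template, task, context)
    # Cache miss: run the large planner once, then store the result as a
    # template (a real system would abstract away task-specific details).
    plan = run_large_planner(task, context)
    cache.store(keywords, plan)
    return plan
```

On a cache hit the only model call is to the small adapter, which is where the reported savings would come from; a miss pays the full planner cost once and seeds the cache for later, semantically similar tasks.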
Related papers
- SmartLLMs Scheduler: A Framework for Cost-Effective LLMs Utilization [9.615876932810126]
Large Language Models (LLMs) have shown remarkable capabilities in a variety of software engineering tasks. Existing optimization strategies for deploying LLMs for diverse tasks focus on static scheduling. We propose the SmartLLMs Scheduler (SLS), a dynamic and cost-effective scheduling solution.
arXiv Detail & Related papers (2025-08-05T09:35:52Z)
- OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency. We introduce OmniRouter, a controllable routing framework for multi-LLM serving (see the cost-aware routing sketch after this list). Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z)
- Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems [1.430963201405577]
Large Language Model (LLM)-based systems are usually designed with a single, general-purpose LLM to handle all user queries. These systems may be inefficient, as different queries may require different levels of reasoning, domain knowledge, or pre-processing. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models.
arXiv Detail & Related papers (2025-02-01T12:08:38Z)
- CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning [43.13654681136326]
We propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework for cost-aware tool planning. Specifically, we design a tool planning language that enables the LLM to create multi-branch, non-sequential plans. We also present OpenCATP, the first dataset for cost-aware planning, which comprises 11,100 evaluation samples from diverse tasks.
arXiv Detail & Related papers (2024-11-25T12:05:49Z)
- DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [10.712756715779822]
Large Language Models (LLMs) have shown promise in data processing. Existing LLM-powered frameworks focus on reducing cost when executing user-specified operations, which is problematic for complex tasks and data. We present DocETL, a system that optimizes complex document processing pipelines.
arXiv Detail & Related papers (2024-10-16T03:22:35Z)
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline. Recent works have started exploiting large language models (LLMs) to lessen this burden. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
- Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing the planning capabilities of large language models (LLMs) by using planning data derived from knowledge graphs (KGs).
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Ask-before-Plan: Proactive Language Agents for Real-World Planning [68.08024918064503]
Proactive Agent Planning requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction.
We propose a novel multi-agent framework, Clarification-Execution-Planning (CEP), which consists of three agents specialized in clarification, execution, and planning.
arXiv Detail & Related papers (2024-06-18T14:07:28Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents [39.53593677934238]
Large Language Models (LLMs) enable AI Agents to automatically generate and execute multi-step plans to solve complex tasks.
However, current LLM-based agents frequently generate invalid or non-executable plans.
This paper proposes a novel "Formal-LLM" framework for LLM-based agents by integrating the expressiveness of natural language and the precision of formal language.
arXiv Detail & Related papers (2024-02-01T17:30:50Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, which leverages the in-context learning capability of LLMs to handle multiple task inputs in a single prompt (see the batching sketch after this list).
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without significantly degrading task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
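Several entries above (OmniRouter and the routing survey) share one idea: send each query to the cheapest model that is likely good enough. The following is a minimal sketch of that general idea, not OmniRouter's algorithm; the model names, per-token prices, and quality scores are invented, and a real router would learn the difficulty estimate rather than receive it.

```python
# Generic cost-aware routing sketch (illustrative only; all numbers made up).
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # USD, assumed pricing
    quality: float             # offline benchmark score in [0, 1], assumed

MODELS = [
    ModelOption("small-llm", 0.0002, 0.71),
    ModelOption("medium-llm", 0.0010, 0.84),
    ModelOption("large-llm", 0.0060, 0.93),
]

def route(estimated_difficulty: float, budget_per_1k_tokens: float) -> ModelOption:
    """Pick the cheapest model expected to handle the query, subject to budget."""
    affordable = [m for m in MODELS
                  if m.cost_per_1k_tokens <= budget_per_1k_tokens] or MODELS
    capable = [m for m in affordable if m.quality >= estimated_difficulty]
    if capable:
        return min(capable, key=lambda m: m.cost_per_1k_tokens)  # cheapest adequate
    return max(affordable, key=lambda m: m.quality)  # best effort within budget

if __name__ == "__main__":
    print(route(estimated_difficulty=0.8, budget_per_1k_tokens=0.002).name)
    # -> medium-llm: cheapest affordable model meeting the difficulty bar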
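The OverPrompt entry rests on packing several task inputs into one prompt so the instruction tokens are paid once per batch rather than once per input. A small sketch of that batching idea follows; the prompt layout is an assumption, not the paper's template.

```python
# Sketch of batched zero-shot prompting in the spirit of OverPrompt
# (assumed prompt layout; not the paper's exact template).

def build_batched_prompt(instruction: str, inputs: list[str]) -> str:
    """Pack many inputs under one instruction so its tokens are paid once."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(inputs))
    return (
        f"{instruction}\n"
        "Answer for each numbered input, one label per line:\n"
        f"{numbered}"
    )

if __name__ == "__main__":
    prompt = build_batched_prompt(
        "Classify the sentiment of each review as positive or negative.",
        ["Great battery life.", "The screen cracked within a week."],
    )
    print(prompt)  # sent to the LLM once, instead of one call per review
```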
This list is automatically generated from the titles and abstracts of the papers on this site.