Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
- URL: http://arxiv.org/abs/2601.06007v1
- Date: Fri, 09 Jan 2026 18:41:57 GMT
- Title: Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
- Authors: Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah
- Abstract summary: We present a comprehensive evaluation of prompt caching across three major Large Language Model (LLM) providers. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies: full-context caching, system-prompt-only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.
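These strategies all exploit prefix caching: providers reuse cached computation for the longest previously seen prompt prefix, so a single dynamic token early in the prompt invalidates everything after it. Below is a minimal sketch of the layout the paper recommends, using Anthropic's explicit cache_control breakpoints as one concrete provider API; the model name, prompt text, and dynamic date field are illustrative assumptions, not details taken from the paper.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Static prefix: in the paper's setup, system prompts are ~10,000 tokens.
STATIC_SYSTEM_PROMPT = "You are a deep-research agent... (long, unchanging instructions)"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Cache breakpoint: everything up to and including this block
            # is written to / read from the provider-side prompt cache.
            "cache_control": {"type": "ephemeral"},
        },
        # Dynamic content goes AFTER the breakpoint so it never
        # invalidates the cached static prefix.
        {"type": "text", "text": "Current date: 2026-01-09"},
    ],
    messages=[{"role": "user", "content": "Survey recent work on prompt caching."}],
)
print(response.content[0].text)
```

OpenAI applies prefix caching automatically once prompts exceed a minimum length, and Google exposes explicit cached-content objects; the same layout principle, static content first and dynamic content last, applies across all three providers.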
Related papers
- Efficient Multimodal Planning Agent for Visual Question-Answering [67.26245301307539]
This paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods.
arXiv Detail & Related papers (2026-01-28T14:58:59Z)
- ToolCaching: Towards Efficient Caching for LLM Tool-calling [13.738787213936225]
Caching is a classic solution to the problem of redundant or repeated tool-calling requests. We propose ToolCaching, an efficient feature-driven and adaptive caching framework. ToolCaching achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies.
arXiv Detail & Related papers (2026-01-20T09:25:59Z)
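To make the general idea concrete, here is a hypothetical client-side tool-result cache keyed on the tool name plus normalized arguments, with a TTL so stale dynamic results expire. This is an illustrative sketch of tool-call caching in general, not a reimplementation of ToolCaching's feature-driven policy; all names are made up for the example.

```python
import json
import time
from typing import Any, Callable

class ToolResultCache:
    """Hypothetical cache for LLM tool-call results.

    Keys are the tool name plus canonicalized JSON arguments; entries
    expire after `ttl_s` seconds so dynamic results are not reused forever.
    """

    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store: dict[str, tuple[float, Any]] = {}

    def _key(self, tool: str, args: dict[str, Any]) -> str:
        # sort_keys makes logically identical calls map to the same key
        return tool + "|" + json.dumps(args, sort_keys=True)

    def call(self, tool: str, args: dict[str, Any], fn: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        hit = self._store.get(key)
        if hit is not None and time.time() - hit[0] < self.ttl_s:
            return hit[1]                    # cache hit: skip the real call
        result = fn(**args)                  # cache miss: execute the tool
        self._store[key] = (time.time(), result)
        return result

def fake_web_search(query: str) -> str:
    return f"results for {query!r}"  # stand-in for a real search API

# Usage: repeated identical web searches within 60s are served locally.
cache = ToolResultCache(ttl_s=60)
print(cache.call("web_search", {"query": "prompt caching"}, fake_web_search))
```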
- Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory [69.49061918994882]
Branch-and-Browse is a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods.
arXiv Detail & Related papers (2025-10-18T00:45:37Z)
- Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Model. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z)
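The contrast with exact-match caching is easy to see in code. Below is a hedged sketch of the lookup side of a semantic cache: a query hits if some cached query's embedding is within a cosine-similarity threshold. The embedding function and threshold are placeholder assumptions; the paper's contribution is the learned eviction policy on top of such a lookup, which is not reproduced here.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: nearest-neighbor lookup over query embeddings."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # function: str -> unit-norm 1-D np.ndarray
        self.threshold = threshold  # cosine similarity required for a hit
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.stack(self.keys) @ q  # cosine sims (unit-norm vectors)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.keys.append(self.embed(query))
        self.values.append(response)

# An exact-match cache would miss "What's the capital of France?" after caching
# "Capital of France?"; a semantic cache with a suitable threshold can hit.
```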
- ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models [33.729482204460815]
This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches, then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for conversational applications.
arXiv Detail & Related papers (2025-06-28T07:25:12Z)
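A rough sketch of the two-stage idea: stage one retrieves candidates by embedding the current query alone; stage two rescores candidates with dialogue history folded in. Here stage two is simplified to cosine rescoring over an embedding of the concatenated recent turns; this placeholder does not reproduce ContextCache's self-attention matcher, and all parameters are illustrative.

```python
import numpy as np

def two_stage_lookup(query, history, cache, embed, k=5, threshold=0.85):
    """Simplified two-stage contextual cache lookup.

    cache: list of (query_vec, context_vec, response) triples.
    embed: str -> unit-norm vector. Stage 2 is plain cosine rescoring,
    a stand-in for ContextCache's self-attention matcher.
    """
    if not cache:
        return None
    q = embed(query)
    # Stage 1: coarse retrieval on the current query only.
    sims = np.array([qv @ q for qv, _, _ in cache])
    candidates = np.argsort(sims)[-k:]
    # Stage 2: rescore candidates against the dialogue context.
    ctx = embed(" ".join(history[-3:] + [query]))
    best_i, best_s = None, threshold
    for i in candidates:
        s = cache[i][1] @ ctx
        if s >= best_s:
            best_i, best_s = int(i), float(s)
    return cache[best_i][2] if best_i is not None else None
```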
- A Generative Caching System for Large Language Models [1.2132389187658934]
Caching has the potential to be of significant benefit for accessing large language models (LLMs). This paper presents a new caching system for improving user experiences with LLMs. A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries which have never been seen before.
arXiv Detail & Related papers (2025-03-22T01:17:56Z)
- Auditing Prompt Caching in Language Model APIs [77.02079451561718]
We investigate the privacy leakage caused by prompt caching in large language models (LLMs). We detect global cache sharing across users in seven API providers, including OpenAI. We find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
arXiv Detail & Related papers (2025-02-11T18:58:04Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- MeanCache: User-Centric Semantic Caching for LLM Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services. MeanCache identifies semantically similar queries to determine cache hit or miss.
arXiv Detail & Related papers (2024-03-05T06:23:50Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching. While approximate cache hits alleviate DL inference workload and increase system throughput, they also introduce an approximation error. We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
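The trade-off is visible in a few lines: quantizing keys makes near-identical inputs collide on purpose, trading a bounded approximation error for extra hits. The sketch below uses simple coordinate quantization as the approximate key with LRU eviction; the paper targets learned feature vectors in a DL classification pipeline, and this stand-in does not model its error-control analysis.

```python
import numpy as np
from collections import OrderedDict

class ApproxKeyCache:
    """Toy approximate-key cache with LRU eviction.

    Feature vectors are quantized to a grid of size `cell`; any two inputs
    falling in the same cell share a cache entry, so colliding inputs differ
    by less than `cell` per coordinate (a bounded approximation error).
    """

    def __init__(self, cell: float = 0.1, capacity: int = 1024):
        self.cell = cell
        self.capacity = capacity
        self._store: OrderedDict[tuple, object] = OrderedDict()

    def _key(self, x: np.ndarray) -> tuple:
        return tuple(np.round(x / self.cell).astype(int).tolist())

    def get(self, x: np.ndarray):
        key = self._key(x)
        if key in self._store:
            self._store.move_to_end(key)  # refresh LRU position
            return self._store[key]       # approximate hit
        return None

    def put(self, x: np.ndarray, label) -> None:
        key = self._key(x)
        self._store[key] = label
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```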
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z)