Related papers: Generative Caching for Structurally Similar Prompts and Responses

URL: http://arxiv.org/abs/2511.17565v1
Date: Fri, 14 Nov 2025 00:22:00 GMT
Title: Generative Caching for Structurally Similar Prompts and Responses
Authors: Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta
Abstract summary: Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. We introduce a generative cache that produces variation-aware responses for structurally similar prompts.
Score: 15.50345473013337
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce a generative cache that produces variation-aware responses for structurally similar prompts. It identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that it achieves an 83% cache hit rate, while having minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by ~20% and reduces end-to-end execution latency by ~34% compared to standard prompt matching.
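
Illustration: the abstract contrasts exact prompt matching with a cache that reuses a response pattern across structurally similar prompts. The Python sketch below is a toy illustration of that contrast only and is not the paper's method. The class names (ExactCache, TemplateCache), the example prompts, and the add_to_cart response format are all hypothetical. In particular, the prompt pattern here is written by hand as a regular expression, whereas the abstract describes identifying reusable patterns from the prompts and responses themselves.

```python
import re


class ExactCache:
    """Baseline: a hit requires the prompt string to match exactly."""

    def __init__(self):
        self._store = {}

    def put(self, prompt, response):
        self._store[prompt] = response

    def get(self, prompt):
        return self._store.get(prompt)


class TemplateCache:
    """Toy structure-aware cache (hypothetical, hand-written patterns).

    Each entry pairs a prompt pattern containing named slots with a
    response template that reuses the same slot names.
    """

    def __init__(self):
        self._entries = []  # list of (compiled pattern, response template)

    def put(self, prompt_pattern, response_template):
        self._entries.append((re.compile(prompt_pattern), response_template))

    def get(self, prompt):
        for pattern, template in self._entries:
            match = pattern.fullmatch(prompt)
            if match:
                # Fill the response template with the values that vary.
                return template.format(**match.groupdict())
        return None  # cache miss: the caller would fall back to the LLM


if __name__ == "__main__":
    exact = ExactCache()
    exact.put("Add item 'milk' to cart 42", "add_to_cart(cart_id=42, item='milk')")
    print(exact.get("Add item 'eggs' to cart 7"))  # miss: strings differ

    templated = TemplateCache()
    templated.put(
        r"Add item '(?P<item>[^']+)' to cart (?P<cart>\d+)",
        "add_to_cart(cart_id={cart}, item='{item}')",
    )
    print(templated.get("Add item 'eggs' to cart 7"))  # hit with new values
    print(templated.get("Remove item 'eggs' from cart 7"))  # different structure: miss
```

The exact cache misses on the second prompt because one item name and one cart number changed, while the template cache returns a response customized to the new values. A prompt with a different structure still misses, which is the behavior a caller needs in order to fall back to the model.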

Related papers

Asynchronous Verified Semantic Caching for Tiered LLM Architectures [0.7204795910838664]
Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions.
arXiv Detail & Related papers (2026-02-13T18:25:00Z)
AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z)
SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems [4.029672905329379]
We introduce SemanticALLI, a pipeline-aware architecture within PMG's marketing intelligence platform. By decomposing generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), SemanticALLI elevates structured intermediate representations (IRs) to first-class, cacheable artifacts. Our structured approach allows for an additional stage, the Visualization Synthesis stage, to achieve an 83.10% hit rate, bypassing 4,023 LLM calls with a median latency of just 2.66 ms.
arXiv Detail & Related papers (2026-01-22T19:42:21Z)
Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the large language model. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z)
ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models [33.729482204460815]
This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for conversational applications.
arXiv Detail & Related papers (2025-06-28T07:25:12Z)
vCache: Verified Semantic Prompt Caching [95.16654660556975]
This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference [12.610067639587461]
We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts.
arXiv Detail & Related papers (2023-11-07T18:17:05Z)
Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
ARCH: Efficient Adversarial Regularized Training with Caching [91.74682538906691]
Adversarial regularization can improve model generalization in many natural language processing tasks. We propose a new adversarial regularization method ARCH, where perturbations are generated and cached once every several epochs. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks.
arXiv Detail & Related papers (2021-09-15T02:05:37Z)
Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers. We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.