ToolCaching: Towards Efficient Caching for LLM Tool-calling
- URL: http://arxiv.org/abs/2601.15335v1
- Date: Tue, 20 Jan 2026 09:25:59 GMT
- Title: ToolCaching: Towards Efficient Caching for LLM Tool-calling
- Authors: Yi Zhai, Dian Shen, Junzhou Luo, Bin Yang
- Abstract summary: Caching is a classic solution to the problem of redundant or repeated tool-calling requests. We propose ToolCaching, an efficient feature-driven and adaptive caching framework. ToolCaching achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies.
- Score: 13.738787213936225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to interact with external APIs, greatly enhancing their practical utility. While prior research has improved tool-calling performance by adopting traditional computer systems techniques, such as parallel and asynchronous execution, the challenge of redundant or repeated tool-calling requests remains largely unaddressed. Caching is a classic solution to this problem, but applying it to LLM tool-calling introduces new difficulties due to heterogeneous request semantics, dynamic workloads, and varying freshness requirements, which render conventional cache policies ineffective. To address these issues, we propose ToolCaching, an efficient feature-driven and adaptive caching framework for LLM tool-calling systems. ToolCaching systematically integrates semantic and system-level features to evaluate request cacheability and estimate caching value. At its core, the VAAC algorithm integrates bandit-based admission with value-driven, multi-factor eviction, jointly accounting for request frequency, recency, and caching value. Extensive experiments on synthetic and public tool-calling workloads demonstrate that ToolCaching with VAAC achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies, effectively accelerating LLM tool-calling in practical applications.
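The abstract describes VAAC only at a high level; the exact scoring function and bandit formulation are not given here. As a minimal sketch of the general idea, the Python snippet below combines an epsilon-greedy (bandit-style) admission decision per tool with eviction of the entry whose combined frequency/recency/value score is lowest. All names (ToolCallCache, CacheEntry, the thresholds and weights) are illustrative assumptions, not the paper's actual VAAC algorithm.

```python
import time
import random
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    response: object
    value: float                 # estimated caching value (e.g. saved tool latency)
    freq: int = 1                # hit count
    last_used: float = field(default_factory=time.time)

class ToolCallCache:
    """Hypothetical value-aware cache: epsilon-greedy admission per tool,
    multi-factor (frequency, recency, value) eviction. Illustrative only."""

    def __init__(self, capacity=256, epsilon=0.1):
        self.capacity = capacity
        self.epsilon = epsilon
        self.store = {}           # key -> CacheEntry
        self.admit_stats = {}     # tool -> [observed_hits, admissions]

    def _key(self, tool, args):
        return (tool, tuple(sorted(args.items())))

    def get(self, tool, args):
        entry = self.store.get(self._key(tool, args))
        if entry is None:
            return None
        entry.freq += 1
        entry.last_used = time.time()
        hits, _ = self.admit_stats.setdefault(tool, [0, 1])
        self.admit_stats[tool][0] = hits + 1     # reward signal for the admission bandit
        return entry.response

    def put(self, tool, args, response, value):
        hits, admissions = self.admit_stats.setdefault(tool, [0, 1])
        # Epsilon-greedy admission: usually reject tools with a poor observed
        # hit rate, but explore (admit anyway) with probability epsilon.
        if random.random() > self.epsilon and hits / admissions < 0.05:
            return
        self.admit_stats[tool][1] = admissions + 1
        if len(self.store) >= self.capacity:
            self._evict()
        self.store[self._key(tool, args)] = CacheEntry(response, value)

    def _evict(self):
        now = time.time()
        # Score = frequency * value / age; evict the lowest-scoring entry.
        victim = min(self.store,
                     key=lambda k: self.store[k].freq * self.store[k].value
                                   / (now - self.store[k].last_used + 1e-6))
        del self.store[victim]
```

A hit in get() doubles as the reward signal for the per-tool admission statistics. A real tool-calling cache would also need per-tool freshness or TTL handling, which is what the paper's feature-driven cacheability analysis is meant to capture.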
Related papers
- MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning [78.46301394559903]
Large Language Models (LLMs) are increasingly used for long-duration tasks. Current methods face a trade-off between cost and accuracy. MemSifter is a novel framework that offloads the memory retrieval process to a small-scale proxy model.
arXiv Detail & Related papers (2026-03-03T02:57:38Z) - Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks [1.2292307778008844]
We present a comprehensive evaluation of prompt caching across three major Large Language Model (LLM) providers. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers.
arXiv Detail & Related papers (2026-01-09T18:41:57Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Models. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses [2.1604594801267667]
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. We present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts.
arXiv Detail & Related papers (2025-07-31T15:50:57Z) - IC-Cache: Efficient Large Language Model Serving via In-context Caching [16.75800945078601]
IC-Cache is a caching system that enables live capability augmentation to improve serving efficiency. We show that IC-Cache improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality.
arXiv Detail & Related papers (2025-01-22T07:52:38Z) - GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models (see the embedding-lookup sketch after this list). Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%.
arXiv Detail & Related papers (2024-11-08T02:21:19Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - Anchor-based Large Language Models [33.86392289481657]
This study introduces Anchor-based LLMs (AnLLMs), which utilize an anchor-based self-attention network (AnSAN) and an anchor-based inference strategy.
AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference.
arXiv Detail & Related papers (2024-02-12T12:48:02Z) - Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
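Several of the entries above (notably the semantic-caching and GPT Semantic Cache papers) rest on the same basic mechanism: embed the incoming query, compare it against cached query embeddings, and return the stored response when the similarity clears a threshold. The sketch below shows that lookup pattern in Python with a plain in-memory list; the embed callable, the 0.9 threshold, and the SemanticCache name are placeholder assumptions, not any of the cited systems' actual APIs.

```python
import numpy as np

class SemanticCache:
    """Minimal embedding-similarity cache: returns a stored response when the
    cosine similarity of query embeddings exceeds a threshold. Illustrative only."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> 1-D numpy array
        self.threshold = threshold
        self.entries = []           # list of (unit-norm embedding, response)

    def _normalize(self, query):
        q = self.embed(query)
        return q / (np.linalg.norm(q) + 1e-12)

    def lookup(self, query):
        q = self._normalize(query)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = float(np.dot(q, emb))     # cosine similarity (unit-norm vectors)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None

    def insert(self, query, response):
        self.entries.append((self._normalize(query), response))
```

A production system would keep the embeddings in a vector index or Redis rather than a Python list, and would pair the similarity check with freshness rules of the kind ToolCaching's cacheability analysis argues tool-calling workloads require.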