Auditing Prompt Caching in Language Model APIs
- URL: http://arxiv.org/abs/2502.07776v1
- Date: Tue, 11 Feb 2025 18:58:04 GMT
- Title: Auditing Prompt Caching in Language Model APIs
- Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
- Abstract summary: We investigate the privacy leakage caused by prompt caching in large language models (LLMs).
We detect global cache sharing across users in seven API providers, including OpenAI.
We find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
- Score: 77.02079451561718
- Abstract: Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
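The audit idea above boils down to a timing comparison: warm any server-side cache with a victim prompt, then test whether repeated requests for that prompt return significantly faster than requests for fresh prompts of similar length. Below is a minimal sketch of such a check; the endpoint URL, request fields, and the choice of a one-sided Mann-Whitney U test are illustrative assumptions, not the paper's exact audit procedure.

```python
import time
import requests  # any HTTP client works; requests is assumed here
from scipy.stats import mannwhitneyu

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}        # hypothetical credentials

def time_request(prompt: str) -> float:
    """Return the wall-clock latency of a single API call for `prompt`."""
    start = time.perf_counter()
    requests.post(API_URL, headers=HEADERS, json={
        "model": "example-model",              # hypothetical model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1,                       # keep generation short so timing reflects prompt processing
    }, timeout=60)
    return time.perf_counter() - start

def audit_prompt_caching(victim_prompt: str, fresh_prompts: list[str], trials: int = 25) -> float:
    """Compare latencies of a repeated (possibly cached) prompt vs. never-seen prompts.

    Returns a one-sided Mann-Whitney U p-value for the hypothesis that
    repeated-prompt latencies are stochastically smaller (consistent with cache hits).
    """
    time_request(victim_prompt)  # first request warms any server-side prompt cache
    cached_times = [time_request(victim_prompt) for _ in range(trials)]
    fresh_times = [time_request(p) for p in fresh_prompts[:trials]]
    _, p_value = mannwhitneyu(cached_times, fresh_times, alternative="less")
    return p_value
```

A small p-value indicates the repeated prompt is answered faster than fresh prompts, consistent with a cache hit; the paper's actual audits additionally control for prompt length and request ordering, and distinguish per-user caching from the cross-user (global) cache sharing reported above.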
Related papers
- On the Differential Privacy and Interactivity of Privacy Sandbox Reports [78.21466601986265]
The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities.
We provide a formal model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee.
arXiv Detail & Related papers (2024-12-22T08:22:57Z)
- InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks [9.748438507132207]
Large language models (LLMs) possess extensive knowledge and question-answering capabilities.
Cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests.
We propose a novel timing-based side-channel attack to execute input theft in LLM inference.
arXiv Detail & Related papers (2024-11-27T10:14:38Z)
- Prompt Tuning as User Inherent Profile Inference Machine [53.78398656789463]
We propose UserIP-Tuning, which uses prompt-tuning to infer user profiles.
A profile quantization codebook bridges the modality gap by quantizing profile embeddings into collaborative IDs.
Experiments on four public datasets show that UserIP-Tuning outperforms state-of-the-art recommendation algorithms.
arXiv Detail & Related papers (2024-08-13T02:25:46Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Hidden Web Caches Discovery [3.9272151228741716]
This paper presents a novel methodology for cache detection using timing analysis.
Our approach eliminates the dependency on cache status headers, making it applicable to any web server.
arXiv Detail & Related papers (2024-07-23T08:58:06Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries.
This paper introduces MeanCache, a user-centric semantic cache for LLM-based services.
MeanCache identifies semantically similar queries to determine cache hit or miss (a sketch of this hit-or-miss idea appears after this list).
arXiv Detail & Related papers (2024-03-05T06:23:50Z)
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference [12.610067639587461]
We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different prompts.
Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules.
We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts.
arXiv Detail & Related papers (2023-11-07T18:17:05Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z)
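Several entries above (SCALM, MeanCache) describe semantic caching, where a new query counts as a cache hit if its embedding is close enough to that of a previously answered query. The sketch below illustrates that hit-or-miss decision; the embedding function, cosine-similarity measure, and threshold value are illustrative assumptions rather than the specific designs of those systems.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is sufficiently similar to a cached query's embedding."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn      # maps text -> 1-D numpy array (assumed interface)
        self.threshold = threshold    # cosine-similarity cutoff (illustrative value)
        self.entries = []             # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def lookup(self, query: str):
        """Return a cached response if any stored query is similar enough, else None."""
        q = self.embed_fn(query)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = self._cosine(q, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None

    def insert(self, query: str, response: str) -> None:
        """Store the query embedding and its response for future reuse."""
        self.entries.append((self.embed_fn(query), response))
```

On a miss the service would query the LLM and then call insert() with the new pair; per its title, MeanCache applies this idea in a user-centric setting, while SCALM focuses on identifying which cache entries and patterns are worth keeping.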
This list is automatically generated from the titles and abstracts of the papers on this site.