Auditing Prompt Caching in Language Model APIs
- URL: http://arxiv.org/abs/2502.07776v1
- Date: Tue, 11 Feb 2025 18:58:04 GMT
- Title: Auditing Prompt Caching in Language Model APIs
- Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
- Abstract summary: We investigate the privacy leakage caused by prompt caching in large language models (LLMs).
We detect global cache sharing across users in seven API providers, including OpenAI.
We find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
- Score: 77.02079451561718
- Abstract: Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
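The audit idea above boils down to a timing comparison: warm any server-side cache with a victim prompt, then test whether repeated requests for that prompt return significantly faster than requests for fresh prompts of similar length. Below is a minimal sketch of such a check; the endpoint URL, request fields, and the choice of a one-sided Mann-Whitney U test are illustrative assumptions, not the paper's exact audit procedure.

```python
import time
import requests  # any HTTP client works; requests is assumed here
from scipy.stats import mannwhitneyu

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}        # hypothetical credentials

def time_request(prompt: str) -> float:
    """Return the wall-clock latency of a single API call for `prompt`."""
    start = time.perf_counter()
    requests.post(API_URL, headers=HEADERS, json={
        "model": "example-model",              # hypothetical model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1,                       # keep generation short so timing reflects prompt processing
    }, timeout=60)
    return time.perf_counter() - start

def audit_prompt_caching(victim_prompt: str, fresh_prompts: list[str], trials: int = 25) -> float:
    """Compare latencies of a repeated (possibly cached) prompt vs. never-seen prompts.

    Returns a one-sided Mann-Whitney U p-value for the hypothesis that
    repeated-prompt latencies are stochastically smaller (consistent with cache hits).
    """
    time_request(victim_prompt)  # first request warms any server-side prompt cache
    cached_times = [time_request(victim_prompt) for _ in range(trials)]
    fresh_times = [time_request(p) for p in fresh_prompts[:trials]]
    _, p_value = mannwhitneyu(cached_times, fresh_times, alternative="less")
    return p_value
```

A small p-value indicates the repeated prompt is answered faster than fresh prompts, consistent with a cache hit; the paper's actual audits additionally control for prompt length and request ordering, and distinguish per-user caching from the cross-user (global) cache sharing reported above.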
Related papers
- On the Differential Privacy and Interactivity of Privacy Sandbox Reports [78.21466601986265]
The Privacy Sandbox initiative from Google includes APIs for enabling privacy-preserving advertising functionalities.
We provide a formal model for analyzing the privacy of these APIs and show that they satisfy a formal DP guarantee.
arXiv Detail & Related papers (2024-12-22T08:22:57Z)
- InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks [9.748438507132207]
Large language models (LLMs) possess extensive knowledge and question-answering capabilities.
Cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests.
We propose a novel timing-based side-channel attack to execute input theft in LLM inference.
arXiv Detail & Related papers (2024-11-27T10:14:38Z)
- Prompt Tuning as User Inherent Profile Inference Machine [53.78398656789463]
We propose UserIP-Tuning, which uses prompt-tuning to infer user profiles.
A profile quantization codebook bridges the modality gap by quantizing profile embeddings into collaborative IDs.
Experiments on four public datasets show that UserIP-Tuning outperforms state-of-the-art recommendation algorithms.
arXiv Detail & Related papers (2024-08-13T02:25:46Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Hidden Web Caches Discovery [3.9272151228741716]
This paper presents a novel methodology for cache detection using timing analysis.
Our approach eliminates the dependency on cache status headers, making it applicable to any web server.
arXiv Detail & Related papers (2024-07-23T08:58:06Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries.
This paper introduces MeanCache, a user-centric semantic cache for LLM-based services.
MeanCache identifies semantically similar queries to determine cache hit or miss (a sketch of this hit-or-miss idea appears after this list).
arXiv Detail & Related papers (2024-03-05T06:23:50Z)
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference [12.610067639587461]
We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different prompts.
Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules.
We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts.
arXiv Detail & Related papers (2023-11-07T18:17:05Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z)
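Several entries above (SCALM, MeanCache) describe semantic caching, where a new query counts as a cache hit if its embedding is close enough to that of a previously answered query. The sketch below illustrates that hit-or-miss decision; the embedding function, cosine-similarity measure, and threshold value are illustrative assumptions rather than the specific designs of those systems.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is sufficiently similar to a cached query's embedding."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn      # maps text -> 1-D numpy array (assumed interface)
        self.threshold = threshold    # cosine-similarity cutoff (illustrative value)
        self.entries = []             # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def lookup(self, query: str):
        """Return a cached response if any stored query is similar enough, else None."""
        q = self.embed_fn(query)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = self._cosine(q, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None

    def insert(self, query: str, response: str) -> None:
        """Store the query embedding and its response for future reuse."""
        self.entries.append((self.embed_fn(query), response))
```

On a miss the service would query the LLM and then call insert() with the new pair; per its title, MeanCache applies this idea in a user-centric setting, while SCALM focuses on identifying which cache entries and patterns are worth keeping.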
This list is automatically generated from the titles and abstracts of the papers on this site.