Efficient Prompt Caching via Embedding Similarity
- URL: http://arxiv.org/abs/2402.01173v1
- Date: Fri, 2 Feb 2024 06:34:11 GMT
- Title: Efficient Prompt Caching via Embedding Similarity
- Authors: Hanlin Zhu, Banghua Zhu, Jiantao Jiao
- Abstract summary: We focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity.
We propose a distillation-based method to fine-tune the existing embeddings for better caching prediction.
We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the previous embedding model.
- Score: 26.456212783693545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved huge success in numerous natural
language processing (NLP) tasks. However, they face the challenge of significant
resource consumption during inference. In this paper, we aim to improve the
inference efficiency of LLMs by prompt caching, i.e., if the current prompt can
be answered by the same response as a previous prompt, one can directly utilize
that previous response without calling the LLM. Specifically, we focus on the
prediction accuracy of prompt caching for single-round question-answering tasks
via embedding similarity. The existing embeddings of prompts mostly focus on
whether two prompts are semantically similar, which is not necessarily
equivalent to whether the same response can answer them. Therefore, we propose
a distillation-based method to fine-tune the existing embeddings for better
caching prediction. Theoretically, we provide finite-sample guarantees for the
convergence of our method under different types of loss functions. Empirically,
we carefully construct a hard dataset based on Kwiatkowski et al. (2019) where
the existing embedding model (Wang et al., 2022) only achieves an AUC of 0.51.
We then fine-tune the above embedding model, which significantly improves the
AUC of caching prediction from 0.51 to 0.81. We also conduct simulations
demonstrating that our trained models achieve better caching efficiency than
the previous embedding model.
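As a rough illustration of the caching mechanism described in the abstract (not the paper's implementation), the following sketch reuses a cached response whenever the cosine similarity between the new prompt's embedding and a cached prompt's embedding exceeds a threshold; `embed`, `llm`, and the threshold value are placeholders, not artifacts from the paper:

```python
import numpy as np

class EmbeddingPromptCache:
    """Minimal prompt cache: reuse a previous response when the new prompt's
    embedding is close enough (cosine similarity) to a cached prompt's
    embedding, instead of calling the LLM again."""

    def __init__(self, embed, llm, threshold=0.85):
        self.embed = embed          # embedding model: str -> np.ndarray
        self.llm = llm              # LLM call: str -> str
        self.threshold = threshold  # similarity cutoff (trades cache hits vs. wrong reuse)
        self.keys = []              # unit-normalized embeddings of cached prompts
        self.responses = []         # cached responses, aligned with self.keys

    def query(self, prompt):
        q = self.embed(prompt)
        q = q / np.linalg.norm(q)
        if self.keys:
            sims = np.stack(self.keys) @ q       # cosine similarities (unit vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]      # cache hit: skip the LLM call
        response = self.llm(prompt)              # cache miss: call the LLM
        self.keys.append(q)
        self.responses.append(response)
        return response
```

The paper's contribution is then to fine-tune the embedding model (via distillation) so that high similarity actually predicts "answerable by the same response" rather than mere semantic similarity, which is what moves the caching-prediction AUC from 0.51 to 0.81 on their constructed dataset.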
Related papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
- An Efficient Inference Framework for Early-exit Large Language Models [5.048467183620882]
Early-exit models improve the inference efficiency of LLMs by skipping the remaining layers and directly generating output tokens once the model is confident enough.
However, no existing LLM inference framework takes early-exit models into consideration.
We solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management.
arXiv Detail & Related papers (2024-07-25T07:50:17Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: through a caching mechanism, the computation of a large proportion of layers in the diffusion transformer can be readily removed, even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Adaptive Sparse Gaussian Process [0.0]
We propose the first adaptive sparse Gaussian Process (GP) able to address all these issues.
We first reformulate a variational sparse GP algorithm to make it adaptive through a forgetting factor.
We then propose updating a single inducing point of the sparse GP model together with the remaining model parameters every time a new sample arrives.
arXiv Detail & Related papers (2023-02-20T21:34:36Z)
- Post-Processing Temporal Action Detection [134.26292288193298]
Temporal Action Detection (TAD) methods typically take a pre-processing step that converts an input video of varying length into a fixed-length snippet representation sequence.
This pre-processing step temporally downsamples the video, reducing the inference resolution and hampering detection performance at the original temporal resolution.
We introduce a novel model-agnostic post-processing method without model redesign and retraining.
arXiv Detail & Related papers (2022-11-27T19:50:37Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Embarrassingly Simple Performance Prediction for Abductive Natural Language Inference [10.536415845097661]
We propose a method for predicting the performance of NLI models without fine-tuning them.
We show that the accuracy of the cosine similarity approach correlates strongly with the accuracy of the classification approach with a Pearson correlation coefficient of 0.65.
Our method can lead to significant time savings in the process of model selection.
arXiv Detail & Related papers (2022-02-21T18:10:24Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching (a minimal illustrative sketch of the idea appears after this list).
While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model the caching system performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Learning Dense Representations of Phrases at Scale [22.792942611601347]
We show for the first time that we can learn dense phrase representations alone that achieve much stronger performance in open-domain QA.
Our model DensePhrases improves previous phrase retrieval models by 15%-25% absolute accuracy.
Our model is easy to parallelize due to pure dense representations and processes more than 10 questions per second on CPUs.
arXiv Detail & Related papers (2020-12-23T12:28:17Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
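As referenced in the approximate-key caching entry above, the sketch below illustrates that idea under stated assumptions (it is not the paper's exact keying scheme): input features are coarsely quantized into a cache key, so similar inputs collide onto the same key and reuse a cached classification result under classic LRU eviction, at the cost of an approximation error. The `classify` callable and `grid` step are hypothetical.

```python
from collections import OrderedDict
import numpy as np

class ApproximateKeyLRUCache:
    """Illustrative approximate-key cache: quantize the input features into a
    coarse key so that similar inputs hit the same cache entry; evict entries
    with a classic LRU policy."""

    def __init__(self, classify, capacity=1024, grid=0.1):
        self.classify = classify   # expensive DL classifier to shortcut
        self.capacity = capacity   # maximum number of cached entries
        self.grid = grid           # quantization step: coarser -> more (approximate) hits
        self.cache = OrderedDict()

    def _key(self, x):
        # Coarse quantization of the feature vector serves as the approximate key.
        return tuple(np.round(np.asarray(x) / self.grid).astype(int).tolist())

    def query(self, x):
        k = self._key(x)
        if k in self.cache:
            self.cache.move_to_end(k)         # approximate hit: refresh recency
            return self.cache[k]
        y = self.classify(x)                  # miss: run the classifier
        self.cache[k] = y
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least-recently-used entry
        return y
```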