An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems
- URL: http://arxiv.org/abs/2507.07061v1
- Date: Tue, 08 Jul 2025 09:20:12 GMT
- Title: An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems
- Authors: Shervin Ghaffari, Zohre Bahranifard, Mohammad Akbari
- Abstract summary: This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection. We evaluate our method using the Quora Question Pairs dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times.
- Score: 4.364576564103288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic caching enhances the efficiency of large language model (LLM) systems by identifying semantically similar queries, storing responses once, and serving them for subsequent equivalent requests. However, existing semantic caching frameworks rely on single embedding models for query representation, which limits their ability to capture the diverse semantic relationships present in real-world query distributions. This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection in LLM caching systems. We evaluate our method using the Quora Question Pairs (QQP) dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times. Our ensemble approach achieves a 92% cache hit ratio for semantically equivalent queries while maintaining an 85% accuracy in correctly rejecting non-equivalent queries as cache misses. These results demonstrate that ensemble embedding methods significantly outperform single-model approaches in distinguishing between semantically similar and dissimilar queries, leading to more effective caching performance and reduced computational overhead in LLM-based systems.
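For concreteness, here is a minimal sketch of the overall idea: two embedding models are concatenated and passed through a meta-encoder, and the resulting vector drives cache hit/miss decisions. The stub embedders, the untrained linear meta-encoder, and the 0.9 threshold are illustrative assumptions, not the paper's configuration (in the paper, the meta-encoder is trained).

```python
import numpy as np

# Hypothetical stand-ins for two real embedding models (in practice these
# would be distinct pretrained encoders, e.g. sentence-transformer models).
def _stub_embed(tag: str, text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash((tag, text))) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def embed_a(text: str) -> np.ndarray:
    return _stub_embed("a", text)

def embed_b(text: str) -> np.ndarray:
    return _stub_embed("b", text)

class MetaEncoder:
    """Maps the concatenated ensemble embedding into a shared space.

    Here it is a fixed random linear projection; in the paper this component
    is trained, so the training loop is omitted from this sketch."""
    def __init__(self, in_dim: int, out_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        z = x @ self.W
        return z / np.linalg.norm(z)

class SemanticCache:
    def __init__(self, encoder: MetaEncoder, threshold: float = 0.9):
        self.encoder = encoder
        self.threshold = threshold   # similarity required for a cache hit
        self.keys, self.values = [], []

    def _encode(self, query: str) -> np.ndarray:
        ensemble = np.concatenate([embed_a(query), embed_b(query)])
        return self.encoder(ensemble)

    def get(self, query: str):
        if not self.keys:
            return None              # cold cache: guaranteed miss
        q = self._encode(query)
        sims = np.stack(self.keys) @ q   # cosine similarity on unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.keys.append(self._encode(query))
        self.values.append(response)
```

On a hit, the stored response is served without invoking the LLM; on a miss, the query falls through to generation and the new pair is stored with `put`.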
Related papers
- TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses [1.7079407109348677]
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency.
We present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts.
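As a rough illustration of such routing (the threshold, prompt template, and function signatures are assumptions, not TweakLLM's actual design):

```python
from typing import Callable, Optional, Tuple

def route(prompt: str,
          cache_lookup: Callable[[str], Optional[Tuple[str, float]]],
          light_llm: Callable[[str], str],
          heavy_llm: Callable[[str], str],
          tailor_above: float = 0.8) -> str:
    """Hypothetical routing sketch: a cached response that is close enough is
    tailored to the new prompt by a lightweight model instead of being
    regenerated by the expensive model."""
    hit = cache_lookup(prompt)
    if hit is not None:
        cached_response, similarity = hit
        if similarity >= tailor_above:
            # Cheap model rewrites the cached answer for the new prompt.
            return light_llm(
                f"Adapt this answer to the new question.\n"
                f"Question: {prompt}\nAnswer to adapt: {cached_response}"
            )
    return heavy_llm(prompt)  # fall back to full generation
```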
arXiv Detail & Related papers (2025-07-31T15:50:57Z)
- ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models [33.729482204460815]
This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues.
ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching.
Cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for conversational applications.
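A toy sketch of the second-stage contextual matching (simplified attention pooling in place of full self-attention; dimensions and pooling details are assumptions):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_match_score(query_vec: np.ndarray,
                           history_vecs: np.ndarray,
                           candidate_vec: np.ndarray) -> float:
    """Pools the current query with prior dialogue turns via attention, then
    compares the pooled representation against a candidate cache key."""
    turns = np.vstack([history_vecs, query_vec[None, :]])  # (turns, dim)
    attn = softmax(turns @ query_vec)    # attention weights w.r.t. current query
    context = attn @ turns               # pooled dialogue representation
    context /= np.linalg.norm(context)
    return float(context @ candidate_vec)  # cosine-style match score
```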
arXiv Detail & Related papers (2025-06-28T07:25:12Z)
- vCache: Verified Semantic Prompt Caching [75.87215136638828]
This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees.
It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training.
Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines.
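A toy sketch of the general idea of per-prompt online threshold adaptation (not vCache's actual algorithm or guarantee mechanism):

```python
class PerPromptThreshold:
    """Adapts a decision threshold for one cached prompt from observed
    correctness feedback, aiming at a user-defined error rate."""
    def __init__(self, target_error: float = 0.05,
                 init: float = 0.9, step: float = 0.01):
        self.target_error = target_error
        self.threshold = init
        self.step = step
        self.served = 0    # hits served so far
        self.errors = 0    # hits later judged incorrect

    def accept(self, similarity: float) -> bool:
        return similarity >= self.threshold

    def feedback(self, was_correct: bool) -> None:
        self.served += 1
        self.errors += 0 if was_correct else 1
        # Tighten when the empirical error rate exceeds the target; relax otherwise.
        if self.errors / self.served > self.target_error:
            self.threshold = min(1.0, self.threshold + self.step)
        else:
            self.threshold = max(0.0, self.threshold - self.step)
```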
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
- Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs [51.33342412699939]
Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs.
Recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries.
We propose an effective Query Instruction Parsing Plugin (QIPP) that captures latent query patterns from code-like query instructions.
arXiv Detail & Related papers (2024-10-27T03:18:52Z)
- CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances that are semantically related to the query through the interaction of different modal data.
Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.
We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- MeanCache: User-Centric Semantic Caching for LLM Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries.
This paper introduces MeanCache, a user-centric semantic cache for LLM-based services.
MeanCache identifies semantically similar queries to determine cache hit or miss.
arXiv Detail & Related papers (2024-03-05T06:23:50Z)
- LLMs for Test Input Generation for Semantic Caches [1.8628177380024746]
Large language models (LLMs) enable state-of-the-art semantic capabilities to be added to software systems.
At scale, the cost of serving thousands of users increases massively and also degrades the user experience.
We present VaryGen, an approach that uses LLMs for test input generation, producing similar questions from unstructured text documents.
arXiv Detail & Related papers (2024-01-16T06:16:33Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
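One simplistic way to picture a combined dense-plus-lexicon relevance score (the fusion rule below is an assumption for illustration, not UnifieR's actual formulation):

```python
import numpy as np

def hybrid_score(q_dense: np.ndarray, d_dense: np.ndarray,
                 q_terms: list, d_term_weights: dict,
                 alpha: float = 0.5) -> float:
    """Fuses a dense semantic match with a lexical term-overlap score."""
    dense = float(q_dense @ d_dense)                             # semantic match
    lexical = sum(d_term_weights.get(t, 0.0) for t in q_terms)   # term overlap
    return alpha * dense + (1 - alpha) * lexical
```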
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
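A minimal sketch of approximate-key caching over an LRU store (the quantized-feature keying scheme is an assumed illustration; it controls the hit rate versus approximation error trade-off):

```python
from collections import OrderedDict
import numpy as np

class ApproxKeyLRUCache:
    """Inputs whose quantized feature vectors collide share one cached
    prediction; eviction follows least-recently-used order."""
    def __init__(self, capacity: int = 1024, grid: float = 0.25):
        self.capacity = capacity
        self.grid = grid          # coarser grid -> more hits, more error
        self.store = OrderedDict()

    def _key(self, features: np.ndarray) -> tuple:
        # Snap features to a grid so near-identical inputs map to one key.
        return tuple(np.round(features / self.grid).astype(int).tolist())

    def get(self, features: np.ndarray):
        k = self._key(features)
        if k in self.store:
            self.store.move_to_end(k)   # refresh LRU recency
            return self.store[k]
        return None

    def put(self, features: np.ndarray, prediction) -> None:
        k = self._key(features)
        self.store[k] = prediction
        self.store.move_to_end(k)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used entry
```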
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Query Focused Multi-Document Summarization with Distant Supervision [88.39032981994535]
Existing work relies heavily on retrieval-style methods for estimating the relevance between queries and text segments.
We propose a coarse-to-fine modeling framework that introduces separate modules for estimating whether segments are relevant to the query.
We demonstrate that our framework outperforms strong comparison systems on standard QFS benchmarks.
arXiv Detail & Related papers (2020-04-06T22:35:19Z)