Accelerating LLM Inference with Precomputed Query Storage
- URL: http://arxiv.org/abs/2509.25919v1
- Date: Tue, 30 Sep 2025 08:14:04 GMT
- Title: Accelerating LLM Inference with Precomputed Query Storage
- Authors: Jay H. Park, Youngju Cho, Choungsol Lee, Moonwook Oh, Euiseong Seo
- Abstract summary: StorInfer is a storage-assisted large language model (LLM) inference system. When a user query semantically matches a precomputed query, StorInfer bypasses expensive GPU inference and instantly returns the stored response.
- Score: 0.13048920509133805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) inference often suffers from high latency, particularly in resource-constrained environments such as on-device or edge deployments. To address this challenge, we present StorInfer, a novel storage-assisted LLM inference system that accelerates response time by precomputing and storing predictable query-response pairs offline. When a user query semantically matches a precomputed query, StorInfer bypasses expensive GPU inference and instantly returns the stored response, significantly reducing latency and compute costs. To maximize coverage and effectiveness, StorInfer employs an LLM-driven generator that adaptively produces diverse and deduplicated queries based on a given knowledge base. This is achieved via two techniques: adaptive query masking, which prevents regeneration of similar queries, and adaptive sampling, which dynamically tunes generation parameters to promote semantic diversity. The resulting query-response pairs are embedded and indexed using a disk-backed vector database to enable fast, similarity-based retrieval at runtime. Using this approach, we generated 150K unique precomputed pairs (taking up to 830 MB of storage space), achieving up to 17.3% latency reduction with no loss in response quality. Our evaluation across multiple QA datasets demonstrates the practicality and scalability of storage-assisted inference, especially in scenarios with predictable query distributions. StorInfer highlights a promising direction in leveraging storage as a primary enabler for efficient, low-latency LLM deployment.
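The runtime path described in the abstract (match the incoming query against precomputed queries, return the stored response on a hit, otherwise fall back to normal inference) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function, the in-memory `PrecomputedStore`, the 0.8 `threshold`, and the `run_llm` callback are all hypothetical stand-ins for a real sentence encoder, the disk-backed vector database, a tuned similarity cutoff, and the actual model call.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumed stand-in for a real sentence encoder)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PrecomputedStore:
    """In-memory stand-in for a disk-backed vector index of query-response pairs."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries: List[Tuple[Counter, str]] = []

    def add(self, query: str, response: str) -> None:
        """Store one precomputed pair (done offline in the described system)."""
        self.entries.append((embed(query), response))

    def lookup(self, query: str) -> Optional[str]:
        """Return the stored response of the closest query, or None on a miss."""
        if not self.entries:
            return None
        q = embed(query)
        score, response = max(
            ((cosine(q, e), r) for e, r in self.entries), key=lambda pair: pair[0]
        )
        return response if score >= self.threshold else None


def answer(
    query: str, store: PrecomputedStore, run_llm: Callable[[str], str]
) -> Tuple[str, bool]:
    """Serve from storage on a hit; otherwise call the (expensive) model."""
    hit = store.lookup(query)
    if hit is not None:
        return hit, True
    return run_llm(query), False


if __name__ == "__main__":
    store = PrecomputedStore()
    store.add("What is the capital of France?", "Paris.")
    # Placeholder model call; a real deployment would run GPU inference here.
    print(answer("what is the capital of france", store, lambda q: "(model output)"))
    print(answer("How tall is Mount Everest?", store, lambda q: "(model output)"))
```

The linear scan in `lookup` is only for readability; a system at the scale mentioned in the abstract would presumably rely on an approximate nearest-neighbor index rather than comparing against every stored query.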
Related papers
- AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z)
- SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z)
- Rethinking On-policy Optimization for Query Augmentation [49.87723664806526]
We present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks. We introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which learns to generate a pseudo-document that maximizes retrieval performance.
arXiv Detail & Related papers (2025-10-20T04:16:28Z)
- Hybrid Deep Searcher: Integrating Parallel and Sequential Search Reasoning [57.78245296980122]
We introduce HDS-QA (Hybrid Deep Search QA), a dataset automatically generated from Natural Questions. It comprises hybrid-hop questions that combine parallelizable independent subqueries (executable simultaneously) and sequentially dependent subqueries (requiring step-by-step resolution). We name the model HybridDeepSearcher, which outperforms state-of-the-art baselines across multiple benchmarks.
arXiv Detail & Related papers (2025-08-26T15:15:17Z)
- Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Models. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z)
- From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents [79.87304940020256]
Large Language Models (LLMs) have been widely adopted in conversational agents. MemGAS is a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks.
arXiv Detail & Related papers (2025-05-26T06:13:07Z)
- Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs [5.02504911036896]
Recent large language models (LLMs) face increasing inference latency as input context length and model size grow. This paper proposes a method to reduce TTFT by leveraging a disk-based key-value (KV) cache to lessen the computational burden during the prefill stage. We also introduce a disk-based shared KV cache management system, called Shared RAG-DCache, for multi-instance LLM RAG service environments.
arXiv Detail & Related papers (2025-04-16T04:59:18Z)
- Leveraging Approximate Caching for Faster Retrieval-Augmented Generation [3.0111172730438565]
We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear. Our experiments demonstrate that Proximity with our LSH scheme and a realistically skewed MedRAG workload reduces database calls by 78.9% while maintaining database recall and test accuracy.
arXiv Detail & Related papers (2025-03-07T15:54:04Z)
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. This paper proposes an alternative paradigm, cache-augmented generation (CAG) that bypasses real-time retrieval.
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
- Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, situational queries require the agent to correctly identify multiple object-states and reach a consensus on their states for an answer. We introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information.
arXiv Detail & Related papers (2024-05-08T00:45:20Z)
- Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing [2.9733429388858714]
One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend.
We propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures.
We also propose the Attendre layer, a wait-to-attend mechanism by retrieving the key-value memory with evicted queries in the query memory.
arXiv Detail & Related papers (2024-01-10T02:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.