ESPN: Memory-Efficient Multi-Vector Information Retrieval
- URL: http://arxiv.org/abs/2312.05417v1
- Date: Sat, 9 Dec 2023 00:19:42 GMT
- Title: ESPN: Memory-Efficient Multi-Vector Information Retrieval
- Authors: Susav Shrestha, Narasimha Reddy, Zongwang Li
- Abstract summary: Multi-vector models amplify memory and storage requirements for retrieval indices by an order of magnitude.
We introduce Embedding from Storage Pipelined Network (ESPN), which offloads the entire re-ranking embedding tables to SSDs and reduces memory requirements by 5-16x.
We design a software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval by up to 6.4x, and demonstrate near-memory levels of query latency even for large query batch sizes.
- Score: 0.36832029288386137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in large language models have demonstrated remarkable
effectiveness in information retrieval (IR) tasks. While many neural IR systems
encode queries and documents into single-vector representations, multi-vector
models elevate the retrieval quality by producing multi-vector representations
and facilitating similarity searches at the granularity of individual tokens.
However, these models significantly amplify memory and storage requirements for
retrieval indices by an order of magnitude. This escalation in index size
renders the scalability of multi-vector IR models progressively challenging due
to their substantial memory demands. We introduce Embedding from Storage
Pipelined Network (ESPN), in which we offload the entire re-ranking embedding
tables to SSDs and reduce the memory requirements by 5-16x. We design a
software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval
by up to 6.4x, and demonstrate that we can maintain near-memory levels of query
latency even for large query batch sizes.
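To make the pipeline concrete, here is a minimal sketch of the idea under stated assumptions: the re-ranking embedding table lives on SSD (modeled with a memory-mapped file), and a software prefetcher copies the embeddings of predicted candidates into RAM while the first retrieval stage is still running. All names, sizes, and the file layout are hypothetical; the paper's actual system overlaps GPU candidate generation with NVMe reads.

```python
import numpy as np
import threading

# Hypothetical layout: a fixed-size block of token embeddings per document,
# stored on SSD and memory-mapped (stands in for the real index files).
N_DOCS, TOKENS_PER_DOC, DIM = 10_000, 32, 128
table = np.memmap("doc_embeddings.f32", dtype=np.float32, mode="w+",
                  shape=(N_DOCS, TOKENS_PER_DOC, DIM))

def prefetch(doc_ids, table, cache):
    # Software prefetcher: pull likely re-ranking candidates from SSD into
    # RAM while the first-stage candidate generation is still running.
    for d in doc_ids:
        cache[d] = np.array(table[d])        # forces the actual SSD read

def rerank(query_tokens, doc_ids, cache, table):
    scores = {}
    for d in doc_ids:
        emb = cache.get(d)
        if emb is None:                      # prefetch miss: pay SSD latency now
            emb = np.array(table[d])
        # ColBERT-style late interaction: MaxSim per query token, then sum.
        scores[d] = (query_tokens @ emb.T).max(axis=1).sum()
    return sorted(scores, key=scores.get, reverse=True)

# The prefetcher works from the approximate first stage's predictions, which
# overlap heavily with the final candidates (>90% hit rate in the paper).
predicted, actual, cache = list(range(100)), list(range(5, 105)), {}
t = threading.Thread(target=prefetch, args=(predicted, table, cache))
t.start()
query = np.random.randn(16, DIM).astype(np.float32)
t.join()
print(rerank(query, actual, cache, table)[:5])
```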
Related papers
- Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling [5.232135930253723]
Multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR.
However, the storage and memory needed to hold the large number of associated vectors remain an important drawback.
We introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored.
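The pooling idea lends itself to a small illustration: cluster each document's token embeddings and store only the centroids, cutting the vector count by the pooling factor. The tiny k-means below is illustrative; the paper's exact clustering method and pooling factors may differ.

```python
import numpy as np

def pool_tokens(token_embs, pool_factor=2, iters=10):
    """Reduce a document's token vectors to n // pool_factor centroids
    with a tiny k-means (illustrative, not the paper's exact recipe)."""
    n, d = token_embs.shape
    k = max(1, n // pool_factor)
    rng = np.random.default_rng(0)
    centroids = token_embs[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # Assign each token vector to its nearest centroid.
        dists = ((token_embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = token_embs[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

doc = np.random.randn(64, 128).astype(np.float32)
pooled = pool_tokens(doc, pool_factor=4)     # 64 stored vectors -> 16
print(doc.shape, "->", pooled.shape)
```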
arXiv Detail & Related papers (2024-09-23T03:12:43Z)
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach that both accelerates attention computation and reduces GPU memory consumption.
Our evaluation shows that RetrievalAttention only needs to access 1--3% of data while maintaining high model accuracy.
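A hedged sketch of the attention-as-retrieval idea: find the few percent of cached keys most relevant to the current query and run exact softmax attention only on those. The brute-force top-k below stands in for the paper's CPU-side ANN index, and all sizes are made up.

```python
import numpy as np

def retrieval_attention(q, K, V, frac=0.02):
    """Attend to only the top `frac` of cached keys for query q.
    Brute-force top-k here stands in for a real ANN index."""
    k = max(1, int(K.shape[0] * frac))
    scores = K @ q                           # relevance of each cached key
    top = np.argpartition(scores, -k)[-k:]   # retrieve ~2% of the KV cache
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # exact softmax over retrieved keys
    return w @ V[top]

rng = np.random.default_rng(0)
K = rng.standard_normal((100_000, 64)).astype(np.float32)
V = rng.standard_normal((100_000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
print(retrieval_attention(q, K, V).shape)    # (64,)
```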
arXiv Detail & Related papers (2024-09-16T17:59:52Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce a method for optimizing the KV cache that substantially reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
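In the same spirit, a KV-cache eviction step can be sketched as follows: score each cached key-value pair by how much attention recent queries gave it, and keep only a budget of the most important ones. The scoring rule (peak recent attention) and the window size are assumptions for illustration, not CORM's exact criterion.

```python
import numpy as np

def evict_kv(K, V, recent_attn, budget):
    """Keep the `budget` cached key-value pairs that recent queries
    attended to most; drop the rest. Illustrative CORM-like policy."""
    importance = recent_attn.max(axis=0)       # peak recent attention per key
    keep = np.sort(np.argsort(importance)[-budget:])  # preserve position order
    return K[keep], V[keep], keep

rng = np.random.default_rng(1)
K, V = rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
recent_attn = rng.random((32, 4096))           # weights from the last 32 queries
K2, V2, kept = evict_kv(K, V, recent_attn, budget=1024)
print(K2.shape)                                # (1024, 64): ~75% of cache freed
```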
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- AiSAQ: All-in-Storage ANNS with Product Quantization for DRAM-free Information Retrieval [1.099532646524593]
DiskANN achieves a good recall-speed balance on large-scale datasets by using both RAM and storage.
Although it saves memory by loading vectors compressed with product quantization (PQ), its memory usage still grows in proportion to the dataset scale.
We propose All-in-Storage ANNS with Product Quantization (AiSAQ), which offloads the compressed vectors to storage.
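The DRAM-free idea can be illustrated with a toy example: the PQ codebooks (a few hundred kilobytes) stay in RAM, while the per-vector PQ codes are memory-mapped from storage and read only when a vector is actually visited. The codebooks here are random rather than trained, purely to show the data flow.

```python
import numpy as np

M, KS, DSUB = 8, 256, 16             # 8 subspaces, 256 centroids, 16 dims each
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, KS, DSUB)).astype(np.float32)  # stays in RAM

# PQ codes for one million vectors live on storage, not in DRAM.
N = 1_000_000
codes = np.memmap("pq_codes.u8", dtype=np.uint8, mode="w+", shape=(N, M))
codes[:] = rng.integers(0, KS, size=(N, M), dtype=np.uint8)  # untrained, demo only

def approx_distance(query, vec_id):
    """Asymmetric PQ distance, reading this vector's code from storage."""
    code = np.array(codes[vec_id])               # one small SSD read
    q = query.reshape(M, DSUB)
    # Squared distance to the chosen centroid, summed over subspaces.
    return sum(((q[m] - codebooks[m, code[m]]) ** 2).sum() for m in range(M))

query = rng.standard_normal(M * DSUB).astype(np.float32)
print(approx_distance(query, 12345))
```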
arXiv Detail & Related papers (2024-04-09T04:20:27Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
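One plausible shape for that dense-to-sparse transformation: project the frozen dense embedding into a vocabulary-sized space, clip negatives, and keep only the top-k terms. The random projection matrix and all sizes are placeholders; in the paper this mapping is learned.

```python
import numpy as np

VOCAB, DIM, TOPK = 30_522, 512, 64   # BERT-sized vocab, illustrative sizes
rng = np.random.default_rng(0)
# In the paper this projection is learned; random weights just show the shape.
W = rng.standard_normal((DIM, VOCAB)).astype(np.float32) / np.sqrt(DIM)

def dense_to_sparse(dense_vec, topk=TOPK):
    """Map a frozen dense embedding to a sparse lexical vector:
    project to vocabulary space, ReLU, keep only the top-k terms."""
    weights = np.maximum(dense_vec @ W, 0.0)      # non-negative term weights
    idx = np.argpartition(weights, -topk)[-topk:]
    return dict(zip(idx.tolist(), weights[idx].tolist()))  # term id -> weight

dense = rng.standard_normal(DIM).astype(np.float32)
sparse = dense_to_sparse(dense)
print(len(sparse), "active terms out of", VOCAB)
```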
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
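The one-pass idea can be sketched as a two-part attention: compute unnormalized attention over the shared system-prompt KV once for the whole batch, compute per-request attention separately, then merge the two partial results with their softmax normalizers. A single-head NumPy illustration with made-up shapes, not the paper's kernel:

```python
import numpy as np

def _parts(scores):
    # Unnormalized softmax pieces: shifted weights, shift, normalizer.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    return e, m, e.sum(axis=-1, keepdims=True)

def relay_attention(Q, K_sys, V_sys, K_req, V_req):
    """Merge attention over the shared system-prompt KV (read once per
    batch) with per-request attention. Illustrative single-head version."""
    e1, m1, z1 = _parts(Q @ K_sys.T)                  # one pass over shared KV
    out_sys = e1 @ V_sys                              # (batch, d), unnormalized
    e2, m2, z2 = _parts(np.einsum("bd,brd->br", Q, K_req))
    out_req = np.einsum("br,brd->bd", e2, V_req)      # unnormalized
    m = np.maximum(m1, m2)                            # rescale both parts safely
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return (out_sys * a1 + out_req * a2) / (z1 * a1 + z2 * a2)

rng = np.random.default_rng(0)
B, S, R, D = 4, 128, 16, 64                           # batch, sys len, req len, dim
Q = rng.standard_normal((B, D))
K_sys, V_sys = rng.standard_normal((S, D)), rng.standard_normal((S, D))
K_req, V_req = rng.standard_normal((B, R, D)), rng.standard_normal((B, R, D))
print(relay_attention(Q, K_sys, V_sys, K_req, V_req).shape)   # (4, 64)
```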
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory [45.7528997281282]
Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
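A minimal sketch of the compression idea: quantize each pre-computed memory vector to the index of its nearest entry in a shared codebook, and reconstruct approximately at read time. Plain vector quantization with a random codebook is used here just to show the storage arithmetic; the paper's codec is more elaborate.

```python
import numpy as np

def vq_compress(memory, codebook):
    """Replace each memory vector with the index of its nearest codebook
    entry: d floats per vector become a single small integer."""
    d2 = ((memory ** 2).sum(1, keepdims=True)
          + (codebook ** 2).sum(1)
          - 2.0 * memory @ codebook.T)                # squared distances
    return d2.argmin(axis=1).astype(np.uint16)

def vq_decompress(codes, codebook):
    return codebook[codes]             # approximate reconstruction at read time

rng = np.random.default_rng(0)
codebook = rng.standard_normal((4096, 128)).astype(np.float32)   # shared, small
passage_memory = rng.standard_normal((512, 128)).astype(np.float32)
codes = vq_compress(passage_memory, codebook)
print(codes.nbytes, "bytes instead of", passage_memory.nbytes)   # 1024 vs 262144
```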
arXiv Detail & Related papers (2023-08-28T21:11:18Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
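The constant-memory claim follows from a fixed number of memory slots: each segment of a long sequence updates the same small slot matrix, so per-segment cost is constant and total cost is linear in sequence length. The cross-attention update below is a loose, single-head illustration, not Memformer's exact memory cell.

```python
import numpy as np

def segment_step(memory, segment, Wq, Wk, Wv):
    """Memory slots attend over one segment; slot count is fixed, so the
    cost per segment is constant (illustrative single-head update)."""
    scores = (memory @ Wq) @ (segment @ Wk).T / np.sqrt(memory.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return memory + attn @ (segment @ Wv)    # residual slot update

rng = np.random.default_rng(0)
D, SLOTS, SEG = 64, 8, 32
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
memory = rng.standard_normal((SLOTS, D))
long_sequence = rng.standard_normal((100 * SEG, D))  # processed chunk by chunk
for start in range(0, len(long_sequence), SEG):
    memory = segment_step(memory, long_sequence[start:start + SEG], Wq, Wk, Wv)
print(memory.shape)   # stays (8, 64) no matter how long the input grows
```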
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach in order to obtain a reduced search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.