MEMORY-VQ: Compression for Tractable Internet-Scale Memory
- URL: http://arxiv.org/abs/2308.14903v1
- Date: Mon, 28 Aug 2023 21:11:18 GMT
- Title: MEMORY-VQ: Compression for Tractable Internet-Scale Memory
- Authors: Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón,
William W. Cohen, Sumit Sanghai, Joshua Ainslie
- Abstract summary: Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
- Score: 45.7528997281282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval augmentation is a powerful but expensive method to make language
models more knowledgeable about the world. Memory-based methods like LUMEN
pre-compute token representations for retrieved passages to drastically speed
up inference. However, memory also leads to much greater storage requirements
from storing pre-computed representations.
We propose MEMORY-VQ, a new method to reduce storage requirements of
memory-augmented models without sacrificing performance. Our method uses a
vector quantization variational autoencoder (VQ-VAE) to compress token
representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a
memory model that achieves a 16x compression rate with comparable performance
on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even
for extremely large retrieval corpora.
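To make the storage arithmetic concrete, below is a minimal sketch of vector-quantization compression of pre-computed token memories: each memory vector is split into sub-vectors, and each sub-vector is stored as the index of its nearest codebook entry. The group count, codebook size, and nearest-neighbour assignment are illustrative assumptions chosen to reproduce a 16x rate for float16 memories; they are not the paper's VQ-VAE training procedure or exact configuration.

import numpy as np

def compress(memory, codebooks):
    """Replace each sub-vector of each token memory with its nearest codebook index.

    memory:    (num_tokens, dim) pre-computed token representations
    codebooks: (num_groups, codebook_size, dim // num_groups) centroids
    returns:   (num_tokens, num_groups) uint8 codes -- all that needs to be stored
    """
    num_groups = codebooks.shape[0]
    subvecs = memory.reshape(memory.shape[0], num_groups, -1)
    codes = np.empty((memory.shape[0], num_groups), dtype=np.uint8)
    for g in range(num_groups):
        # squared distance from every sub-vector to every centroid in this group
        dists = ((subvecs[:, g, None, :] - codebooks[g][None]) ** 2).sum(-1)
        codes[:, g] = dists.argmin(-1)
    return codes

def decompress(codes, codebooks):
    """Rebuild approximate token representations from the stored codes."""
    parts = [codebooks[g][codes[:, g]] for g in range(codes.shape[1])]
    return np.concatenate(parts, axis=-1)

# Hypothetical configuration: 1024-dim float16 memories split into 128 groups of
# 8 dims, each quantized against a 256-entry codebook. Storage per token drops
# from 1024 * 2 bytes to 128 * 1 byte, i.e. a 16x reduction (codebook overhead aside).
rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 1024)).astype(np.float32)
codebooks = rng.standard_normal((128, 256, 8)).astype(np.float32)
codes = compress(memory, codebooks)
approx = decompress(codes, codebooks)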
Related papers
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR)
CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference.
Our experiments demonstrate that CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms (a generic sketch of the sparse layout follows this entry).
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
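The CSR entry above describes replacing the dense KV cache tensor with sparse indexes and weights. The sketch below shows one generic way such a layout can look (a per-vector top-k magnitude approximation); it is only an illustration of the storage format, not CSR's actual 1-bit algorithm.

import numpy as np

def to_sparse(kv_cache, k=32):
    """Keep only the k largest-magnitude channels of each cached vector.

    kv_cache: (num_tokens, head_dim) dense keys or values
    returns:  (indices, weights), each (num_tokens, k) -- the compact form to store
    """
    indices = np.argsort(-np.abs(kv_cache), axis=-1)[:, :k]
    weights = np.take_along_axis(kv_cache, indices, axis=-1)
    return indices.astype(np.int16), weights.astype(np.float16)

def to_dense(indices, weights, head_dim):
    """Rebuild an approximate dense cache from the sparse form."""
    dense = np.zeros((indices.shape[0], head_dim), dtype=np.float32)
    np.put_along_axis(dense, indices.astype(np.int64),
                      weights.astype(np.float32), axis=-1)
    return dense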
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
CORM is a KV cache eviction policy that dynamically retains only the key-value pairs essential for inference, considerably reducing the cache's memory footprint without any model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench (an illustrative eviction sketch follows this entry).
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
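For intuition about what a KV cache eviction policy of this kind looks like, the sketch below drops cache entries that have received little attention from a window of recent queries and keeps the rest within a fixed budget. The scoring rule and budget handling are illustrative assumptions and will differ from CORM's actual retention criterion.

import numpy as np

def evict_kv(keys, values, recent_attn, budget):
    """Keep at most `budget` cached positions, ranked by recent attention mass.

    keys, values: (seq_len, head_dim) cached tensors for one attention head
    recent_attn:  (num_recent_queries, seq_len) attention weights from recent steps
    """
    importance = recent_attn.mean(axis=0)               # how much each position was used
    keep = np.sort(np.argsort(-importance)[:budget])    # top-`budget`, in original order
    return keys[keep], values[keep], keep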
- ESPN: Memory-Efficient Multi-Vector Information Retrieval [0.36832029288386137]
Multi-vector models amplify memory and storage requirements for retrieval indices by an order of magnitude.
We introduce Embedding from Storage Pipelined Network (ESPN), which offloads the entire re-ranking embedding tables to SSDs, reducing memory requirements by 5-16x.
We design a software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval by up to 6.4x, and demonstrate that we can maintain near-memory levels of query latency even for large query batch sizes (a simplified prefetching sketch follows this entry).
arXiv Detail & Related papers (2023-12-09T00:19:42Z)
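A rough sketch of the offload-and-prefetch idea follows: the re-ranking embedding table lives in a memory-mapped file on disk (standing in for an SSD-resident index), and a background thread starts reading the rows for candidate documents while other work proceeds. The file name, sizes, and threading model are assumptions for illustration; ESPN's actual GPU-direct pipeline is considerably more involved.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

# A small on-disk table standing in for a large SSD-resident embedding index.
NUM_DOCS, DIM = 10_000, 128
table = np.memmap("doc_embeddings.f32", dtype=np.float32, mode="w+",
                  shape=(NUM_DOCS, DIM))
pool = ThreadPoolExecutor(max_workers=4)

def prefetch(candidate_ids):
    """Kick off the storage read for candidate embeddings; returns a future."""
    return pool.submit(lambda ids: np.asarray(table[np.asarray(ids)]), candidate_ids)

def rerank(query_vec, candidate_ids, future):
    """Overlap scoring with the read started by prefetch()."""
    cand_embs = future.result()          # blocks only if the read has not finished
    order = np.argsort(-(cand_embs @ query_vec))
    return [candidate_ids[i] for i in order]

# Usage: candidates from first-stage retrieval are prefetched before re-ranking.
query = np.zeros(DIM, dtype=np.float32)
candidates = [3, 1_234, 9_876]
fut = prefetch(candidates)
ranked = rerank(query, candidates, fut)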
- GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach by exploiting free access to the powerful memory representations, applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost (a toy reranking sketch follows this entry).
arXiv Detail & Related papers (2023-06-17T01:54:25Z)
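A toy sketch of the "shallow reranker on top of memory" idea: pre-computed passage memories are scored against the live-encoded query with a cheap late-interaction score, and only the top-scoring passages are handed to the expensive reader. The shapes and scoring function are illustrative assumptions, not GLIMMER's actual architecture.

import numpy as np

def shallow_rerank(query_mem, passage_mems, top_k=4):
    """Cheaply score pre-computed passage memories against the query encoding.

    query_mem:    (q_tokens, dim) representation from the live query encoder
    passage_mems: (num_passages, p_tokens, dim) pre-computed memory per passage
    returns:      indices of the top_k passages to pass to the full reader
    """
    # Late-interaction style score: each query token takes its best-matching
    # passage token, then scores are summed over query tokens.
    sims = np.einsum("qd,npd->nqp", query_mem, passage_mems)
    scores = sims.max(axis=-1).sum(axis=-1)
    return np.argsort(-scores)[:top_k]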
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that adapts to new classes without forgetting old ones under a limited memory budget.
We show that when the model size is counted into the total budget and methods are compared at an aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)