MEMORY-VQ: Compression for Tractable Internet-Scale Memory
- URL: http://arxiv.org/abs/2308.14903v1
- Date: Mon, 28 Aug 2023 21:11:18 GMT
- Title: MEMORY-VQ: Compression for Tractable Internet-Scale Memory
- Authors: Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón,
William W. Cohen, Sumit Sanghai, Joshua Ainslie
- Abstract summary: Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
- Score: 45.7528997281282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval augmentation is a powerful but expensive method to make language
models more knowledgeable about the world. Memory-based methods like LUMEN
pre-compute token representations for retrieved passages to drastically speed
up inference. However, memory also leads to much greater storage requirements
from storing pre-computed representations.
We propose MEMORY-VQ, a new method to reduce storage requirements of
memory-augmented models without sacrificing performance. Our method uses a
vector quantization variational autoencoder (VQ-VAE) to compress token
representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a
memory model that achieves a 16x compression rate with comparable performance
on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even
for extremely large retrieval corpora.
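As a rough illustration of the compression idea (not the paper's exact recipe), the sketch below quantizes pre-computed memory token representations with a product-style codebook, storing small integer codes instead of dense float vectors. The hidden size, number of subvector groups, codebook size, and the assumption of an already-trained codebook are illustrative choices, not the LUMEN-VQ configuration.

```python
import numpy as np

# Minimal sketch of vector-quantized memory compression: pre-computed token
# representations are replaced by small integer codes into a product-style
# codebook. Sizes, the number of subvector groups, and the use of a fixed,
# already-trained codebook are illustrative assumptions, not the exact
# MEMORY-VQ / LUMEN-VQ setup.

D, G, K = 1024, 16, 256            # hidden size, subvector groups, codes per group
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(G, K, D // G)).astype(np.float32)  # stands in for a trained VQ-VAE codebook

def compress(token_reps: np.ndarray) -> np.ndarray:
    """Map each token vector to G uint8 codebook indices (lossy)."""
    n = token_reps.shape[0]
    subvecs = token_reps.reshape(n, G, D // G)
    codes = np.empty((n, G), dtype=np.uint8)
    for g in range(G):
        # nearest codebook entry per subvector (squared Euclidean distance)
        dists = ((subvecs[:, g, None, :] - codebooks[g][None]) ** 2).sum(-1)
        codes[:, g] = dists.argmin(-1)
    return codes                    # n x G bytes instead of n x D float32 values

def decompress(codes: np.ndarray) -> np.ndarray:
    """Reassemble approximate token representations from stored codes."""
    return np.concatenate([codebooks[g][codes[:, g]] for g in range(codes.shape[1])], axis=-1)

memory = rng.normal(size=(128, D)).astype(np.float32)   # pre-computed memory for one passage
codes = compress(memory)
approx = decompress(codes)
print(memory.nbytes / codes.nbytes)                     # storage ratio under these toy settings
```

Only the integer codes (and the shared codebook) need to be stored for the retrieval corpus; approximate token representations are reconstructed on the fly at inference time.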
Related papers
- R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression [24.825945729508682]
We propose R$^3$Mem, a memory network that optimizes both information Retention and Retrieval.
R$^3$Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy.
Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks.
arXiv Detail & Related papers (2025-02-21T21:39:00Z)
- When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models [12.687035979970194]
This paper introduces a framework to compress large language models (LLMs) after quantization.
A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method to improve it further.
Experiments show inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
arXiv Detail & Related papers (2025-02-21T13:11:22Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings.
Our approach proceeds in two stages, the first of which applies popularity-weighted regularization to balance code distribution between high- and low-frequency features.
Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR).
CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference.
Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained on 1 trillion tokens, compared against base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- ESPN: Memory-Efficient Multi-Vector Information Retrieval [0.36832029288386137]
Multi-vector models amplify memory and storage requirements for retrieval indices by an order of magnitude.
We introduce Embedding from Storage Pipelined Network (ESPN), where we offload the entire re-ranking embedding tables to SSDs, reducing the memory requirements by 5-16x.
We design a software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval up to 6.4x, and demonstrate that we can maintain near-memory levels of query latency even for large query batch sizes.
arXiv Detail & Related papers (2023-12-09T00:19:42Z)
- GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach by exploiting free access to the powerful memory representations, applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost.
arXiv Detail & Related papers (2023-06-17T01:54:25Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that adapts to new classes without forgetting old ones under a limited memory budget.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z)
- Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory [75.65949969000596]
Episodic and semantic memory are critical components of the human memory model.
We develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation.
arXiv Detail & Related papers (2021-02-20T18:40:40Z)
- Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled.
This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit.
We propose a radically different approach that (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimizes model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
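The Memformer entry above is summarized only at the level of its complexity claims. As a generic illustration (not Memformer's actual architecture) of how a fixed-size external memory keeps space constant in the sequence length, the toy sketch below processes a long sequence chunk by chunk while reading from and writing to a small, fixed set of memory slots; the slot count, dimensions, and update rule are assumptions for the sketch.

```python
import numpy as np

# Toy illustration: a fixed number of memory slots is read and updated with
# simple attention while a long sequence is consumed chunk by chunk, so the
# memory footprint stays constant in the sequence length. This is NOT the
# Memformer architecture; slot count, sizes, and the gated update are assumed.

rng = np.random.default_rng(0)
D, SLOTS, CHUNK = 64, 8, 32
memory = rng.normal(size=(SLOTS, D)).astype(np.float32)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def process_chunk(chunk: np.ndarray, memory: np.ndarray):
    """Read from memory for each token, then write a chunk summary back."""
    # read: each token attends over the memory slots
    read_weights = softmax(chunk @ memory.T / np.sqrt(D))   # (CHUNK, SLOTS)
    outputs = chunk + read_weights @ memory                 # toy "layer" output
    # write: each slot attends over the chunk and is gated toward its summary
    write_weights = softmax(memory @ chunk.T / np.sqrt(D))  # (SLOTS, CHUNK)
    new_memory = 0.9 * memory + 0.1 * (write_weights @ chunk)
    return outputs, new_memory

long_sequence = rng.normal(size=(100 * CHUNK, D)).astype(np.float32)
for start in range(0, len(long_sequence), CHUNK):
    out, memory = process_chunk(long_sequence[start:start + CHUNK], memory)
print(memory.shape)   # (8, 64) -- unchanged no matter how long the sequence is
```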
This list is automatically generated from the titles and abstracts of the papers on this site.