Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- URL: http://arxiv.org/abs/2301.10448v2
- Date: Fri, 2 Jun 2023 23:13:05 GMT
- Title: Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- Authors: Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua
Ainslie, Sumit Sanghai, Fei Sha, William Cohen
- Abstract summary: Fusion-in-Decoder models are powerful, setting the state of the art on a variety of knowledge-intensive tasks, but they are expensive because they must encode many retrieved passages.
Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly.
We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly.
We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget.
- Score: 23.85786594315147
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented language models such as Fusion-in-Decoder are powerful,
setting the state of the art on a variety of knowledge-intensive tasks.
However, they are also expensive, due to the need to encode a large number of
retrieved passages. Some work avoids this cost by pre-encoding a text corpus
into a memory and retrieving dense representations directly. However,
pre-encoding memory incurs a severe quality penalty as the memory
representations are not conditioned on the current input. We propose LUMEN, a
hybrid between these two extremes, pre-computing the majority of the retrieval
representation and completing the encoding on the fly using a live encoder that
is conditioned on the question and fine-tuned for the task. We show that LUMEN
significantly outperforms pure memory on multiple question-answering tasks
while being much cheaper than FiD, and outperforms both for any given compute
budget. Moreover, the advantage of LUMEN over FiD increases with model size.
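As a rough illustration of the compute split described in the abstract, the toy sketch below pre-encodes a small corpus into a memory offline and, at query time, completes the encoding with a small "live" step conditioned on the question. The linear maps, the `embed` helper, and all dimensions are illustrative stand-ins for the paper's transformer encoders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy representation width

def embed(text: str) -> np.ndarray:
    """Toy bag-of-hashed-tokens embedding (stand-in for real token embeddings)."""
    vec = np.zeros(d)
    for tok in text.lower().split():
        vec[hash(tok) % d] += 1.0
    return vec

# Stand-ins for the large memory encoder (applied once, offline) and the
# small live encoder (applied per query, online, fine-tuned for the task).
W_memory = rng.normal(size=(d, d)) / np.sqrt(d)
W_live = rng.normal(size=(d, d)) / np.sqrt(d)

corpus = [
    "LUMEN pre-computes most of the retrieval representation.",
    "Fusion-in-Decoder re-encodes every retrieved passage per question.",
    "Dense memories are cheap but not conditioned on the input.",
]

# Offline: pre-encode the corpus into a memory (the expensive part, done once).
memory = np.stack([np.tanh(W_memory @ embed(p)) for p in corpus])

def question_conditioned_encoding(question: str, k: int = 2) -> np.ndarray:
    """Online: retrieve from memory, then complete the encoding with the live
    encoder conditioned on the question (the cheap, per-query part)."""
    q = np.tanh(W_memory @ embed(question))
    top = np.argsort(memory @ q)[::-1][:k]           # dense retrieval over memory
    fused = [np.tanh(W_live @ (memory[i] + q)) for i in top]
    return np.mean(fused, axis=0)                    # representation fed to the reader

print(question_conditioned_encoding("How does LUMEN save compute?").shape)
```

The point of the sketch is only the ordering of work: the corpus pass happens once, while each question pays only for the small live step over the retrieved memory entries.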
Related papers
- FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection [61.9638234358049]
FastFiD is a novel approach that executes sentence selection on encoded passages.
This aids in retaining valuable sentences while reducing the context length required for generating answers.
arXiv Detail & Related papers (2024-08-12T17:50:02Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant overhead in hardware performance.
We propose a novel parallel prompt decoding scheme that requires only 0.0002% additional trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49× speedup while maintaining a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z)
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory [45.7528997281282]
Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce the storage requirements of memory-augmented models without sacrificing performance; a toy quantization sketch appears after this list.
arXiv Detail & Related papers (2023-08-28T21:11:18Z)
- Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception [19.627636189321393]
A promising avenue for memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos.
The current technology lacks the capability to encode and store such large amounts of data efficiently.
We propose a memory augmentation agent that encodes video data into natural language and stores it in a vector database.
arXiv Detail & Related papers (2023-08-10T18:43:44Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach by exploiting free access to the powerful memory representations, applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost; a toy reranking sketch appears after this list.
arXiv Detail & Related papers (2023-06-17T01:54:25Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory [119.98011559193574]
We propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL).
It learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries.
A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data.
arXiv Detail & Related papers (2022-12-10T06:17:56Z)
- Recurrent Relational Memory Network for Unsupervised Image Captioning [26.802700428311745]
Unsupervised image captioning with no annotations is a challenge in computer vision.
In this paper, we propose a novel memory-based network rather than a GAN-based model.
Our solution has fewer learnable parameters and higher computational efficiency than GAN-based methods.
arXiv Detail & Related papers (2020-06-24T10:44:35Z)
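The MEMORY-VQ entry above concerns shrinking the storage footprint of pre-computed memories. The sketch below runs a generic product-quantization pass over a toy memory matrix: each vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest codebook entry, and only the uint8 codes plus the small codebooks are kept. This is a plain illustration of vector quantization under assumed shapes and names, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_sub, n_codes = 2000, 64, 8, 256        # 8 sub-vectors, 256 codes each
memory = rng.normal(size=(n, d)).astype(np.float32)   # pretend pre-computed memory

def fit_codebook(x: np.ndarray, iters: int = 10) -> np.ndarray:
    """Plain k-means codebook for one sub-space."""
    centers = x[rng.choice(len(x), n_codes, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_codes):
            pts = x[assign == c]
            if len(pts):
                centers[c] = pts.mean(0)
    return centers

subs = np.split(memory, n_sub, axis=1)          # (n, d/n_sub) blocks
codebooks = [fit_codebook(s) for s in subs]
codes = np.stack(
    [np.argmin(((s[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
     for s, cb in zip(subs, codebooks)],
    axis=1,
).astype(np.uint8)

# Storage drops from n*d float32 values to n*n_sub uint8 codes (+ codebooks).
print(memory.nbytes, "bytes ->", codes.nbytes, "bytes")

# Approximately reconstruct one memory vector when it is retrieved.
approx = np.concatenate([codebooks[j][codes[0, j]] for j in range(n_sub)])
print(np.linalg.norm(memory[0] - approx))
```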
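The GLIMMER entry describes applying a shallow reranker directly on top of the pre-computed memory so that only the best candidates reach the more expensive live encoder. Below is a toy version of that idea with an illustrative bilinear scorer; the scorer, shapes, and candidate counts are assumptions for the sketch, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Pretend these vectors were pre-encoded offline (see the LUMEN sketch above).
memory = rng.normal(size=(1000, d))               # one vector per stored passage
W_rerank = rng.normal(size=(d, d)) / np.sqrt(d)   # cheap, shallow scorer weights

def rerank(question_vec: np.ndarray, candidate_ids: np.ndarray, k: int = 5) -> np.ndarray:
    """Re-score an initial candidate set directly on the memory vectors and
    keep the top-k, so only k passages reach the more expensive live encoder."""
    scores = memory[candidate_ids] @ (W_rerank @ question_vec)
    order = np.argsort(scores)[::-1][:k]
    return candidate_ids[order]

q = rng.normal(size=d)
first_stage = np.argsort(memory @ q)[::-1][:50]   # cheap first-stage retrieval
print(rerank(q, first_stage))                     # ids handed to the live encoder
```

The reranker reads the memory representations that already exist, so the extra cost per query is a single small matrix-vector product plus a top-k selection.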
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.