Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- URL: http://arxiv.org/abs/2301.10448v2
- Date: Fri, 2 Jun 2023 23:13:05 GMT
- Title: Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- Authors: Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua
Ainslie, Sumit Sanghai, Fei Sha, William Cohen
- Abstract summary: Retrieval-augmented models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks, but encoding many retrieved passages is expensive.
Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly.
We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly.
We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget.
- Score: 23.85786594315147
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented language models such as Fusion-in-Decoder are powerful,
setting the state of the art on a variety of knowledge-intensive tasks.
However, they are also expensive, due to the need to encode a large number of
retrieved passages. Some work avoids this cost by pre-encoding a text corpus
into a memory and retrieving dense representations directly. However,
pre-encoding memory incurs a severe quality penalty as the memory
representations are not conditioned on the current input. We propose LUMEN, a
hybrid between these two extremes, pre-computing the majority of the retrieval
representation and completing the encoding on the fly using a live encoder that
is conditioned on the question and fine-tuned for the task. We show that LUMEN
significantly outperforms pure memory on multiple question-answering tasks
while being much cheaper than FiD, and outperforms both for any given compute
budget. Moreover, the advantage of LUMEN over FiD increases with model size.
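
The split the abstract describes (pre-compute most of each passage's representation offline, then finish encoding online conditioned on the question) can be sketched roughly as follows. This is a minimal illustration under toy assumptions, not the authors' implementation: the random-projection "encoders", the toy token embeddings, and all sizes are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hidden size (illustrative)

# Stand-ins for the frozen memory encoder and the small live encoder.
# In LUMEN these are Transformer stacks; random projections keep the
# sketch self-contained and runnable.
W_memory = rng.normal(size=(D, D)) / np.sqrt(D)   # large, frozen, run offline
W_live = rng.normal(size=(D, D)) / np.sqrt(D)     # small, fine-tuned, run online

def embed(text: str) -> np.ndarray:
    """Toy token embeddings: one D-dim vector per whitespace token."""
    toks = text.split()
    return rng.normal(size=(len(toks), D))

def precompute_memory(passages: list[str]) -> list[np.ndarray]:
    """Offline: encode each corpus passage once with the large memory encoder."""
    return [embed(p) @ W_memory for p in passages]

def answer_representation(question: str, memory: list[np.ndarray],
                          retrieved_ids: list[int]) -> np.ndarray:
    """Online: concatenate question tokens with pre-computed passage states and
    finish encoding with the small live encoder, conditioned on the question."""
    q = embed(question)
    parts = [q] + [memory[i] for i in retrieved_ids]
    hidden = np.concatenate(parts, axis=0)
    return hidden @ W_live          # only this small step pays online compute

passages = ["the moon orbits the earth", "paris is the capital of france"]
memory = precompute_memory(passages)          # done once, stored
enc = answer_representation("what orbits the earth", memory, retrieved_ids=[0])
print(enc.shape)   # (question tokens + retrieved passage tokens, D)
```

The point of the split is that the expensive pass over the corpus happens once offline, while each query pays only for the small live-encoder pass; the larger the memory encoder is relative to the live encoder, the bigger the saving, which is consistent with the advantage over FiD growing with model size.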
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant hardware overhead.
We propose a novel parallel prompt decoding technique that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49× speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
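
As a rough, assumption-laden sketch of the cache-once idea in the YOCO entry above (not the YOCO code): a first block of layers produces one global key-value cache, and all later layers attend to that same cache instead of keeping per-layer caches.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L_self, L_cross, T = 32, 4, 4, 16   # hidden size, layer counts, sequence length

def attention(q, k, v):
    """Plain (unmasked) scaled dot-product attention, enough for the sketch."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(T, D))            # token states for the current prefix

# "Self-decoder": ordinary layers, but only the FINAL output is written to
# the single shared cache.
h = x
for _ in range(L_self):
    h = h + attention(h, h, h)
shared_k, shared_v = h.copy(), h.copy()   # the one-and-only KV cache

# "Cross-decoder": every layer reads the same cached K/V (cross-attention),
# so cache memory is O(T * D) instead of O(num_layers * T * D).
for _ in range(L_cross):
    h = h + attention(h, shared_k, shared_v)

print("cached floats (shared once):", shared_k.size + shared_v.size)
print("cached floats (per-layer baseline):", (L_self + L_cross) * 2 * T * D)
```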
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z)
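
A toy sketch of computing "distributed utterance mixtures" from independently encoded utterances, as in the triple-encoders entry above; the hash-based embedder and the simple additive mixing are stand-ins for illustration, not the paper's trained encoder.

```python
import numpy as np

D = 64  # embedding size (illustrative)

def embed(utterance: str) -> np.ndarray:
    """Stand-in embedder: deterministic per utterance, unit-normalized."""
    local = np.random.default_rng(abs(hash(utterance)) % (2**32))
    v = local.normal(size=D)
    return v / np.linalg.norm(v)

def mixture(u1: str, u2: str) -> np.ndarray:
    """Mix two independently encoded utterances into one context vector."""
    m = embed(u1) + embed(u2)
    return m / np.linalg.norm(m)

def rank(context: tuple[str, str], candidates: list[str]) -> list[tuple[float, str]]:
    """Score candidate next utterances by similarity to the mixture."""
    mix = mixture(*context)
    scored = [(float(embed(c) @ mix), c) for c in candidates]
    return sorted(scored, reverse=True)

print(rank(("how are you", "pretty good, you?"),
           ["not bad at all", "the moon orbits the earth"]))
```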
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory [45.7528997281282]
Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
arXiv Detail & Related papers (2023-08-28T21:11:18Z)
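
To illustrate the storage argument in the MEMORY-VQ entry above, the sketch below compresses a pre-computed memory with generic vector quantization (a small codebook plus one integer code per vector); the codebook construction and all sizes are assumptions, not MEMORY-VQ's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 64, 10_000, 256          # dim, memory vectors, codebook size

memory = rng.normal(size=(N, D)).astype(np.float32)      # pre-computed memory
# Toy codebook: random memory vectors (k-means centroids in practice).
codebook = memory[rng.choice(N, size=K, replace=False)]

def quantize(x: np.ndarray) -> np.ndarray:
    """Map each vector to the index of its nearest codebook entry."""
    d2 = (x ** 2).sum(1, keepdims=True) - 2 * x @ codebook.T + (codebook ** 2).sum(1)
    return d2.argmin(axis=1).astype(np.uint8)             # one byte per vector

def dequantize(codes: np.ndarray) -> np.ndarray:
    """Reconstruct approximate vectors from their codes."""
    return codebook[codes]

codes = quantize(memory)
approx = dequantize(codes)

orig_bytes = memory.nbytes
comp_bytes = codes.nbytes + codebook.nbytes
print(f"compression: {orig_bytes / comp_bytes:.1f}x "
      f"(reconstruction mse {np.mean((memory - approx) ** 2):.3f})")
```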
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
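
A minimal sketch of compressing a long context into a handful of memory slots, in the spirit of the ICAE entry above; the attention-pooling compressor and all sizes are illustrative assumptions rather than ICAE's actual architecture or training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, K = 32, 512, 8               # hidden size, context length, slot count

context = rng.normal(size=(T, D))          # token states of a long context
slot_queries = rng.normal(size=(K, D))     # would be learned in practice

def compress(ctx: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Each slot query attends over the context; its output becomes a slot."""
    scores = queries @ ctx.T / np.sqrt(ctx.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ ctx                          # (K, D) memory slots

slots = compress(context, slot_queries)
print(f"context {context.shape} -> slots {slots.shape} "
      f"({T // K}x shorter for the decoder to attend over)")
```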
- GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach by exploiting free access to the powerful memory representations: applying a shallow reranker on top of memory drastically improves retrieval quality at low cost.
arXiv Detail & Related papers (2023-06-17T01:54:25Z)
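
The "shallow reranker on top of memory" idea in the GLIMMER entry above can be illustrated with a generic late-interaction (MaxSim) scorer over pre-computed token representations; the scoring rule and sizes are assumptions, not necessarily GLIMMER's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pre-computed token-level memory for each candidate passage (free to reuse).
passages = [normalize(rng.normal(size=(20, D))) for _ in range(100)]
question = normalize(rng.normal(size=(8, D)))      # live-encoded question tokens

def late_interaction_score(q: np.ndarray, p: np.ndarray) -> float:
    """Sum over question tokens of the best-matching passage token (MaxSim)."""
    return float((q @ p.T).max(axis=1).sum())

# Rerank cheaply with representations that already exist, then hand only the
# top few passages to the expensive live encoder / reader.
scores = np.array([late_interaction_score(question, p) for p in passages])
top_k = np.argsort(-scores)[:4]
print("passages forwarded to the live encoder:", top_k.tolist())
```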
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework whose first stage, the Dynamic-Quantization VAE (DQ-VAE), encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory [119.98011559193574]
We propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL)
It learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries.
A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data.
arXiv Detail & Related papers (2022-12-10T06:17:56Z)
- Is Your Language Model Ready for Dense Representation Fine-tuning? [15.238322226336232]
This paper shows that one cause of difficulty in dense representation fine-tuning lies in the readiness of the LM to expose its knowledge through dense representations.
We present Condenser, a general pre-training architecture based on Transformer LMs, to improve dense optimization readiness.
arXiv Detail & Related papers (2021-04-16T17:36:44Z)
- Recurrent Relational Memory Network for Unsupervised Image Captioning [26.802700428311745]
Unsupervised image captioning, which uses no annotations, is a challenging task in computer vision.
In this paper, we propose a novel memory-based network rather than an emerging GAN model.
Our solution enjoys less learnable parameters and higher computational efficiency than GAN-based methods.
arXiv Detail & Related papers (2020-06-24T10:44:35Z)