Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- URL: http://arxiv.org/abs/2301.10448v2
- Date: Fri, 2 Jun 2023 23:13:05 GMT
- Title: Pre-computed memory or on-the-fly encoding? A hybrid approach to
retrieval augmentation makes the most of your compute
- Authors: Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua
Ainslie, Sumit Sanghai, Fei Sha, William Cohen
- Abstract summary: Retrieval-augmented models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks, but encoding many retrieved passages is expensive.
Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly.
We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly.
We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget.
- Score: 23.85786594315147
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented language models such as Fusion-in-Decoder are powerful,
setting the state of the art on a variety of knowledge-intensive tasks.
However, they are also expensive, due to the need to encode a large number of
retrieved passages. Some work avoids this cost by pre-encoding a text corpus
into a memory and retrieving dense representations directly. However,
pre-encoding memory incurs a severe quality penalty as the memory
representations are not conditioned on the current input. We propose LUMEN, a
hybrid between these two extremes, pre-computing the majority of the retrieval
representation and completing the encoding on the fly using a live encoder that
is conditioned on the question and fine-tuned for the task. We show that LUMEN
significantly outperforms pure memory on multiple question-answering tasks
while being much cheaper than FiD, and outperforms both for any given compute
budget. Moreover, the advantage of LUMEN over FiD increases with model size.
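
The split the abstract describes (pre-compute most of each passage's representation offline, then finish encoding online conditioned on the question) can be sketched roughly as follows. This is a minimal illustration under toy assumptions, not the authors' implementation: the random-projection "encoders", the toy token embeddings, and all sizes are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hidden size (illustrative)

# Stand-ins for the frozen memory encoder and the small live encoder.
# In LUMEN these are Transformer stacks; random projections keep the
# sketch self-contained and runnable.
W_memory = rng.normal(size=(D, D)) / np.sqrt(D)   # large, frozen, run offline
W_live = rng.normal(size=(D, D)) / np.sqrt(D)     # small, fine-tuned, run online

def embed(text: str) -> np.ndarray:
    """Toy token embeddings: one D-dim vector per whitespace token."""
    toks = text.split()
    return rng.normal(size=(len(toks), D))

def precompute_memory(passages: list[str]) -> list[np.ndarray]:
    """Offline: encode each corpus passage once with the large memory encoder."""
    return [embed(p) @ W_memory for p in passages]

def answer_representation(question: str, memory: list[np.ndarray],
                          retrieved_ids: list[int]) -> np.ndarray:
    """Online: concatenate question tokens with pre-computed passage states and
    finish encoding with the small live encoder, conditioned on the question."""
    q = embed(question)
    parts = [q] + [memory[i] for i in retrieved_ids]
    hidden = np.concatenate(parts, axis=0)
    return hidden @ W_live          # only this small step pays online compute

passages = ["the moon orbits the earth", "paris is the capital of france"]
memory = precompute_memory(passages)          # done once, stored
enc = answer_representation("what orbits the earth", memory, retrieved_ids=[0])
print(enc.shape)   # (question tokens + retrieved passage tokens, D)
```

The point of the split is that the expensive pass over the corpus happens once offline, while each query pays only for the small live-encoder pass; the larger the memory encoder is relative to the live encoder, the bigger the saving, which is consistent with the advantage over FiD growing with model size.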
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant hardware overhead.
We propose a novel parallel prompt decoding technique that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49× speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
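
As a rough, assumption-laden sketch of the cache-once idea in the YOCO entry above (not the YOCO code): a first block of layers produces one global key-value cache, and all later layers attend to that same cache instead of keeping per-layer caches.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L_self, L_cross, T = 32, 4, 4, 16   # hidden size, layer counts, sequence length

def attention(q, k, v):
    """Plain (unmasked) scaled dot-product attention, enough for the sketch."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(T, D))            # token states for the current prefix

# "Self-decoder": ordinary layers, but only the FINAL output is written to
# the single shared cache.
h = x
for _ in range(L_self):
    h = h + attention(h, h, h)
shared_k, shared_v = h.copy(), h.copy()   # the one-and-only KV cache

# "Cross-decoder": every layer reads the same cached K/V (cross-attention),
# so cache memory is O(T * D) instead of O(num_layers * T * D).
for _ in range(L_cross):
    h = h + attention(h, shared_k, shared_v)

print("cached floats (shared once):", shared_k.size + shared_v.size)
print("cached floats (per-layer baseline):", (L_self + L_cross) * 2 * T * D)
```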
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z)
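
A toy sketch of computing "distributed utterance mixtures" from independently encoded utterances, as in the triple-encoders entry above; the hash-based embedder and the simple additive mixing are stand-ins for illustration, not the paper's trained encoder.

```python
import numpy as np

D = 64  # embedding size (illustrative)

def embed(utterance: str) -> np.ndarray:
    """Stand-in embedder: deterministic per utterance, unit-normalized."""
    local = np.random.default_rng(abs(hash(utterance)) % (2**32))
    v = local.normal(size=D)
    return v / np.linalg.norm(v)

def mixture(u1: str, u2: str) -> np.ndarray:
    """Mix two independently encoded utterances into one context vector."""
    m = embed(u1) + embed(u2)
    return m / np.linalg.norm(m)

def rank(context: tuple[str, str], candidates: list[str]) -> list[tuple[float, str]]:
    """Score candidate next utterances by similarity to the mixture."""
    mix = mixture(*context)
    scored = [(float(embed(c) @ mix), c) for c in candidates]
    return sorted(scored, reverse=True)

print(rank(("how are you", "pretty good, you?"),
           ["not bad at all", "the moon orbits the earth"]))
```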
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory [45.7528997281282]
Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
arXiv Detail & Related papers (2023-08-28T21:11:18Z)
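
To illustrate the storage argument in the MEMORY-VQ entry above, the sketch below compresses a pre-computed memory with generic vector quantization (a small codebook plus one integer code per vector); the codebook construction and all sizes are assumptions, not MEMORY-VQ's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 64, 10_000, 256          # dim, memory vectors, codebook size

memory = rng.normal(size=(N, D)).astype(np.float32)      # pre-computed memory
# Toy codebook: random memory vectors (k-means centroids in practice).
codebook = memory[rng.choice(N, size=K, replace=False)]

def quantize(x: np.ndarray) -> np.ndarray:
    """Map each vector to the index of its nearest codebook entry."""
    d2 = (x ** 2).sum(1, keepdims=True) - 2 * x @ codebook.T + (codebook ** 2).sum(1)
    return d2.argmin(axis=1).astype(np.uint8)             # one byte per vector

def dequantize(codes: np.ndarray) -> np.ndarray:
    """Reconstruct approximate vectors from their codes."""
    return codebook[codes]

codes = quantize(memory)
approx = dequantize(codes)

orig_bytes = memory.nbytes
comp_bytes = codes.nbytes + codebook.nbytes
print(f"compression: {orig_bytes / comp_bytes:.1f}x "
      f"(reconstruction mse {np.mean((memory - approx) ** 2):.3f})")
```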
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
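
A minimal sketch of compressing a long context into a handful of memory slots, in the spirit of the ICAE entry above; the attention-pooling compressor and all sizes are illustrative assumptions rather than ICAE's actual architecture or training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, K = 32, 512, 8               # hidden size, context length, slot count

context = rng.normal(size=(T, D))          # token states of a long context
slot_queries = rng.normal(size=(K, D))     # would be learned in practice

def compress(ctx: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Each slot query attends over the context; its output becomes a slot."""
    scores = queries @ ctx.T / np.sqrt(ctx.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ ctx                          # (K, D) memory slots

slots = compress(context, slot_queries)
print(f"context {context.shape} -> slots {slots.shape} "
      f"({T // K}x shorter for the decoder to attend over)")
```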
- GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach by exploiting free access to the powerful memory representations: applying a shallow reranker on top of memory drastically improves retrieval quality at low cost.
arXiv Detail & Related papers (2023-06-17T01:54:25Z)
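
The "shallow reranker on top of memory" idea in the GLIMMER entry above can be illustrated with a generic late-interaction (MaxSim) scorer over pre-computed token representations; the scoring rule and sizes are assumptions, not necessarily GLIMMER's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pre-computed token-level memory for each candidate passage (free to reuse).
passages = [normalize(rng.normal(size=(20, D))) for _ in range(100)]
question = normalize(rng.normal(size=(8, D)))      # live-encoded question tokens

def late_interaction_score(q: np.ndarray, p: np.ndarray) -> float:
    """Sum over question tokens of the best-matching passage token (MaxSim)."""
    return float((q @ p.T).max(axis=1).sum())

# Rerank cheaply with representations that already exist, then hand only the
# top few passages to the expensive live encoder / reader.
scores = np.array([late_interaction_score(question, p) for p in passages])
top_k = np.argsort(-scores)[:4]
print("passages forwarded to the live encoder:", top_k.tolist())
```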
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework whose first stage, the Dynamic-Quantization VAE (DQ-VAE), encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory [119.98011559193574]
We propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL)
It learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries.
A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data.
arXiv Detail & Related papers (2022-12-10T06:17:56Z)
- Is Your Language Model Ready for Dense Representation Fine-tuning? [15.238322226336232]
This paper shows that one cause of difficulty in dense representation fine-tuning lies in the readiness of the LM to expose its knowledge through dense representations.
We present Condenser, a general pre-training architecture based on Transformer LMs, to improve dense optimization readiness.
arXiv Detail & Related papers (2021-04-16T17:36:44Z)
- Recurrent Relational Memory Network for Unsupervised Image Captioning [26.802700428311745]
Unsupervised image captioning, which uses no annotations, is a challenging task in computer vision.
In this paper, we propose a novel memory-based network rather than an emerging GAN model.
Our solution enjoys less learnable parameters and higher computational efficiency than GAN-based methods.
arXiv Detail & Related papers (2020-06-24T10:44:35Z)