Improving language models by retrieving from trillions of tokens
- URL: http://arxiv.org/abs/2112.04426v1
- Date: Wed, 8 Dec 2021 17:32:34 GMT
- Title: Improving language models by retrieving from trillions of tokens
- Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza
Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau,
Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick,
Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin
Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon
Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre
- Abstract summary: We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
- Score: 50.42630445476544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We enhance auto-regressive language models by conditioning on document chunks
retrieved from a large corpus, based on local similarity with preceding tokens.
With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO)
obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite
using 25$\times$ fewer parameters. After fine-tuning, RETRO performance
translates to downstream knowledge-intensive tasks such as question answering.
RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked
cross-attention mechanism to predict tokens based on an order of magnitude more
data than what is typically consumed during training. We typically train RETRO
from scratch, yet can also rapidly RETROfit pre-trained transformers with
retrieval and still achieve good performance. Our work opens up new avenues for
improving language models through explicit memory at unprecedented scale.
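To make the retrieval pipeline concrete, below is a minimal, illustrative sketch of the two pieces the abstract names: chunk-level nearest-neighbour retrieval and chunked cross-attention. All sizes, the mean-pooled stand-in for the frozen BERT retriever, the exact cosine-similarity search (the paper uses an approximate index over its 2-trillion-token database), and the omission of the paper's one-chunk causal offset are simplifying assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

CHUNK_LEN, D_MODEL, K_NEIGHBOURS = 64, 512, 2

def embed_chunks(chunk_tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the frozen BERT retriever: mean-pool each chunk's token
    embeddings into a single key vector (a deliberate simplification)."""
    return chunk_tokens.mean(dim=1)                        # (n_chunks, d)

def retrieve(query_keys: torch.Tensor, db_keys: torch.Tensor,
             db_chunks: torch.Tensor, k: int = K_NEIGHBOURS) -> torch.Tensor:
    """Exact cosine-similarity nearest neighbours; the paper uses an
    approximate index to make this tractable at trillions of tokens."""
    sims = F.normalize(query_keys, dim=-1) @ F.normalize(db_keys, dim=-1).T
    idx = sims.topk(k, dim=-1).indices                     # (n_chunks, k)
    return db_chunks[idx]                                  # (n_chunks, k, CHUNK_LEN, d)

class ChunkedCrossAttention(torch.nn.Module):
    """Each chunk of decoder states attends to the neighbours retrieved for
    that chunk; the paper shifts queries by one chunk to preserve causality,
    which is omitted here for brevity."""
    def __init__(self, d_model: int = D_MODEL):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        n_chunks, k, chunk_len, d = neighbours.shape
        h = hidden.reshape(n_chunks, CHUNK_LEN, d)             # queries, per chunk
        mem = neighbours.reshape(n_chunks, k * chunk_len, d)   # keys/values, per chunk
        out, _ = self.attn(h, mem, mem)
        return hidden + out.reshape_as(hidden)                 # residual connection

# Toy usage on random data (shapes only, not meaningful retrieval)
seq = torch.randn(4 * CHUNK_LEN, D_MODEL)                  # decoder states for 4 chunks
db_chunks = torch.randn(100, CHUNK_LEN, D_MODEL)           # 100 pre-embedded database chunks
query_keys = embed_chunks(seq.reshape(4, CHUNK_LEN, D_MODEL))
neigh = retrieve(query_keys, embed_chunks(db_chunks), db_chunks)
out = ChunkedCrossAttention()(seq, neigh)                  # same shape as seq
```

In the full model the retrieved neighbours are additionally passed through a trainable encoder conditioned on the query chunk before the decoder cross-attends to them; that encoder is collapsed into the pre-embedded database chunks above.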
Related papers
- Pushdown Layers: Encoding Recursive Structure in Transformer Language
Models [86.75729087623259]
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization.
arXiv Detail & Related papers (2023-10-29T17:27:18Z)
- On the Generalization Ability of Retrieval-Enhanced Transformers [1.0552465253379135]
Off-loading memory from trainable weights to a retrieval database can significantly improve language modeling.
It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval.
We find that the performance gains from retrieval largely originate from overlapping tokens between the database and the test data.
arXiv Detail & Related papers (2023-02-23T16:11:04Z)
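The overlap finding above can be made concrete with a simple n-gram leakage check. The sketch below is a hypothetical illustration of how database-test overlap might be quantified, not the paper's evaluation protocol; the whitespace tokenization and n-gram length are arbitrary choices.

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token sequence as hashable tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_docs, database_docs, n=8):
    """Fraction of test-set n-grams that also occur verbatim in the retrieval
    database; a high value hints that retrieval gains may come from copying
    near-duplicates rather than from genuine generalization."""
    db_grams = set()
    for doc in database_docs:
        db_grams |= ngrams(doc, n)
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    return len(test_grams & db_grams) / max(len(test_grams), 1)

# Toy usage with whitespace tokenization (purely illustrative)
database = ["the cat sat on the mat and looked around".split()]
test_set = ["the cat sat on the mat and purred loudly".split()]
print(overlap_fraction(test_set, database, n=4))           # 0.666...
```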
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
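A minimal sketch of the NarrowBERT sparsification idea follows: only the masked positions act as attention queries and pass through the feed-forward block, while every position still contributes keys and values. The layer sizes, the missing residual/LayerNorm plumbing, and returning only masked-position states (as fed to an MLM head) are assumptions for brevity, not the paper's exact architecture.

```python
import torch

class NarrowEncoderLayer(torch.nn.Module):
    """Attention queries and the feed-forward block run only on [MASK]
    positions; residual connections and LayerNorm are omitted for brevity."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )

    def forward(self, hidden: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d); masked_idx: (batch, n_masked) indices of [MASK] tokens
        queries = torch.gather(
            hidden, 1, masked_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
        attn_out, _ = self.attn(queries, hidden, hidden)   # narrow queries, full keys/values
        return attn_out + self.ffn(attn_out)               # (batch, n_masked, d) for the MLM head

# Toy usage: 20 masked positions per 128-token sentence
states = torch.randn(2, 128, 768)
mask_positions = torch.randint(0, 128, (2, 20))
narrow_out = NarrowEncoderLayer()(states, mask_positions)  # (2, 20, 768)
```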
- History Compression via Language Models in Reinforcement Learning [5.937618881286057]
In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP.
We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency.
Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module.
arXiv Detail & Related papers (2022-05-24T17:59:29Z)
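As a rough illustration of the HELM recipe, the sketch below freezes a stand-in transformer, uses its last hidden state as a compressed history, and feeds that memory together with the current observation into actor and critic heads. The linear map into the frozen model's input space, the stand-in encoder, and all dimensions are assumptions; the paper's own observation-to-token mapping differs.

```python
import torch

class FrozenHistoryCompressor(torch.nn.Module):
    """Compress a history of observations with a frozen pretrained transformer."""
    def __init__(self, obs_dim: int, frozen_lm: torch.nn.Module, lm_dim: int):
        super().__init__()
        self.project = torch.nn.Linear(obs_dim, lm_dim)    # trainable map into LM space
        self.lm = frozen_lm
        for p in self.lm.parameters():                     # keep the transformer frozen
            p.requires_grad_(False)

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        # obs_history: (batch, time, obs_dim) -> memory: (batch, lm_dim)
        hidden = self.lm(self.project(obs_history))
        return hidden[:, -1]                               # last state summarizes the history

class ActorCritic(torch.nn.Module):
    def __init__(self, obs_dim: int, mem_dim: int, n_actions: int):
        super().__init__()
        self.policy = torch.nn.Linear(obs_dim + mem_dim, n_actions)
        self.value = torch.nn.Linear(obs_dim + mem_dim, 1)

    def forward(self, obs: torch.Tensor, memory: torch.Tensor):
        x = torch.cat([obs, memory], dim=-1)
        return self.policy(x), self.value(x)               # action logits, state value

# Toy usage with a small randomly initialised encoder standing in for a pretrained LM
lm_stub = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
compressor = FrozenHistoryCompressor(obs_dim=32, frozen_lm=lm_stub, lm_dim=256)
memory = compressor(torch.randn(8, 50, 32))                # 50-step histories, batch of 8
agent = ActorCritic(obs_dim=32, mem_dim=256, n_actions=6)
logits, value = agent(torch.randn(8, 32), memory)
```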
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
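The sentence-bottleneck recipe above lends itself to a compact sketch: a frozen transformer encodes a noised sentence, a small trainable pooling layer squeezes it into one bottleneck vector, and a single trainable decoder layer reconstructs the original tokens by cross-attending to that vector. The mean-pool-plus-tanh bottleneck, the stand-in encoder interface, and the vocabulary head are illustrative assumptions rather than the paper's exact design.

```python
import torch

class SentenceBottleneckAE(torch.nn.Module):
    def __init__(self, frozen_encoder: torch.nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.encoder = frozen_encoder                      # pretrained, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.pool = torch.nn.Linear(d_model, d_model)      # trainable sentence bottleneck
        self.decoder = torch.nn.TransformerDecoderLayer(
            d_model, nhead=8, batch_first=True)            # single trainable decoder layer
        self.lm_head = torch.nn.Linear(d_model, vocab_size)

    def forward(self, noised_embs: torch.Tensor) -> torch.Tensor:
        # noised_embs: (batch, seq, d) embeddings of a corrupted sentence
        hidden = self.encoder(noised_embs)                     # frozen contextual states
        sentence = torch.tanh(self.pool(hidden.mean(dim=1)))   # (batch, d) bottleneck vector
        out = self.decoder(tgt=noised_embs, memory=sentence.unsqueeze(1))
        return self.lm_head(out)                               # logits for the original tokens

# Toy usage with a stand-in frozen encoder
encoder_stub = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
model = SentenceBottleneckAE(encoder_stub, d_model=256, vocab_size=30522)
logits = model(torch.randn(4, 32, 256))                    # (4, 32, 30522)
```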
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
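The Funnel-Transformer compression idea reduces to pooling the hidden sequence between blocks so that later blocks attend over fewer positions. The sketch below uses stride-2 mean pooling on full hidden states; the paper pools only the attention queries at each compression step and can re-expand the sequence with a decoder for token-level tasks, both of which are simplified away here.

```python
import torch
import torch.nn.functional as F

class FunnelBlock(torch.nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, pool: bool = True):
        super().__init__()
        self.layer = torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pool = pool

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        if self.pool:
            # (batch, seq, d) -> (batch, seq // 2, d) via stride-2 mean pooling
            pooled = F.avg_pool1d(hidden.transpose(1, 2), kernel_size=2, stride=2)
            hidden = pooled.transpose(1, 2)
        return self.layer(hidden)

# Toy usage: sequence length halves at each pooled block, cutting attention cost
x = torch.randn(2, 128, 512)
for blk in [FunnelBlock(pool=False), FunnelBlock(), FunnelBlock()]:
    x = blk(x)
print(x.shape)  # torch.Size([2, 32, 512])
```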
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
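As a rough sketch of the segment-aware mechanism, the position signal below combines paragraph, sentence, and within-sentence token indices instead of one running token index. Segatron actually injects this into Transformer-XL's relative position encoding; the additive absolute embeddings and all table sizes here are simplifying assumptions.

```python
import torch

class SegmentAwarePositions(torch.nn.Module):
    def __init__(self, d_model: int = 512, max_para: int = 64,
                 max_sent: int = 128, max_tok: int = 512):
        super().__init__()
        self.para = torch.nn.Embedding(max_para, d_model)
        self.sent = torch.nn.Embedding(max_sent, d_model)
        self.tok = torch.nn.Embedding(max_tok, d_model)

    def forward(self, para_idx, sent_idx, tok_idx):
        """Each argument: (batch, seq) long tensor. The paragraph index is shared
        across a paragraph, the sentence index across a sentence, and the token
        index restarts within each sentence."""
        return self.para(para_idx) + self.sent(sent_idx) + self.tok(tok_idx)

# Toy usage: two sentences in one paragraph, added to the token embeddings downstream
para = torch.tensor([[0, 0, 0, 0, 0, 0]])
sent = torch.tensor([[0, 0, 0, 1, 1, 1]])
tok = torch.tensor([[0, 1, 2, 0, 1, 2]])
pos = SegmentAwarePositions()(para, sent, tok)   # (1, 6, 512)
```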
This list is automatically generated from the titles and abstracts of the papers on this site.