Griffin: Mixing Gated Linear Recurrences with Local Attention for
Efficient Language Models
- URL: http://arxiv.org/abs/2402.19427v1
- Date: Thu, 29 Feb 2024 18:24:46 GMT
- Title: Griffin: Mixing Gated Linear Recurrences with Local Attention for
Efficient Language Models
- Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George
Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen,
Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee
Whye Teh, Razvan Pascanu, Nando De Freitas and Caglar Gulcehre
- Abstract summary: We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention.
Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens.
We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
- Score: 101.70220733111164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
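To make the core mechanism concrete, here is a minimal sketch of a diagonal gated linear recurrence in the spirit of Hawk's recurrent block; the gate parameterization is illustrative, and the paper's actual RG-LRU layer differs in detail:

```python
import numpy as np

def gated_linear_recurrence(x, W_a, W_x):
    """Toy diagonal gated linear recurrence over a sequence x of shape (T, D).

    Each channel evolves independently:
        a_t = sigmoid(x_t @ W_a)            # input-dependent gate in (0, 1)
        h_t = a_t * h_{t-1} + (1 - a_t) * (x_t @ W_x)
    Illustrative only: Griffin's RG-LRU uses a different gate and scaling.
    """
    T, D = x.shape
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(T):
        a = 1.0 / (1.0 + np.exp(-(x[t] @ W_a)))  # sigmoid gate
        h = a * h + (1.0 - a) * (x[t] @ W_x)     # gated linear state update
        out[t] = h
    return out
```

Because the recurrent state has a fixed size regardless of sequence length, each decoding step costs constant memory and compute, which is where the inference latency and throughput advantages over Transformers come from.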
Related papers
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
We also experiment with two hybrid models that combine DeltaNet layers with either sliding-window attention layers interleaved every other layer or two global attention layers.
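For intuition, a minimal sequential sketch of the delta-rule fast-weight update that the paper parallelizes over sequence length (shapes and names are illustrative):

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule update of the fast-weight state S, shape (D_k, D_v).

    The state is corrected toward storing value v at key k:
        S <- S + beta * outer(k, v - S^T k)
    and is read with a query as o = S^T q. This naive loop is the
    reference; the paper's contribution is a hardware-efficient
    parallelization of it over the sequence dimension.
    """
    pred = S.T @ k                           # value currently retrieved for key k
    return S + beta * np.outer(k, v - pred)  # move retrieval toward target v
```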
arXiv Detail & Related papers (2024-06-10T17:24:42Z) - RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [103.59785165735727]
We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture.
Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks.
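A toy sketch of the sliding-window causal mask that local attention uses in such hybrids (the window size is an assumption, not RecurrentGemma's actual configuration):

```python
import numpy as np

def sliding_window_mask(T, window=4):
    """Boolean (T, T) mask where position t attends only to positions
    max(0, t - window + 1) .. t, so attention cost grows linearly in
    sequence length instead of quadratically."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)
```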
arXiv Detail & Related papers (2024-04-11T15:27:22Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
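A minimal reference implementation of the recurrent form of gated linear attention (unnormalized and illustrative; the paper's chunkwise algorithm computes the same quantity with far better hardware utilization):

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """q, k: (T, D_k); v: (T, D_v); g: (T, D_k) data-dependent gates in (0, 1).

    State update and read at each step:
        S_t = diag(g_t) @ S_{t-1} + outer(k_t, v_t)
        o_t = S_t^T @ q_t
    """
    T, Dk = q.shape
    Dv = v.shape[1]
    S = np.zeros((Dk, Dv))
    out = np.empty((T, Dv))
    for t in range(T):
        S = g[t][:, None] * S + np.outer(k[t], v[t])  # gated decay + write
        out[t] = S.T @ q[t]                           # read with query
    return out
```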
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [92.55145016562867]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
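For intuition, here is the plain column-row sampling (CRS) estimator that WTA-CRS builds on; the winner-take-all refinement that lowers its variance is the paper's contribution and is not reproduced here:

```python
import numpy as np

def crs_matmul(A, B, c, rng=np.random.default_rng(0)):
    """Unbiased c-sample CRS estimate of A @ B.

    Column i of A and row i of B are sampled with probability p_i
    proportional to ||A[:, i]|| * ||B[i, :]||, and each sampled term is
    rescaled by 1 / (c * p_i), which makes the expectation exactly A @ B.
    """
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)
    return (A[:, idx] / (c * p[idx])) @ B[idx, :]
```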
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks [21.616328837090396]
Spiking Neural Networks (SNNs) leverage sparse and event-driven activations to reduce the computational overhead associated with model inference.
We implement a generative language model with binary, event-driven spiking activation units.
SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language.
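A toy leaky integrate-and-fire unit showing why spiking activations are binary and event-driven (SpikeGPT's actual units and its surrogate-gradient training are considerably more involved):

```python
import numpy as np

def lif_neuron(currents, threshold=1.0, decay=0.9):
    """currents: (T,) input current per timestep. The membrane potential
    leaks by `decay`, integrates the input, and emits a binary spike when
    it crosses `threshold`, then resets; most steps emit nothing, which
    is the source of the sparsity that reduces inference cost."""
    v, spikes = 0.0, np.zeros_like(currents)
    for t, current in enumerate(currents):
        v = decay * v + current
        if v >= threshold:
            spikes[t] = 1.0
            v = 0.0  # hard reset after firing
    return spikes
```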
arXiv Detail & Related papers (2023-02-27T16:43:04Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, neural systems perform worse on very long sequences than the earlier phrase-based approaches did.
We show that the observed drop in performance is driven by the hypothesis length relative to the lengths seen during training, rather than by the length of the input sequence itself.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
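A minimal sketch of FNet's parameter-free mixing sublayer, assuming NumPy: a Fourier transform along the hidden dimension, then along the sequence dimension, keeping only the real part, as in the paper:

```python
import numpy as np

def fnet_mixing(x):
    """x: (T, D) token embeddings. Replaces the self-attention sublayer
    with two 1D DFTs and a real projection; there are no learned
    parameters and no quadratic-in-T attention matrix."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real
```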
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
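To see why a recurrent counterpart is efficient at generation, here is a sketch of one decoding step of feature-map linear attention run as an RNN; the feature map below is a placeholder, whereas the paper learns it while finetuning the pretrained transformer:

```python
import numpy as np

def linear_attention_step(S, z, q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """One decoding step with constant-size state.

    S: (D_k, D_v) accumulated outer products; z: (D_k,) normalizer.
        S <- S + outer(phi(k), v);  z <- z + phi(k)
        o = S^T phi(q) / (z . phi(q))
    Each step is O(D_k * D_v) no matter how many tokens precede it.
    """
    fk, fq = phi(k), phi(q)
    S = S + np.outer(fk, v)          # accumulate key-value outer products
    z = z + fk                       # accumulate normalizer
    return S, z, (S.T @ fq) / (z @ fq)
```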
arXiv Detail & Related papers (2021-03-24T10:50:43Z)