Griffin: Mixing Gated Linear Recurrences with Local Attention for
Efficient Language Models
- URL: http://arxiv.org/abs/2402.19427v1
- Date: Thu, 29 Feb 2024 18:24:46 GMT
- Title: Griffin: Mixing Gated Linear Recurrences with Local Attention for
Efficient Language Models
- Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George
Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen,
Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee
Whye Teh, Razvan Pascanu, Nando De Freitas and Caglar Gulcehre
- Abstract summary: We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention.
Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens.
We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
- Score: 101.70220733111164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
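As a rough illustration of the two mixing primitives the abstract combines, below is a minimal NumPy sketch of (1) a gated linear recurrence in the spirit of the paper's RG-LRU layer and (2) causal local (sliding-window) attention, composed as two residual blocks. The gate parameterization, window size, weight shapes, and block layout are illustrative assumptions, not the authors' implementation, which also interleaves MLP blocks and normalization.
```python
# Minimal sketch, assuming single-head attention, one feature dimension D,
# and simplified gates; not the authors' implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(x, W_gate, W_in, log_lambda, c=8.0):
    """Gated linear recurrence: h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)."""
    T, D = x.shape
    h = np.zeros(D)
    out = np.zeros_like(x)
    a_base = sigmoid(log_lambda)                 # learnable per-channel decay in (0, 1)
    for t in range(T):
        r_t = sigmoid(x[t] @ W_gate)             # recurrence gate
        i_t = sigmoid(x[t] @ W_in)               # input gate
        a_t = a_base ** (c * r_t)                # input-dependent effective decay
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i_t * x[t])
        out[t] = h
    return out

def local_attention(x, W_q, W_k, W_v, window=4):
    """Single-head causal attention restricted to the last `window` positions."""
    T, D = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(D)
        w = np.exp(scores - scores.max())        # softmax over the local window
        out[t] = (w / w.sum()) @ v[lo:t + 1]
    return out

# Hybrid layout: a residual recurrence block followed by a residual
# local-attention block (MLPs and normalization omitted for brevity).
rng = np.random.default_rng(0)
T, D = 16, 8
x = rng.standard_normal((T, D))

def rand_mat():
    return 0.1 * rng.standard_normal((D, D))

x = x + gated_linear_recurrence(x, rand_mat(), rand_mat(), rng.standard_normal(D))
x = x + local_attention(x, rand_mat(), rand_mat(), rand_mat(), window=4)
print(x.shape)  # (16, 8)
```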
Related papers
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show that, with academic GPU resources, it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from their attention layers.
The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule (a naive reference form of this update is sketched after this list).
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [103.59785165735727]
We introduce RecurrentGemma, a family of open language models using Google's novel Griffin architecture.
Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks.
We provide two model sizes, with 2B and 9B parameters, and release pre-trained and instruction-tuned variants of both.
arXiv Detail & Related papers (2024-04-11T15:27:22Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance.
We provide both theoretical and experimental evidence that, in the context of tuning transformers, our estimators exhibit lower variance than existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks [21.616328837090396]
Spiking Neural Networks (SNNs) leverage sparse and event-driven activations to reduce the computational overhead associated with model inference.
We implement a generative language model with binary, event-driven spiking activation units.
SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language.
arXiv Detail & Related papers (2023-02-27T16:43:04Z) - Sequence Length is a Domain: Length-based Overfitting in Transformer
Models [0.0]
In machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches.
We show that the observed drop in performance is caused by whether the hypothesis length corresponds to the lengths seen by the model during training, rather than by the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z) - FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs (a minimal sketch of the Fourier token-mixing step appears after this list).
arXiv Detail & Related papers (2021-05-09T03:32:48Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
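For the "Parallelizing Linear Transformers with the Delta Rule" entry above, the recurrence being parallelized can be written naively as follows. This sketch is only the sequential textbook update, not the paper's chunkwise, hardware-efficient algorithm, and the shapes and per-step write strengths are illustrative assumptions.
```python
# Naive sequential sketch of the delta-rule "fast weight" update; the paper's
# contribution is a chunkwise, hardware-efficient parallel form of this.
import numpy as np

def delta_rule_linear_attention(q, k, v, beta):
    """q, k: (T, d_k); v: (T, d_v); beta: (T,) per-step write strengths in (0, 1)."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))                     # recurrent state ("fast weights")
    out = np.zeros((T, d_v))
    for t in range(T):
        pred = S @ k[t]                          # what the memory currently returns for k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])   # error-correcting (delta) update
        out[t] = S @ q[t]                        # read out with the query
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
beta = 1.0 / (1.0 + np.exp(-rng.standard_normal(T)))    # keep write strengths in (0, 1)
o = delta_rule_linear_attention(rng.standard_normal((T, d)),
                                rng.standard_normal((T, d)),
                                rng.standard_normal((T, d)), beta)
print(o.shape)  # (16, 8)
```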
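For the FNet entry above, the token-mixing step it describes can be sketched as a parameter-free 2D Fourier transform over the sequence and hidden dimensions, keeping only the real part; the residual wrapper and shapes below are illustrative assumptions rather than the released model.
```python
# Parameter-free Fourier token mixing in the spirit of FNet: 2D DFT over
# (sequence, hidden), keep the real part; shapes here are illustrative.
import numpy as np

def fourier_token_mixing(x):
    """x: (seq_len, hidden) -> real part of a 2D DFT over both axes."""
    return np.real(np.fft.fft2(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
mixed = x + fourier_token_mixing(x)   # residual connection, as in a Transformer encoder sublayer
print(mixed.shape)                    # (16, 8)
```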