Primer: Searching for Efficient Transformers for Language Modeling
- URL: http://arxiv.org/abs/2109.08668v1
- Date: Fri, 17 Sep 2021 17:50:39 GMT
- Title: Primer: Searching for Efficient Transformers for Language Modeling
- Authors: David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le
- Abstract summary: Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
- Score: 79.2677566332444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer models have been central to recent advances in natural
language processing. The training and inference costs of these models, however,
have grown rapidly and become prohibitively expensive. Here we aim to reduce
the costs of Transformers by searching for a more efficient variant. Compared
to previous approaches, our search is performed at a lower level, over the
primitives that define a Transformer TensorFlow program. We identify an
architecture, named Primer, that has a smaller training cost than the original
Transformer and other variants for auto-regressive language modeling. Primer's
improvements can be mostly attributed to two simple modifications: squaring
ReLU activations and adding a depthwise convolution layer after each Q, K, and
V projection in self-attention.
Experiments show Primer's gains over Transformer increase as compute scale
grows and follow a power law with respect to quality at optimal model sizes. We
also verify empirically that Primer can be dropped into different codebases to
significantly speed up training without additional tuning. For example, at a
500M parameter size, Primer improves the original T5 architecture on C4
auto-regressive language modeling, reducing the training cost by 4X.
Furthermore, the reduced training cost means Primer needs much less compute to
reach a target one-shot performance. For instance, in a 1.9B parameter
configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to
achieve the same one-shot performance as Transformer. We open source our models
and several comparisons in T5 to help with reproducibility.
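The two modifications named in the abstract are small enough to sketch directly. Below is a minimal, illustrative sketch (not the authors' released implementation) of a squared-ReLU activation and a causal depthwise convolution applied along the sequence axis to a projected Q, K, or V tensor. The function names, shapes, and the flat per-channel layout are assumptions made here for brevity; the paper uses a narrow depthwise convolution applied per attention head.

```python
import numpy as np

def squared_relu(x):
    """Primer's feed-forward activation: ReLU followed by squaring."""
    return np.maximum(x, 0.0) ** 2

def causal_depthwise_conv(x, kernel):
    """Depthwise (per-channel) convolution along the sequence axis.

    x:      [seq_len, d] output of a Q, K, or V projection (illustrative layout).
    kernel: [width, d] one filter per channel; only current and past positions
            contribute, preserving auto-regressive causality.
    """
    seq_len, d = x.shape
    width = kernel.shape[0]
    out = np.zeros_like(x)
    for t in range(seq_len):
        for k in range(width):
            if t - k >= 0:
                out[t] += kernel[k] * x[t - k]
    return out

# Toy usage with illustrative sizes (seq_len=8, d=4, width=3).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))                            # stand-in Q projection
q = causal_depthwise_conv(q, rng.normal(size=(3, 4)))  # conv after the projection
h = squared_relu(rng.normal(size=(8, 16)))             # stand-in FFN pre-activation
```

Both changes are local: the activation swap touches only the feed-forward block, and the convolution adds one small filter per channel after each projection, which is consistent with the abstract's claim that Primer can be dropped into existing codebases without additional tuning.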
Related papers
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding [43.68557263195205]
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
We propose three task-specific structured pruning methods to deal with such heterogeneous networks.
Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and can reduce computation by 40% to 50% without any degradation.
arXiv Detail & Related papers (2023-02-27T20:39:54Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time and improves perplexity.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size [41.624797099537375]
We present a novel method for adding recurrence to pretrained transformer language models, improving their efficiency and usable context size.
We find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora.
arXiv Detail & Related papers (2020-08-16T23:19:30Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.