Adding Recurrence to Pretrained Transformers for Improved Efficiency and
Context Size
- URL: http://arxiv.org/abs/2008.07027v1
- Date: Sun, 16 Aug 2020 23:19:30 GMT
- Title: Adding Recurrence to Pretrained Transformers for Improved Efficiency and
Context Size
- Authors: Davis Yoshida, Allyson Ettinger, Kevin Gimpel
- Abstract summary: We present a novel method for applying pretrained transformer language models.
We find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora.
- Score: 41.624797099537375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning a pretrained transformer for a downstream task has become a
standard method in NLP in the last few years. While the results from these
models are impressive, applying them can be extremely computationally
expensive, as is pretraining new models with the latest architectures. We
present a novel method for applying pretrained transformer language models
which lowers their memory requirement both at training and inference time. An
additional benefit is that our method removes the fixed context size constraint
that most transformer models have, allowing for more flexible use. When applied
to the GPT-2 language model, we find that our method attains better perplexity
than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora, for a
given amount of computation or memory.
Related papers
- Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers [0.21756081703276003]
This thesis provides methods and analysis of models which make progress on this goal.
We introduce two new finetuning methods which add new capabilities to the models they are used on.
We provide theoretical and empirical insights on the divergence of model-likelihood and output quality.
arXiv Detail & Related papers (2024-08-29T03:50:24Z) - Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z) - Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z) - Trainable Transformer in Transformer [48.754918968374334]
We propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference.
TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers.
These findings suggest that large pre-trained language models are capable of performing intricate inferences.
arXiv Detail & Related papers (2023-07-03T17:53:39Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Train Matrix(TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing the perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Quantization-Aware and Tensor-Compressed Training of Transformers for
Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated in two natural language understanding tasks, showing up to $63times$ compression ratio, little accuracy loss and remarkable inference and training speedup.
arXiv Detail & Related papers (2023-06-01T18:32:08Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.