Efficient GPT Model Pre-training using Tensor Train Matrix
Representation
- URL: http://arxiv.org/abs/2306.02697v1
- Date: Mon, 5 Jun 2023 08:38:25 GMT
- Title: Efficient GPT Model Pre-training using Tensor Train Matrix
Representation
- Authors: Viktoriia Chekalina, Georgii Novikov, Julia Gusak, Ivan Oseledets,
Alexander Panchenko
- Abstract summary: Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing the perplexity comparable to the original model.
- Score: 65.96485282393361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale transformer models have shown remarkable performance in language
modelling tasks. However, such models feature billions of parameters, leading
to difficulties in their deployment and prohibitive training costs from
scratch. To reduce the number of parameters in the GPT-2 architecture, we
replace the matrices of fully-connected layers with the corresponding Tensor
Train Matrix (TTM) structure. Finally, we customize forward and backward
operations through the TTM-based layer for simplicity and the stability of
further training. The resulting GPT-2-based model stores up to 40% fewer
parameters, showing perplexity comparable to the original model. On the
downstream tasks, including language understanding and text summarization, the
model performs similarly to the original GPT-2 model. The proposed tensorized
layers can be used to efficiently pre-train other Transformer models.
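For intuition, here is a minimal sketch (not the authors' implementation) of a fully-connected layer whose weight matrix is stored as Tensor Train Matrix cores. The class name, the mode factorizations (8, 12, 8) and (12, 16, 16) for a 768 -> 3072 GPT-2 FFN projection, and the TT-rank of 16 are illustrative assumptions; the paper additionally customizes the forward and backward passes through the cores, whereas this sketch simply reconstructs the dense matrix for clarity.

```python
import math
import torch
import torch.nn as nn


class TTMLinear(nn.Module):
    """Linear layer whose weight matrix is stored as Tensor Train Matrix cores."""

    def __init__(self, in_modes, out_modes, rank):
        super().__init__()
        assert len(in_modes) == len(out_modes)
        self.in_modes, self.out_modes = in_modes, out_modes
        self.in_features = math.prod(in_modes)
        self.out_features = math.prod(out_modes)
        ranks = [1] + [rank] * (len(in_modes) - 1) + [1]
        # One small 4-way core per factor pair: shape (r_{k-1}, m_k, n_k, r_k).
        self.cores = nn.ParameterList([
            nn.Parameter(0.02 * torch.randn(ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))
        ])
        self.bias = nn.Parameter(torch.zeros(self.out_features))

    def full_weight(self):
        # Contract the TT cores back into a dense (in_features, out_features)
        # matrix. A memory-efficient implementation would instead contract the
        # input with the cores one by one and never materialise the full matrix.
        w = self.cores[0]
        for core in self.cores[1:]:
            w = torch.tensordot(w, core, dims=([w.ndim - 1], [0]))
        w = w.squeeze(0).squeeze(-1)            # axes: (m1, n1, m2, n2, ..., mk, nk)
        k = len(self.in_modes)
        perm = list(range(0, 2 * k, 2)) + list(range(1, 2 * k, 2))  # rows first, then columns
        return w.permute(perm).reshape(self.in_features, self.out_features)

    def forward(self, x):
        return x @ self.full_weight() + self.bias


# Illustrative factorization of a GPT-2-small FFN projection (768 -> 3072).
layer = TTMLinear(in_modes=(8, 12, 8), out_modes=(12, 16, 16), rank=16)
x = torch.randn(2, 768)
y = layer(x)                                    # shape (2, 3072)
print(sum(p.numel() for p in layer.cores), "TTM parameters vs", 768 * 3072, "dense parameters")
```

With these illustrative settings the cores hold a few tens of thousands of parameters instead of the roughly 2.4 million of the dense matrix; the actual savings in the paper depend on the chosen mode factorizations and TT-ranks.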
Related papers
- Trainable Transformer in Transformer [48.754918968374334]
We propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference.
TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers.
These findings suggest that large pre-trained language models are capable of performing intricate inferences.
arXiv Detail & Related papers (2023-07-03T17:53:39Z)
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition [22.84674270619026]
This work proposes an approach based on the Tensor-Train Decomposition (TTD).
Each token embedding is treated as a Matrix Product State (MPS) that can be efficiently computed in a distributed manner; a minimal sketch of this MPS view appears after the list below.
The experimental results on GPT-2 demonstrate that, through our approach, the embedding layer can be compressed by a factor of up to 38.40, and at a compression factor of 3.31 it even yields better performance than the original GPT-2 model.
arXiv Detail & Related papers (2023-07-02T09:33:09Z)
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activations and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal language models such as the Transformer model for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperforms other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- End-to-End Training for Back-Translation with Categorical Reparameterization Trick [0.0]
Back-translation is an effective semi-supervised learning framework in neural machine translation (NMT).
A pre-trained NMT model translates monolingual sentences and makes synthetic bilingual sentence pairs for the training of the other NMT model.
The discrete property of translated sentences prevents gradient information from flowing between the two NMT models.
arXiv Detail & Related papers (2022-02-17T06:31:03Z)
- Improving Neural Machine Translation by Denoising Training [95.96569884410137]
We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation.
We update the model parameters with source- and target-side denoising tasks at the early stage and then tune the model normally.
Experiments show DoT consistently improves the neural machine translation performance across 12 bilingual and 16 multilingual directions.
arXiv Detail & Related papers (2022-01-19T00:11:38Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
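The TensorGPT entry above treats each token embedding as a Matrix Product State; below is a minimal sketch of that idea under illustrative assumptions. The helpers tt_svd_vector and tt_reconstruct are hypothetical (not the TensorGPT code), the mode factorization (8, 12, 8) of a 768-dimensional embedding and the maximum TT-rank of 4 are arbitrary choices, and a random vector stands in for a real GPT-2 embedding.

```python
import torch


def tt_svd_vector(vec, modes, max_rank):
    """Hypothetical helper: split a 1-D vector of length prod(modes) into
    MPS/TT cores of shape (r_{k-1}, modes[k], r_k) via successive truncated SVDs."""
    cores, r_prev = [], 1
    rest = vec.reshape(modes[0], -1)
    for k in range(len(modes) - 1):
        u, s, vh = torch.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, s.numel())
        cores.append(u[:, :r].reshape(r_prev, modes[k], r))
        rest = (s[:r, None] * vh[:r]).reshape(r * modes[k + 1], -1)
        r_prev = r
    cores.append(rest.reshape(r_prev, modes[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the cores back into a dense vector to check the approximation."""
    t = cores[0]
    for core in cores[1:]:
        t = torch.tensordot(t, core, dims=([t.ndim - 1], [0]))
    return t.reshape(-1)


# Illustrative: compress one 768-dimensional token embedding with modes (8, 12, 8).
emb = torch.randn(768)                          # stand-in for a real GPT-2 embedding
cores = tt_svd_vector(emb, modes=(8, 12, 8), max_rank=4)
approx = tt_reconstruct(cores)
print(sum(c.numel() for c in cores), "core entries instead of", emb.numel())
print("relative reconstruction error:",
      (torch.linalg.norm(emb - approx) / torch.linalg.norm(emb)).item())
```

The achievable compression and accuracy depend on the chosen modes, ranks, and how compressible the real embeddings are; the figures quoted in the TensorGPT summary above come from that paper's own experiments, not from this sketch.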