Kronecker Decomposition for GPT Compression
- URL: http://arxiv.org/abs/2110.08152v1
- Date: Fri, 15 Oct 2021 15:28:39 GMT
- Title: Kronecker Decomposition for GPT Compression
- Authors: Ali Edalati, Marzieh Tahaei, Ahmad Rashid, Vahid Partovi Nia, James J.
Clark, Mehdi Rezagholizadeh
- Abstract summary: GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain.
Despite its superior performance, GPT's overparameterized nature can be prohibitive for deploying the model on devices with limited computational power or memory.
In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model.
- Score: 8.60086973058282
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: GPT is an auto-regressive Transformer-based pre-trained language model which
has attracted a lot of attention in the natural language processing (NLP)
domain due to its state-of-the-art performance in several downstream tasks. The
success of GPT is mostly attributed to its pre-training on a huge amount of data
and its large number of parameters (from ~100M to billions of parameters).
Despite the superior performance of GPT (especially in few-shot or zero-shot
setups), this overparameterized nature of GPT can be prohibitive for
deploying the model on devices with limited computational power or memory.
This problem can be mitigated using model compression techniques; however,
compressing GPT models has not been investigated much in the literature. In
this work, we use Kronecker decomposition to compress the linear mappings of
the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the
Kronecker-decomposed version of the GPT-2 model and then undergoes very light
pre-training on only a small portion of the training data with
intermediate layer knowledge distillation (ILKD). Finally, KnGPT2 is
fine-tuned on downstream tasks using ILKD as well. We evaluate our model on
both language modeling and General Language Understanding Evaluation (GLUE) benchmark
tasks and show that, with more efficient pre-training and a similar number of
parameters, our KnGPT2 outperforms the existing DistilGPT2 model significantly.
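Below is a minimal sketch, assuming a PyTorch setting, of the core idea: a dense linear layer y = Wx + b is replaced by a Kronecker-factored one with W ≈ A ⊗ B, and the two factors are initialized from the pre-trained weight via the nearest-Kronecker-product rearrangement (Van Loan-Pitsianis) followed by a rank-1 SVD. The factor shapes, module names, and initialization scale are illustrative assumptions, not the paper's exact configuration, and the ILKD training loop is omitted.

```python
# Minimal sketch, not the authors' implementation: a Kronecker-factored
# linear layer y = (A ⊗ B) x + bias, where A is (m1 x n1) and B is (m2 x n2),
# so the dense weight would be (m1*m2) x (n1*n2). All shapes are assumptions.
import torch
import torch.nn as nn


def nearest_kronecker(W, m1, n1, m2, n2):
    """Best Frobenius-norm A ⊗ B approximation of W ((m1*m2) x (n1*n2)),
    via the Van Loan-Pitsianis rearrangement and a rank-1 SVD."""
    # Rearrange W so each (m2 x n2) block becomes one row; A ⊗ B becomes rank 1.
    R = W.reshape(m1, m2, n1, n2).permute(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    A = (S[0].sqrt() * U[:, 0]).reshape(m1, n1)
    B = (S[0].sqrt() * Vh[0, :]).reshape(m2, n2)
    return A, B


class KroneckerLinear(nn.Module):
    """Stores the two Kronecker factors instead of the full weight matrix."""

    def __init__(self, m1, n1, m2, n2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(m1, n1) * 0.02)
        self.B = nn.Parameter(torch.randn(m2, n2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(m1 * m2))
        self.n1, self.n2 = n1, n2

    @classmethod
    def from_dense(cls, W, m1, n1, m2, n2):
        """Initialize the factors from a pre-trained dense weight W."""
        layer = cls(m1, n1, m2, n2)
        A, B = nearest_kronecker(W, m1, n1, m2, n2)
        layer.A.data.copy_(A)
        layer.B.data.copy_(B)
        return layer

    def forward(self, x):  # x: (..., n1 * n2)
        # (A ⊗ B) x is computed as A @ X @ B.T with X = x reshaped to (n1, n2),
        # so the full (m1*m2) x (n1*n2) matrix is never materialized.
        lead = x.shape[:-1]
        X = x.reshape(-1, self.n1, self.n2)
        Y = self.A @ X @ self.B.T              # (batch, m1, m2)
        return Y.reshape(*lead, -1) + self.bias


# Example with assumed factor shapes: compress one 768 -> 3072 GPT-2 MLP projection.
dense = nn.Linear(768, 3072)                   # weight shape (3072, 768) = (m1*m2, n1*n2)
kron = KroneckerLinear.from_dense(dense.weight.data, m1=64, n1=32, m2=48, n2=24)
# Dense weight: 768*3072 ≈ 2.36M parameters; the two factors: 64*32 + 48*24 = 3,200.
```

Storing A and B instead of W cuts the per-layer parameter count from (m1*m2)(n1*n2) to m1*n1 + m2*n2, which is where the compression comes from; the light pre-training with ILKD described in the abstract is then used to recover accuracy after this factorized initialization.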
Related papers
- GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning [48.71952325015267]
We apply PEFT methods to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes.
We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process.
This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models.
arXiv Detail & Related papers (2024-07-05T14:16:47Z) - Aligning GPTRec with Beyond-Accuracy Goals with Reinforcement Learning [67.71952251641545]
GPTRec is an alternative to the Top-K model for item-by-item recommendations.
Our experiments on two datasets show that GPTRec's Next-K generation approach offers a better tradeoff between accuracy and secondary metrics than classic greedy re-ranking techniques.
arXiv Detail & Related papers (2024-03-07T19:47:48Z) - TQCompressor: improving tensor decomposition methods in neural networks
via permutations [0.0]
We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions.
This enhancement makes it possible to reduce the loss in model expressivity that is usually associated with factorization.
TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative evaluations.
arXiv Detail & Related papers (2024-01-29T18:07:56Z) - Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z) - TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition [19.897367559948336]
We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD).
We then investigate the low-rank structures extracted by this approach in terms of the compression ratio, the language task performance, and latency on a typical low-end device (i.e. Raspberry Pi).
arXiv Detail & Related papers (2023-07-02T09:33:09Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure (see the TTM sketch after this list).
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to that of the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - InheritSumm: A General, Versatile and Compact Summarizer by Distilling
from GPT [75.29359361404073]
InheritSumm is a versatile and compact summarization model derived from GPT-3.5 through distillation.
It achieves similar or superior performance to GPT-3.5 in zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-05-22T14:52:32Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - A Short Study on Compressing Decoder-Based Language Models [9.090064110056224]
Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks.
State-of-the-art PLMs are too large to be used on edge devices.
The topic of model compression has attracted increasing attention in the NLP community.
arXiv Detail & Related papers (2021-10-16T03:37:08Z)
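Several of the entries above (TQCompressor, TensorGPT, and the Tensor Train Matrix paper) compress fully-connected layers with Tensor-Train structures. The sketch below is a minimal illustration of the TTM idea under assumed mode sizes and TT-ranks; it is not the cited papers' configuration, and a practical layer would contract the input with the cores directly instead of rebuilding the dense matrix.

```python
# Minimal sketch of a Tensor-Train-Matrix (TTM) factorized weight: a dense
# (prod(m_k) x prod(n_k)) matrix is stored as small 4-D cores G_k of shape
# (r_{k-1}, m_k, n_k, r_k). Mode sizes and ranks below are assumptions.
import torch


def ttm_reconstruct(cores):
    """Contract TTM cores back into the dense (prod(m_k) x prod(n_k)) matrix."""
    W = cores[0]                                            # (1, m1, n1, r1)
    for G in cores[1:]:
        a, i, j, b = W.shape
        _, k, l, c = G.shape
        # contract the shared TT-rank and merge row modes / column modes
        W = torch.einsum("aijb,bklc->aikjlc", W, G).reshape(a, i * k, j * l, c)
    return W.squeeze(0).squeeze(-1)                         # boundary ranks are 1


# Example: a 768 x 3072 GPT-2 MLP weight with ms=(12, 8, 8), ns=(16, 12, 16).
ms, ns, ranks = (12, 8, 8), (16, 12, 16), (1, 8, 8, 1)
cores = [torch.randn(ranks[k], ms[k], ns[k], ranks[k + 1]) * 0.02 for k in range(3)]
W = ttm_reconstruct(cores)                                  # dense (768, 3072)
print(W.shape, sum(c.numel() for c in cores), W.numel())    # ~8.7K core params vs ~2.36M dense
```

The reconstruction is only there to show the storage saving: the cores together hold a few thousand parameters, while the dense matrix they represent holds millions.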