TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on
the Tensor-Train Decomposition
- URL: http://arxiv.org/abs/2307.00526v1
- Date: Sun, 2 Jul 2023 09:33:09 GMT
- Title: TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on
the Tensor-Train Decomposition
- Authors: Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
- Abstract summary: This work proposes an approach based on the Tensor-Train Decomposition (TTD).
Each token embedding is treated as a Matrix Product State (MPS) that can be efficiently computed in a distributed manner.
The experimental results on GPT-2 demonstrate that, through this approach, the embedding layer can be compressed by a factor of up to 38.40 and that, at a compression factor of 3.31, the compressed model even outperforms the original GPT-2 model.
- Score: 22.84674270619026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional token embeddings underpin Large Language Models (LLMs), as
they can capture subtle semantic information and significantly enhance the
modelling of complex language patterns. However, the associated high
dimensionality also introduces considerable model parameters, and a
prohibitively high model storage. To address this issue, this work proposes an
approach based on the Tensor-Train Decomposition (TTD), where each token
embedding is treated as a Matrix Product State (MPS) that can be efficiently
computed in a distributed manner. The experimental results on GPT-2 demonstrate
that, through our approach, the embedding layer can be compressed by a factor
of up to 38.40 times and, at a compression factor of 3.31 times, it even yields
better performance than the original GPT-2 model.
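A minimal sketch of the core idea follows (this is not the authors' released code): a single token embedding vector is reshaped into a higher-order tensor and factorized into Matrix Product State (MPS) / tensor-train cores via sequential truncated SVDs. The GPT-2 embedding size of 768, the mode factorization 4*4*4*4*3, and the rank cap max_rank are illustrative assumptions; the storage saving comes from keeping only the small cores, with the approximation error controlled by the truncation rank.

# Hedged sketch of TT/MPS compression of one token embedding (assumed shapes).
import numpy as np

def tt_decompose(vector, dims, max_rank=4):
    """Return TT/MPS cores whose contraction approximates `vector`."""
    assert np.prod(dims) == vector.size
    cores = []
    rest = vector.reshape(dims)
    r_prev = 1
    for k, d in enumerate(dims[:-1]):
        mat = rest.reshape(r_prev * d, -1)            # unfold the current mode
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                     # truncate the TT rank
        cores.append(u[:, :r].reshape(r_prev, d, r))
        rest = (np.diag(s[:r]) @ vt[:r]).reshape((r,) + tuple(dims[k + 1:]))
        r_prev = r
    cores.append(rest.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the MPS cores back into a dense vector."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(-1)

# A random vector stands in for one GPT-2-sized embedding, purely for shape checking.
embedding = np.random.randn(768).astype(np.float32)
cores = tt_decompose(embedding, dims=(4, 4, 4, 4, 3), max_rank=4)
approx = tt_reconstruct(cores)
print(f"params: {embedding.size} -> {sum(c.size for c in cores)}, "
      f"rel. error: {np.linalg.norm(embedding - approx) / np.linalg.norm(embedding):.3f}")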
Related papers
- Tensor Polynomial Additive Model [40.30621617188693]
The TPAM preserves the inherent interpretability of additive models, transparent decision-making and the extraction of meaningful feature values.
It can enhance accuracy by up to 30%, and compression rate by up to 5 times, while maintaining a good interpretability.
arXiv Detail & Related papers (2024-06-05T06:23:11Z) - Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - Rethinking Compression: Reduced Order Modelling of Latent Features in
Large Language Models [9.91972450276408]
This paper introduces an innovative approach for the parametric and practical compression of Large Language Models (LLMs) based on reduced order modelling.
Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.
arXiv Detail & Related papers (2023-12-12T07:56:57Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure; a parameter-count sketch of this layout appears after this list.
The resulting GPT-based model stores up to 40% fewer parameters, showing the perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operators (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Hierarchical mixtures of Gaussians for combined dimensionality reduction
and clustering [5.819751855626331]
We show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG).
An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function.
We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models.
arXiv Detail & Related papers (2022-06-10T02:03:18Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z) - Embedding Compression with Isotropic Iterative Quantization [40.567720430910725]
Continuous representation of words is a standard component in deep learning-based NLP models.
We propose an isotropic iterative quantization (IIQ) approach for compressing embedding vectors into binary ones.
arXiv Detail & Related papers (2020-01-11T20:53:55Z)
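As a complement to the Tensor Train Matrix (TTM) entry above, the following is a back-of-the-envelope sketch (not code from any of the listed papers) of why a TTM layout shrinks a fully-connected layer. The GPT-2 MLP weight shape 768 x 3072, the particular mode factorizations, and the TT rank of 16 are illustrative assumptions, not values reported by the papers.

# Hedged parameter-count comparison: dense weight matrix vs. an assumed TTM layout.
import numpy as np

in_modes  = (4, 4, 6, 8)     # 4*4*6*8  = 768
out_modes = (4, 8, 8, 12)    # 4*8*8*12 = 3072
rank = 16
ranks = (1, rank, rank, rank, 1)  # TT ranks r_0..r_4 around the four cores

# Each TTM core G_k has shape (r_{k-1}, m_k, n_k, r_k).
ttm_params = sum(
    ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
    for k in range(len(in_modes))
)
dense_params = int(np.prod(in_modes)) * int(np.prod(out_modes))
print(f"dense: {dense_params}, TTM: {ttm_params}, "
      f"compression: {dense_params / ttm_params:.1f}x")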