TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on
the Tensor-Train Decomposition
- URL: http://arxiv.org/abs/2307.00526v1
- Date: Sun, 2 Jul 2023 09:33:09 GMT
- Title: TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on
the Tensor-Train Decomposition
- Authors: Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
- Abstract summary: This work proposes an approach based on the Tensor-Train Decomposition (TTD).
Each token embedding is treated as a Matrix Product State (MPS) that can be efficiently computed in a distributed manner.
The experimental results on GPT-2 demonstrate that our approach compresses the embedding layer by a factor of up to 38.40, and that at a compression factor of 3.31 the compressed model even outperforms the original GPT-2 model.
- Score: 22.84674270619026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional token embeddings underpin Large Language Models (LLMs), as
they can capture subtle semantic information and significantly enhance the
modelling of complex language patterns. However, the associated high dimensionality also introduces a considerable number of model parameters and prohibitively high model storage. To address this issue, this work proposes an
approach based on the Tensor-Train Decomposition (TTD), where each token
embedding is treated as a Matrix Product State (MPS) that can be efficiently
computed in a distributed manner. Experimental results on GPT-2 demonstrate that our approach can compress the embedding layer by a factor of up to 38.40, and that at a compression factor of 3.31 the compressed model even outperforms the original GPT-2 model.
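As a concrete illustration of the core idea, below is a minimal sketch that applies a plain TT-SVD to a single token embedding, treating a 768-dimensional GPT-2 embedding vector as a 4 x 4 x 4 x 12 tensor and truncating the TT-ranks. The reshape dimensions, rank cap, and random test vector are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def tt_decompose(vec, dims, max_rank):
    """Split a 1-D embedding vector into Tensor-Train (MPS) cores via sequential truncated SVDs."""
    assert np.prod(dims) == vec.size
    cores, rank = [], 1
    rest = vec.reshape(rank * dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(rest, full_matrices=False)
        r_new = min(max_rank, S.size)                      # truncate the TT-rank
        cores.append(U[:, :r_new].reshape(rank, dims[k], r_new))
        rest = (np.diag(S[:r_new]) @ Vt[:r_new]).reshape(r_new * dims[k + 1], -1)
        rank = r_new
    cores.append(rest.reshape(rank, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into the full embedding vector."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))    # contract the shared rank index
    return out.reshape(-1)

# Hypothetical example: one GPT-2 token embedding (d_model = 768), reshaped as 4 x 4 x 4 x 12.
embedding = np.random.randn(768)
cores = tt_decompose(embedding, dims=(4, 4, 4, 12), max_rank=8)
approx = tt_reconstruct(cores)

print("relative error:", np.linalg.norm(embedding - approx) / np.linalg.norm(embedding))
print("compression factor:", embedding.size / sum(c.size for c in cores))
```

Applying this to every row of the vocabulary-by-dimension embedding matrix gives the per-token MPS representation described above. The achievable compression factor depends on the chosen reshape and the rank cap; with a random vector, as here, the truncation error is large, whereas the paper reports factors of up to 38.40 on real GPT-2 embeddings.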
Related papers
- TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices [19.897367559948336]
This paper proposes a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). We evaluate the extracted low-rank structures across compression ratio, language task performance, latency, and energy consumption on a typical low-end device, i.e. a Raspberry Pi.
arXiv Detail & Related papers (2025-06-16T14:09:43Z)
- Forget the Data and Fine-Tuning! Just Fold the Network to Compress [13.611551223875194]
We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers.
We show that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods.
This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.
arXiv Detail & Related papers (2025-02-14T15:10:43Z)
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP).
ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.
We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST).
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- Knowledge Translation: A New Pathway for Model Compression [22.106103818486144]
This paper proposes Knowledge Translation (KT), in which a "translation" model is trained to receive the parameters of a larger model and generate compressed parameters.
We propose a comprehensive framework for KT, introduce data augmentation strategies to enhance model performance despite restricted training data, and successfully demonstrate the feasibility of KT on the MNIST dataset.
arXiv Detail & Related papers (2024-01-11T09:25:42Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure; a minimal sketch of a TT-matrix layer is given after this list.
The resulting GPT-based model stores up to 40% fewer parameters, with perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques: knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [52.624194343095304]
We argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions.
We empirically show that common pre-trained models have a very low intrinsic dimension.
arXiv Detail & Related papers (2020-12-22T07:42:30Z)
- Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework achieves compelling performance up to high degrees of sparsity, can be easily applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z)
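For the Tensor Train Matrix entry above, the sketch below illustrates, under assumed settings, how a dense fully-connected weight can be parameterised by TT-matrix cores and what the parameter saving looks like. The mode sizes (for a 768 x 3072 GPT-2 MLP weight), the TT-ranks of 16, and the random cores are illustrative assumptions, not that paper's implementation.

```python
import numpy as np

# Hypothetical mode sizes for a 768 x 3072 fully-connected weight:
# 768 = 8 * 8 * 12 (output modes), 3072 = 8 * 8 * 48 (input modes), TT-ranks capped at 16.
out_modes, in_modes = (8, 8, 12), (8, 8, 48)
ranks = (1, 16, 16, 1)

# Random TT-matrix cores of shape (r_{k-1}, o_k, i_k, r_k); in practice these would be
# trained directly or obtained by decomposing a pretrained weight matrix.
rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[k], out_modes[k], in_modes[k], ranks[k + 1])) * 0.02
         for k in range(len(out_modes))]

def ttm_to_matrix(cores, out_modes, in_modes):
    """Contract TT-matrix cores into the dense (prod(out_modes), prod(in_modes)) weight."""
    full = cores[0]                                    # (1, o1, i1, r1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[0, ..., 0]                             # drop the boundary ranks of size 1
    K = len(out_modes)                                 # modes are ordered (o1, i1, ..., oK, iK)
    full = full.transpose(list(range(0, 2 * K, 2)) + list(range(1, 2 * K, 2)))
    return full.reshape(int(np.prod(out_modes)), int(np.prod(in_modes)))

W = ttm_to_matrix(cores, out_modes, in_modes)
dense_params = W.size
ttm_params = sum(c.size for c in cores)
print(W.shape, f"parameter reduction: {dense_params / ttm_params:.1f}x")
```

In a deployed TT-matrix layer the cores would normally be contracted with the input activations directly rather than materialising the dense weight, which is what keeps both storage and compute low; the dense reconstruction here is only for checking shapes and counting parameters.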