TQCompressor: improving tensor decomposition methods in neural networks
via permutations
- URL: http://arxiv.org/abs/2401.16367v1
- Date: Mon, 29 Jan 2024 18:07:56 GMT
- Title: TQCompressor: improving tensor decomposition methods in neural networks
via permutations
- Authors: V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov,
I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein
- Abstract summary: We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions.
This enhancement makes it possible to reduce the loss in model expressivity that is usually associated with factorization.
TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative evaluations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce TQCompressor, a novel method for neural network model
compression with improved tensor decompositions. We explore the challenges
posed by the computational and storage demands of pre-trained language models
in NLP tasks and propose a permutation-based enhancement to Kronecker
decomposition. This enhancement makes it possible to reduce the loss in model
expressivity that is usually associated with factorization. We demonstrate the
method by applying it to GPT-2$_{small}$. The result of the compression is the
TQCompressedGPT-2 model, with 81 million parameters compared to 124 million in
GPT-2$_{small}$. We make TQCompressedGPT-2 publicly available. We further
enhance the performance of TQCompressedGPT-2 through a training strategy
involving multi-step knowledge distillation, using only 3.1% of the
OpenWebText dataset. TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative
evaluations, marking an advancement in the efficient and effective deployment
of models in resource-constrained environments.
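The core idea, per the abstract, is to permute the weight matrix before factorizing it so that the permuted matrix is better approximated by a Kronecker product. Below is a minimal sketch of that idea: Van Loan's classical nearest-Kronecker-product construction (a rank-1 SVD of a rearranged matrix) combined with a toy random search over row permutations. The paper's actual permutation algorithm is not reproduced here; everything beyond the Kronecker approximation itself is illustrative.

```python
import numpy as np

def nearest_kron(W, a_shape, b_shape):
    """Van Loan's nearest Kronecker product: W ~= kron(A, B), found by a
    rank-1 SVD of a rearranged version of W. W has shape (a1*b1, a2*b2)."""
    (a1, a2), (b1, b2) = a_shape, b_shape
    # Rearrange W so that each (b1, b2) block becomes one row of R.
    R = W.reshape(a1, b1, a2, b2).transpose(0, 2, 1, 3).reshape(a1 * a2, b1 * b2)
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(S[0]) * U[:, 0].reshape(a1, a2)
    B = np.sqrt(S[0]) * Vt[0].reshape(b1, b2)
    return A, B

def kron_error(W, a_shape, b_shape):
    A, B = nearest_kron(W, a_shape, b_shape)
    return np.linalg.norm(W - np.kron(A, B))

# Toy random search over row permutations (a stand-in for the paper's algorithm):
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
best_perm = np.arange(16)
best_err = kron_error(W[best_perm], (4, 4), (4, 4))
for _ in range(500):
    cand = rng.permutation(16)
    err = kron_error(W[cand], (4, 4), (4, 4))
    if err < best_err:
        best_perm, best_err = cand, err
# W[best_perm] is now closer to a single Kronecker product, which is what
# lets the factorized layer retain more of the original expressivity.
```

At inference a permutation is just an index reordering, so the only extra storage it requires is the permutation itself.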
Related papers
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules composed of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
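As a rough illustration of the matrix-pair idea, under the simplifying assumption that the two matrices act back-to-back with no nonlinearity between them (e.g., value and output projections), the shared hidden dimension can be shrunk with a single truncated SVD and no backward propagation. MoDeGPT's actual per-module decompositions are more elaborate; this is only a sketch:

```python
import numpy as np

def compress_pair(W1, W2, r):
    """Jointly compress a linear pair y = W2 @ (W1 @ x) by a truncated SVD
    of the product, shrinking the shared hidden dimension to r."""
    U, S, Vt = np.linalg.svd(W2 @ W1, full_matrices=False)
    W1_new = np.diag(S[:r]) @ Vt[:r]   # (r, d_in)
    W2_new = U[:, :r]                  # (d_out, r)
    return W1_new, W2_new              # W2_new @ W1_new is the best rank-r fit
```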
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
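The connection between language models and compression is standard: paired with arithmetic coding, a model that assigns probability p to a sequence can encode it in about -log2 p bits, giving a computable upper bound in the spirit of Kolmogorov complexity. A minimal sketch (the token probabilities are made-up numbers, and the paper's actual procedure may differ):

```python
import math

def code_length_bits(token_probs):
    """Near-optimal code length achievable with arithmetic coding under a
    model: -sum(log2 p(token | prefix)) bits for the whole sequence."""
    return -sum(math.log2(p) for p in token_probs)

predictable   = [0.9, 0.8, 0.95, 0.7, 0.85]   # model finds the text predictable
unpredictable = [0.01] * 5                    # model is surprised at every step
print(code_length_bits(predictable))    # ~1.3 bits
print(code_length_bits(unpredictable))  # ~33.2 bits
```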
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition [19.897367559948336]
We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD).
We then investigate the low-rank structures extracted by this approach, in terms of the compression ratio, the language task performance, and latency on a typical low-end device (i.e., a Raspberry Pi).
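For reference, here is a minimal TT-SVD sketch in the spirit of TTD: a vector (e.g., one token embedding) is reshaped into a small tensor and factorized into a chain of low-rank cores by sequential truncated SVDs. The mode sizes and rank below are illustrative, not the paper's:

```python
import numpy as np

def tt_svd(vec, modes, max_rank):
    """Factor a vector, reshaped to `modes`, into Tensor-Train cores of
    shape (r_prev, mode, r_next) via sequential truncated SVDs."""
    cores, r = [], 1
    m = vec.reshape(modes).reshape(r * modes[0], -1)
    for k in range(len(modes) - 1):
        U, S, Vt = np.linalg.svd(m, full_matrices=False)
        rk = min(max_rank, len(S))
        cores.append(U[:, :rk].reshape(r, modes[k], rk))
        m = (np.diag(S[:rk]) @ Vt[:rk]).reshape(rk * modes[k + 1], -1)
        r = rk
    cores.append(m.reshape(r, modes[-1], 1))
    return cores

vec = np.random.default_rng(0).standard_normal(768)
cores = tt_svd(vec, (8, 8, 12), max_rank=4)   # 768 = 8 * 8 * 12
approx = cores[0]
for c in cores[1:]:                           # contract the TT chain back
    approx = np.tensordot(approx, c, axes=1)
err = np.linalg.norm(vec - approx.reshape(-1))  # truncation error
```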
arXiv Detail & Related papers (2023-07-02T09:33:09Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters while showing perplexity comparable to the original model.
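A quick back-of-the-envelope example of where such savings come from (the mode sizes and ranks below are hypothetical, chosen only so the products match GPT-2's 768x3072 MLP weight):

```python
def ttm_params(in_modes, out_modes, ranks):
    """Parameter count of a Tensor Train Matrix: core k stores a
    (ranks[k], in_modes[k], out_modes[k], ranks[k+1]) array."""
    return sum(r0 * m * n * r1
               for r0, m, n, r1 in zip(ranks[:-1], in_modes, out_modes, ranks[1:]))

full = 768 * 3072                                           # 2,359,296 params
ttm  = ttm_params((4, 8, 24), (8, 16, 24), (1, 64, 64, 1))  # 563,200 params
print(ttm / full)                                           # ~0.24 of dense size
```

The paper's overall ~40% figure is lower than this per-layer ratio, plausibly because embeddings and other non-factorized layers stay dense.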
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose a compression approach for on-demand compute reduction in wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same model, we obtain a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
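A minimal sketch of what a token-level contrastive distillation loss can look like (InfoNCE over matching student/teacher token representations; the paper's exact formulation and negative sampling are not reproduced here):

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(student, teacher, temperature=0.1):
    """Each student token embedding should be closest to its own teacher
    embedding (positive) and far from other tokens' (in-batch negatives),
    which pushes word embeddings apart instead of letting them collapse.
    Shapes: (num_tokens, dim)."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature                      # (N, N) similarities
    labels = torch.arange(s.size(0), device=s.device)   # positive = same index
    return F.cross_entropy(logits, labels)
```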
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Low-rank Tensor Decomposition for Compression of Convolutional Neural Networks Using Funnel Regularization [1.8579693774597708]
We propose a model reduction method to compress the pre-trained networks using low-rank tensor decomposition.
A new regularization method, called the funnel function, is proposed to suppress unimportant factors during compression.
For ResNet18 on ImageNet2012, our reduced model achieves more than a two-times speedup in terms of GMACs with merely a 0.7% Top-1 accuracy drop.
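The basic low-rank step (without the funnel regularizer, whose exact form the summary does not give) can be illustrated by splitting one convolution into a smaller convolution plus a 1x1 convolution via truncated SVD of the kernel's output-channel unfolding; a sketch:

```python
import numpy as np

def decompose_conv(kernel, rank):
    """Split a (C_out, C_in, k, k) conv kernel into a (rank, C_in, k, k) conv
    followed by a (C_out, rank, 1, 1) pointwise conv, via truncated SVD."""
    c_out, c_in, kh, kw = kernel.shape
    U, S, Vt = np.linalg.svd(kernel.reshape(c_out, -1), full_matrices=False)
    spatial   = (np.diag(S[:rank]) @ Vt[:rank]).reshape(rank, c_in, kh, kw)
    pointwise = U[:, :rank].reshape(c_out, rank, 1, 1)
    return spatial, pointwise  # composing the two convs approximates the original
```

Per the summary, the funnel regularizer then drives unimportant factors toward zero so the rank can be chosen during compression rather than fixed in advance.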
arXiv Detail & Related papers (2021-12-07T13:41:51Z)
- A Short Study on Compressing Decoder-Based Language Models [9.090064110056224]
Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks.
State-of-the-art PLMs are too large to be used on edge devices.
The topic of model compression has attracted increasing attention in the NLP community.
arXiv Detail & Related papers (2021-10-16T03:37:08Z)
- Kronecker Decomposition for GPT Compression [8.60086973058282]
GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain.
Despite its superior performance, GPT can be prohibitively expensive to deploy on devices with limited computational power or memory.
In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model.
arXiv Detail & Related papers (2021-10-15T15:28:39Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys threshold-estimation quality similar to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
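The gist of distribution-based thresholding, as a sketch: instead of an exact (and expensive) top-k over all gradient entries, fit a sparsity-inducing distribution to the gradient magnitudes and read the threshold off its tail. The example below uses a single exponential fit; SIDCo itself supports several distributions and a multi-stage fitting procedure.

```python
import numpy as np

def exp_threshold_sparsify(grad, target_ratio):
    """Fit an exponential distribution to |grad| (MLE scale = mean) and pick
    the threshold whose tail probability equals the target sparsity ratio:
    P(|g| > t) = exp(-t / mean) = ratio  =>  t = -mean * ln(ratio)."""
    mags = np.abs(grad)
    t = -mags.mean() * np.log(target_ratio)
    mask = mags > t
    return grad * mask, mask

g = np.random.default_rng(0).laplace(size=1_000_000)  # |g| is exponential here
sparse_g, mask = exp_threshold_sparsify(g, target_ratio=0.001)
print(mask.mean())   # close to 0.001 when the fit matches the data
```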
arXiv Detail & Related papers (2021-01-26T13:06:00Z)