KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language
Models via Knowledge Distillation
- URL: http://arxiv.org/abs/2109.06243v1
- Date: Mon, 13 Sep 2021 18:19:30 GMT
- Title: KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language
Models via Knowledge Distillation
- Authors: Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi and
Mehdi Rezagholizadeh
- Abstract summary: We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition.
We present our KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework.
Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
- Score: 5.8287955127529365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of over-parameterized pre-trained language models has made a
significant contribution toward the success of natural language processing.
While over-parameterization of these models is the key to their generalization
power, it makes them unsuitable for deployment on low-capacity devices. We push
the limits of state-of-the-art Transformer-based pre-trained language model
compression using Kronecker decomposition. We use this decomposition for
compression of the embedding layer, all linear mappings in the multi-head
attention, and the feed-forward network modules in the Transformer layer. We
perform intermediate-layer knowledge distillation using the uncompressed model
as the teacher to improve the performance of the compressed model. We present
our KroneckerBERT, a compressed version of the BERT_BASE model obtained using
this framework. We evaluate the performance of KroneckerBERT on well-known NLP
benchmarks and show that for a high compression factor of 19 (5% of the size of
the BERT_BASE model), our KroneckerBERT outperforms state-of-the-art
compression methods on the GLUE benchmark. Our experiments indicate that the proposed
model has promising out-of-distribution robustness and is superior to the
state-of-the-art compression methods on SQuAD.
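To make the compression scheme concrete, below is a minimal sketch of a Kronecker-factored linear layer in PyTorch. It is illustrative only, not the authors' implementation: the factor shapes (`in_split`, `out_split`) and the initialization are assumptions, and an efficient version would avoid materializing the full Kronecker product in the forward pass.

```python
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    """A linear layer whose weight is stored as two Kronecker factors A and B."""

    def __init__(self, in_features, out_features, in_split, out_split):
        super().__init__()
        assert in_features % in_split == 0 and out_features % out_split == 0
        # Two small factors replace the full (out_features x in_features) weight.
        self.A = nn.Parameter(0.02 * torch.randn(out_split, in_split))
        self.B = nn.Parameter(
            0.02 * torch.randn(out_features // out_split, in_features // in_split)
        )
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Materializing A kron B keeps the sketch short; an efficient version
        # would instead reshape x and multiply by A and B separately.
        weight = torch.kron(self.A, self.B)  # (out_features, in_features)
        return x @ weight.t() + self.bias


# Example: a 768x768 attention projection stored as a 4x4 and a 192x192 factor.
layer = KroneckerLinear(768, 768, in_split=4, out_split=4)
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
print(768 * 768 / (4 * 4 + 192 * 192))    # ~16x fewer weight parameters
```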
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF).
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that for both high and low compression ratios, our compression model exhibits advanced performance.
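As an illustration of the decomposition step (not the paper's code), the sketch below factorizes a weight matrix with a truncated SVD; the rank is assumed to be supplied externally, e.g. by the VBMF estimate the paper describes.

```python
import torch


def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate a 2-D weight as U_r @ V_r with U_r: (out, r) and V_r: (r, in)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold the singular values into the left factor
    V_r = Vh[:rank, :]
    return U_r, V_r


W = torch.randn(512, 512)
U_r, V_r = low_rank_factorize(W, rank=32)   # rank would come from VBMF in the paper
print(((U_r @ V_r) - W).norm() / W.norm())  # relative reconstruction error
```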
arXiv Detail & Related papers (2024-08-29T06:40:34Z)
- Unified Low-rank Compression Framework for Click-through Rate Prediction [15.813889566241539]
We propose a unified low-rank decomposition framework for compressing CTR prediction models.
Our framework can achieve better performance than the original model.
Our framework can be applied to embedding tables and layers in various CTR prediction models.
arXiv Detail & Related papers (2024-05-28T13:06:32Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
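A hedged sketch of a Top-K sparsification operator of the kind referred to above; the keep ratios and the absence of error feedback are illustrative assumptions, not the paper's settings.

```python
import torch


def topk_compress(t: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the largest-magnitude entries of t and zero out the rest."""
    flat = t.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(t)


grads = torch.randn(4, 1024)
acts = torch.randn(4, 1024)
# Per the summary, gradients tolerate only milder compression than activations,
# so a larger fraction of gradient entries is kept here.
sparse_grads = topk_compress(grads, keep_ratio=0.30)
sparse_acts = topk_compress(acts, keep_ratio=0.10)
```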
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
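The sketch below shows one plausible form of module-wise dynamic scaling: a symmetric quantizer with a learnable per-module scale and a straight-through estimator for the rounding step. It is an assumption-laden stand-in for the paper's method; the token-level contrastive distillation loss is not shown.

```python
import torch
import torch.nn as nn


class ScaledQuantizer(nn.Module):
    """Symmetric quantizer with a learnable scale, one instance per module."""

    def __init__(self, init_scale: float = 0.05, bits: int = 4):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learned per module
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        w_div = w / self.scale
        # Straight-through estimator for the non-differentiable rounding step;
        # the clamp and the multiplication by the scale stay differentiable, so
        # both the weights and the scale receive gradients.
        q = w_div + (torch.round(w_div) - w_div).detach()
        q = torch.clamp(q, -self.qmax, self.qmax)
        return q * self.scale


quant = ScaledQuantizer(bits=4)
w = 0.1 * torch.randn(768, 768)
print((quant(w) - w).abs().max())  # quantization error under the current scale
```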
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
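For intuition, the snippet below computes Optimal Brain Surgeon-style saliencies with a diagonal empirical Fisher standing in for the Hessian. This is a simplification of the paper's block-wise inverse-Hessian approximation, and all tensor shapes are illustrative.

```python
import torch

w = torch.randn(1000)                 # flattened weights of one layer (illustrative)
grads = 0.1 * torch.randn(64, 1000)   # per-sample gradients (illustrative)

# Diagonal empirical Fisher as a stand-in for the Hessian: H_ii ~ mean_n g_{n,i}^2.
fisher_diag = grads.pow(2).mean(dim=0) + 1e-8

# Optimal Brain Surgeon saliency rho_i = w_i^2 / (2 * [H^-1]_ii); with a diagonal
# approximation, [H^-1]_ii = 1 / H_ii, so rho_i = 0.5 * w_i^2 * H_ii.
saliency = 0.5 * w.pow(2) * fisher_diag

# Remove the 100 least-salient weights.
prune_idx = saliency.topk(100, largest=False).indices
w[prune_idx] = 0.0
print((w == 0).sum())  # number of pruned weights
```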
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
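A minimal sketch of weight binarization with a per-tensor scaling factor, for intuition only; BinaryBERT's ternary initialization and weight-splitting trick are not shown.

```python
import torch


def binarize(w: torch.Tensor) -> torch.Tensor:
    """Binarize weights to {-alpha, +alpha} with a per-tensor scaling factor."""
    alpha = w.abs().mean()   # scale minimizing the L2 error of the sign codes
    return alpha * torch.sign(w)


w = 0.05 * torch.randn(768, 3072)
w_bin = binarize(w)
print(torch.unique(w_bin))  # only two distinct values survive
```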
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning.
Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates.
It is suitable to deploy the final compressed model on resource-constrained edge devices.
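The sketch below shows one generic form of block-structured pruning: scoring fixed-size weight tiles by average magnitude and zeroing whole tiles. The block size and sparsity level are illustrative assumptions, not the paper's configuration.

```python
import torch


def block_prune(w: torch.Tensor, block: int, sparsity: float) -> torch.Tensor:
    """Zero out whole (block x block) tiles of a 2-D weight by magnitude score."""
    out_f, in_f = w.shape
    assert out_f % block == 0 and in_f % block == 0
    # View the matrix as a grid of tiles and score each tile by mean magnitude.
    tiles = w.reshape(out_f // block, block, in_f // block, block)
    scores = tiles.abs().mean(dim=(1, 3))                   # one score per tile
    k = int(sparsity * scores.numel())
    threshold = scores.flatten().kthvalue(k).values if k > 0 else -float("inf")
    mask = (scores > threshold).float()[:, None, :, None]   # keep high-score tiles
    return (tiles * mask).reshape(out_f, in_f)


w = torch.randn(768, 768)
w_pruned = block_prune(w, block=16, sparsity=0.75)
print((w_pruned == 0).float().mean())  # realized fraction of zeroed weights
```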
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.