A Short Study on Compressing Decoder-Based Language Models
- URL: http://arxiv.org/abs/2110.08460v1
- Date: Sat, 16 Oct 2021 03:37:08 GMT
- Title: A Short Study on Compressing Decoder-Based Language Models
- Authors: Tianda Li, Yassir El Mesbahi, Ivan Kobyzev, Ahmad Rashid, Atif Mahmud,
Nithin Anchuri, Habib Hajimolahoseini, Yang Liu, Mehdi Rezagholizadeh
- Abstract summary: Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks.
State-of-the-art PLMs, however, are too large to be used on edge devices.
The topic of model compression has attracted increasing attention in the NLP community.
- Score: 9.090064110056224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (PLMs) have been successful for a wide range of
natural language processing (NLP) tasks. State-of-the-art PLMs, however, are
too large to be used on edge devices. As a result, the topic of model
compression has attracted increasing attention in the NLP community. Most of
the existing works focus on compressing encoder-based models (TinyBERT,
DistilBERT, DistilRoBERTa, etc.); however, to the best of our knowledge, the
compression of decoder-based models (such as GPT-2) has not been investigated
much. Our paper aims to fill this gap. Specifically, we explore two directions:
1) we employ current state-of-the-art knowledge distillation techniques to
improve fine-tuning of DistilGPT-2. 2) we pre-train a compressed GPT-2 model
using layer truncation and compare it against the distillation-based method
(DistilGPT-2). The training time of our compressed model is significantly less
than that of DistilGPT-2, yet it achieves better performance when fine-tuned on
downstream tasks. We also demonstrate the impact of data cleaning on model
performance.
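For the second direction above, a layer-truncated GPT-2 can be built by keeping only a subset of the pre-trained transformer blocks and then continuing pre-training. A minimal sketch using Hugging Face `transformers` follows; keeping the first 6 of 12 blocks is an illustrative assumption, not necessarily the paper's exact truncation recipe.

```python
# Sketch of layer truncation: initialize a shallower GPT-2 from a subset of the
# pre-trained blocks. Which blocks are kept (here the first 6) is an assumption
# for illustration; the resulting model would then be further pre-trained.
from transformers import GPT2Config, GPT2LMHeadModel

def truncate_gpt2(model_name: str = "gpt2", keep_layers: int = 6) -> GPT2LMHeadModel:
    full = GPT2LMHeadModel.from_pretrained(model_name)
    # Same widths and vocabulary, but fewer transformer blocks.
    small_cfg = GPT2Config.from_pretrained(model_name, n_layer=keep_layers)
    small = GPT2LMHeadModel(small_cfg)

    # Reuse the token/position embeddings, final layer norm, and LM head.
    small.transformer.wte.load_state_dict(full.transformer.wte.state_dict())
    small.transformer.wpe.load_state_dict(full.transformer.wpe.state_dict())
    small.transformer.ln_f.load_state_dict(full.transformer.ln_f.state_dict())
    small.lm_head.load_state_dict(full.lm_head.state_dict())

    # Initialize the kept blocks from the first `keep_layers` blocks of GPT-2.
    for i in range(keep_layers):
        small.transformer.h[i].load_state_dict(full.transformer.h[i].state_dict())
    return small

small_gpt2 = truncate_gpt2()  # roughly half the transformer blocks of GPT-2
```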
Related papers
- TQCompressor: improving tensor decomposition methods in neural networks
via permutations [0.0]
We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions.
This enhancement makes it possible to reduce the loss in model expressivity that is usually associated with factorization.
TQCompressedGPT-2 surpasses DistilGPT-2 and KnGPT-2 in comparative evaluations.
arXiv Detail & Related papers (2024-01-29T18:07:56Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
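For reference, a minimal sketch of the TopK sparsification primitive studied in this entry; compressing each tensor as a whole (rather than per row or per layer) is an assumed granularity for illustration.

```python
# Sketch of TopK compression: keep only the largest-magnitude entries of a
# tensor (e.g. a gradient or activation) plus their indices.
import math
import torch

def topk_compress(x: torch.Tensor, ratio: float = 0.1):
    """Keep only the `ratio` fraction of largest-magnitude entries."""
    flat = x.flatten()
    k = max(1, int(ratio * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, tuple(x.shape)

def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape: tuple):
    """Scatter the kept entries back into a dense zero tensor."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

grad = torch.randn(256, 1024)
vals, idx, shape = topk_compress(grad, ratio=0.05)
restored = topk_decompress(vals, idx, shape)   # sparse approximation of `grad`
```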
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed to improve inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
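For context, the vanilla logit-distillation objective that KD-based compression builds on is sketched below; the temperature and loss weighting are illustrative choices, and CTCD's two-model co-training schedule is not reproduced here.

```python
# Sketch of the standard KD loss: blend hard-label cross-entropy with a
# KL term on temperature-softened teacher/student logits.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label CE and softened teacher KL."""
    ce = F.cross_entropy(student_logits, labels)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * (temperature ** 2) * kl

# Toy example with next-token logits over a GPT-2-sized vocabulary.
student_logits = torch.randn(4, 50257)
teacher_logits = torch.randn(4, 50257)
labels = torch.randint(0, 50257, (4,))
loss = kd_loss(student_logits, teacher_logits, labels)
```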
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
- TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition [19.897367559948336]
We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD).
We then investigate the low-rank structures extracted by this approach in terms of compression ratio, language-task performance, and latency on a typical low-end device (i.e., a Raspberry Pi).
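A minimal sketch of the TT-SVD factorization underlying Tensor-Train-based compression is given below; the tensorization shape (64 x 64 x 16 x 48) and the maximum rank are illustrative assumptions, not the paper's settings.

```python
# Sketch of TT-SVD: reshape a weight/embedding matrix into a higher-order
# tensor and factor it into a chain of small 3-way cores via truncated SVDs.
import torch

def tt_svd(tensor: torch.Tensor, max_rank: int):
    """Return TT cores G_k of shape (r_{k-1}, n_k, r_k) approximating `tensor`."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(dims[0], -1)
    for n_k in dims[:-1]:
        mat = mat.reshape(r_prev * n_k, -1)
        U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
        r_k = min(max_rank, S.numel())           # truncate to the target rank
        cores.append(U[:, :r_k].reshape(r_prev, n_k, r_k))
        mat = torch.diag(S[:r_k]) @ Vh[:r_k]
        r_prev = r_k
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

# Stand-in embedding matrix, tensorized as 64 x 64 x 16 x 48 before TT-SVD.
emb = torch.randn(64 * 64, 16 * 48)
cores = tt_svd(emb.reshape(64, 64, 16, 48), max_rank=8)

# Contract the cores back to check the low-rank reconstruction.
approx = cores[0]
for core in cores[1:]:
    approx = torch.tensordot(approx, core, dims=([approx.ndim - 1], [0]))
approx = approx.squeeze(0).squeeze(-1).reshape(emb.shape)
```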
arXiv Detail & Related papers (2023-07-02T09:33:09Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- Can Model Compression Improve NLP Fairness [3.172761915061083]
This is the first paper to examine the effect of distillation and pruning on the toxicity and bias of generative language models.
We test Knowledge Distillation and Pruning methods on the GPT-2 model and find a consistent pattern of toxicity and bias reduction.
arXiv Detail & Related papers (2022-01-21T05:14:51Z)
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques, knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
- Kronecker Decomposition for GPT Compression [8.60086973058282]
GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain.
Despite its superior performance, GPT can be prohibitively expensive to deploy on devices with limited computational power or memory.
In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model.
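As a rough illustration of the idea (not the paper's exact factorization, factor shapes, or initialization), a Kronecker-factored linear layer stores only two small factors A and B in place of a dense weight matrix:

```python
# Sketch: y = (A ⊗ B) x + bias, storing only the two small factors A and B.
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    def __init__(self, m1: int, n1: int, m2: int, n2: int, bias: bool = True):
        super().__init__()
        self.m1, self.n1, self.m2, self.n2 = m1, n1, m2, n2
        self.a = nn.Parameter(torch.randn(m1, n1) * (n1 * n2) ** -0.25)
        self.b = nn.Parameter(torch.randn(m2, n2) * (n1 * n2) ** -0.25)
        self.bias = nn.Parameter(torch.zeros(m1 * m2)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (..., n1 * n2)
        *batch, _ = x.shape
        x = x.reshape(*batch, self.n1, self.n2)
        # Kronecker identity: (A ⊗ B) vec(X) = vec(A X B^T) for row-major vec,
        # so the full (m1*m2) x (n1*n2) matrix is never materialized.
        y = torch.einsum("mi,...ij,nj->...mn", self.a, x, self.b)
        y = y.reshape(*batch, self.m1 * self.m2)
        return y if self.bias is None else y + self.bias

# Illustrative factor shapes for a 3072 x 768 projection: the two factors hold
# about 3.2K weights instead of roughly 2.4M for the dense matrix.
layer = KroneckerLinear(m1=64, n1=32, m2=48, n2=24)   # 64*48=3072, 32*24=768
out = layer(torch.randn(8, 768))                      # -> (8, 3072)
```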
arXiv Detail & Related papers (2021-10-15T15:28:39Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We employ a meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)