What do Compressed Large Language Models Forget? Robustness Challenges
in Model Compression
- URL: http://arxiv.org/abs/2110.08419v1
- Date: Sat, 16 Oct 2021 00:20:04 GMT
- Title: What do Compressed Large Language Models Forget? Robustness Challenges
in Model Compression
- Authors: Mengnan Du, Subhabrata Mukherjee, Yu Cheng, Milad Shokouhi, Xia Hu,
Ahmed Hassan Awadallah
- Abstract summary: We study two popular model compression techniques including knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
- Score: 68.82486784654817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have focused on compressing pre-trained language models (PLMs)
like BERT where the major focus has been to improve the compressed model
performance for downstream tasks. However, there has been no study in analyzing
the impact of compression on the generalizability and robustness of these
models. Towards this end, we study two popular model compression techniques
including knowledge distillation and pruning and show that compressed models
are significantly less robust than their PLM counterparts on adversarial test
sets although they obtain similar performance on in-distribution development
sets for a task. Further analysis indicates that the compressed models overfit
on the easy samples and generalize poorly on the hard ones. We further leverage
this observation to develop a regularization strategy for model compression
based on sample uncertainty. Experimental results on several natural language
understanding tasks demonstrate our mitigation framework to improve both the
adversarial generalization as well as in-distribution task performance of the
compressed models.
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - The Cost of Compression: Investigating the Impact of Compression on
Parametric Knowledge in Language Models [11.156816338995503]
Large language models (LLMs) provide faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z) - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Benchmarking Adversarial Robustness of Compressed Deep Learning Models [15.737988622271219]
This study seeks to understand the effect of adversarial inputs crafted for base models on their pruned versions.
Our findings reveal that while the benefits of pruning enhanced generalizability, compression, and faster inference times are preserved, adversarial robustness remains comparable to the base model.
arXiv Detail & Related papers (2023-08-16T06:06:56Z) - DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and
Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks.
These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency.
We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the textithomogeneous word embeddings
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Can Model Compression Improve NLP Fairness [3.172761915061083]
This is the first paper to examine the effect of distillation and pruning on the toxicity and bias of generative language models.
We test Knowledge Distillation and Pruning methods on the GPT2 model and found a consistent pattern of toxicity and bias reduction.
arXiv Detail & Related papers (2022-01-21T05:14:51Z) - Model Compression for Dynamic Forecast Combination [9.281199058905017]
We show that compressing dynamic forecasting ensembles into an individual model leads to a comparable predictive performance.
We also show that the compressed individual model with best average rank is a rule-based regression model.
arXiv Detail & Related papers (2021-04-05T09:55:35Z) - Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework has a compelling performance to high degrees of sparsity, can be easily applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.