What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models
- URL: http://arxiv.org/abs/2404.04759v1
- Date: Sat, 6 Apr 2024 23:52:53 GMT
- Title: What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models
- Authors: Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole,
- Abstract summary: This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa.
Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy.
- Score: 2.2871867623460216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compression techniques have been crucial in advancing machine learning by enabling efficient training and deployment of large-scale language models. However, these techniques have received limited attention in the context of low-resource language models, which are trained on even smaller amounts of data and under computational constraints, a scenario known as the "low-resource double-bind." This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa. Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy. Our study provides evidence that compression techniques significantly improve the efficiency and effectiveness of small-data language models, confirming that the prevailing beliefs regarding the effects of compression on large, heavily parameterized models hold true for less-parameterized, small-data models.
Related papers
- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared
Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z) - Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z) - Intriguing Properties of Compression on Multilingual Models [17.06142742945346]
We propose a framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning.
Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties.
arXiv Detail & Related papers (2022-11-04T20:28:01Z) - Too Brittle To Touch: Comparing the Stability of Quantization and
Distillation Towards Developing Lightweight Low-Resource MT Models [12.670354498961492]
State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
arXiv Detail & Related papers (2022-10-27T05:30:13Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a.
sparse-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z) - What do Compressed Large Language Models Forget? Robustness Challenges
in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z) - Robustness in Compressed Neural Networks for Object Detection [2.9823962001574182]
The sensitivity of compressed models to different distortion types is nuanced.
Some of the corruptions are heavily impacted by the compression methods.
Data augmentation was confirmed to positively affect models' robustness.
arXiv Detail & Related papers (2021-02-10T15:52:11Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.