oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes
- URL: http://arxiv.org/abs/2303.17612v3
- Date: Tue, 6 Jun 2023 16:30:09 GMT
- Title: oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes
- Authors: Daniel Campos, Alexandre Marques, Mark Kurtz, and ChengXiang Zhai
- Abstract summary: oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
- Score: 82.99830498937729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce the range of oBERTa language models, an
easy-to-use set of language models which allows Natural Language Processing
(NLP) practitioners to obtain between 3.8 and 24.3 times faster models without
expertise in model compression. Specifically, oBERTa extends existing work on
pruning, knowledge distillation, and quantization, and leverages frozen
embeddings, improved distillation, and improved model initialization to deliver
higher accuracy on a broad range of transfer tasks. In generating oBERTa, we
explore how the highly optimized RoBERTa differs from BERT for pruning during
pre-training and fine-tuning. We find that RoBERTa is less amenable to
compression during fine-tuning. We explore the use of oBERTa on seven
representative NLP tasks and find that the improved compression techniques
allow a pruned oBERTa model to match the performance of BERTbase and exceed the
performance of Prune OFA Large on the SQuAD v1.1 question answering dataset,
despite being 8x and 2x faster in inference, respectively. We release our code,
training regimes, and associated model for broad usage to encourage
experimentation.
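As a rough illustration of two techniques the abstract names, the sketch below shows unstructured magnitude pruning and a knowledge-distillation loss in plain Python. It is not the authors' implementation: the function names, the `alpha` and `temperature` parameters, the 50% sparsity level, and the toy weights are all assumptions chosen for illustration, and frozen embeddings and quantization are not covered.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL(teacher || student).

    `alpha` (hypothetical name) weights the distillation term; the KL term
    is computed on temperature-softened distributions and scaled by T^2.
    """
    cross_entropy = -math.log(softmax(student_logits)[label])
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


if __name__ == "__main__":
    toy_weights = [0.5, -0.1, 0.03, -0.8, 0.2, 0.01, -0.4, 0.07, 0.9, -0.02]
    sparse = magnitude_prune(toy_weights, sparsity=0.5)
    print(sparse)  # 5 of the 10 weights are zeroed
    print(distillation_loss([1.0, 0.2, -0.5], [2.0, 0.5, -1.0], label=0))
```

In practice the pruning mask is usually applied gradually over training steps while the distillation loss guides the sparse student toward a dense teacher; the one-shot call above only shows the masking rule itself.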
Related papers
- Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for
Parameter-Efficient BERT [6.029590006321152]
We present Sensi-BERT, a sensitivity driven efficient fine-tuning of BERT models for downstream tasks.
Our experiments show the efficacy of Sensi-BERT across different downstream tasks including MNLI, QQP, QNLI, SST-2 and SQuAD.
arXiv Detail & Related papers (2023-07-14T17:24:15Z)
- Sparse*BERT: Sparse Models Generalize To New tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
arXiv Detail & Related papers (2022-05-25T02:51:12Z)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for
Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
Vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model
Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
BERT is memory-intensive and leads to unsatisfactory latency of user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)