Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing
- URL: http://arxiv.org/abs/2109.10847v2
- Date: Thu, 23 Sep 2021 06:19:05 GMT
- Title: Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing
- Authors: Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu
- Abstract summary: Small-Bench NLP is a benchmark for small efficient neural language models trained on a single GPU.
Our ELECTRA-DeBERTa small model architecture (15M parameters) achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models which can be fine-tuned for specific tasks. These large models with billions of parameters, trained on numerous GPUs/TPUs over weeks, lead the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost- and time-effective smaller models trained on a single GPU. This will enable researchers with resource constraints to experiment with novel and innovative ideas on tokenization, pretraining tasks, architecture, fine-tuning methods, etc. We set up Small-Bench NLP, a benchmark for small, efficient neural language models trained on a single GPU. The Small-Bench NLP benchmark comprises eight NLP tasks on the publicly available GLUE datasets and a leaderboard to track the progress of the community. Our ELECTRA-DeBERTa (15M parameters) small model architecture achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters). Our models, code and leaderboard are available at https://github.com/smallbenchnlp
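The headline number reported above is an unweighted average over eight per-task GLUE scores. As a minimal illustrative sketch (not the authors' released evaluation code), the aggregation could look like the following; the exact task list is an assumption (GLUE minus WNLI) and should be checked against the repository:

```python
# Minimal, illustrative sketch of the benchmark's headline metric: the
# unweighted mean of eight per-task GLUE scores (each on a 0-100 scale).
# The task list below is an assumption (GLUE minus WNLI); check the
# Small-Bench NLP repository for the official configuration.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE"]

def benchmark_average(task_scores: dict) -> float:
    """Average the eight per-task scores into a single benchmark number."""
    missing = [task for task in GLUE_TASKS if task not in task_scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(task_scores[task] for task in GLUE_TASKS) / len(GLUE_TASKS)

# Hypothetical usage with made-up numbers (not reported results):
# avg = benchmark_average({"CoLA": 55.0, "SST-2": 91.2, "MRPC": 88.1,
#                          "STS-B": 87.4, "QQP": 89.7, "MNLI": 82.3,
#                          "QNLI": 89.9, "RTE": 68.5})
```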
Related papers
- Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
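As context for what masked-language-model pretraining from scratch involves, here is a heavily simplified sketch of the MLM objective under the usual 15% masking convention; it is a generic illustration, not the Cramming training recipe itself:

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mlm_probability: float = 0.15):
    """Generic MLM corruption sketch: hide ~15% of positions behind [MASK]
    and train the model to predict the original tokens at those positions.
    (Simplified: no 80/10/10 replacement split, no special-token handling.)"""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mlm_probability
    labels[~masked] = -100               # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels

# One hypothetical training step with any token-level LM `model`
# (mask_token_id=103 is just an example value):
# corrupted, labels = mask_tokens(batch_ids, mask_token_id=103)
# logits = model(corrupted)             # (batch, seq_len, vocab_size)
# loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
#                        ignore_index=-100)
```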
arXiv Detail & Related papers (2022-12-28T18:59:28Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones [0.0]
We propose an off-the-shelf baseline for Continual Learning on Computer Vision problems.
We exploit the power of pretrained models to compute a class prototype and fill a memory bank.
We compare our pipeline with common CNN models and show the superiority of Vision Transformers.
arXiv Detail & Related papers (2022-05-03T16:03:46Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
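For readers unfamiliar with the structure being referenced, the sketch below shows a generic top-1 routed mixture-of-experts feed-forward block; it illustrates the general MoE idea only and is not MoEBERT's importance-guided adaptation of BERT's feed-forward layers:

```python
import torch
import torch.nn as nn

class Top1MoEFFN(nn.Module):
    """Generic top-1 routed mixture-of-experts feed-forward block (illustrative)."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # per-token routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)        # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            selected = expert_idx == i
            if selected.any():
                out[selected] = expert(x[selected])       # only routed tokens pass through expert i
        return out
```

Because each token passes through only one expert, such a layer can grow total parameter count without a proportional increase in per-token compute, which is the capacity/speed trade-off the summary alludes to.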
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- pNLP-Mixer: an Efficient all-MLP Architecture for Language [10.634940525287014]
The pNLP-Mixer model for on-device NLP achieves high weight efficiency thanks to a novel projection layer.
We evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual semantic parsing datasets, MTOP and multiATIS.
Our model consistently beats the state-of-the-art tiny model, which is twice its size, by a margin of up to 7.8% on MTOP.
arXiv Detail & Related papers (2022-02-09T09:01:29Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
- A Tensor Compiler for Unified Machine Learning Prediction Serving [8.362773007171118]
Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure.
Model scoring is a primary contributor to infrastructure complexity and cost as models are trained once but used many times.
We propose HUMMINGBIRD, a novel approach to model scoring that compiles featurization operators and traditional ML models into a small set of tensor operations.
arXiv Detail & Related papers (2020-10-09T21:02:47Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We search for the best BERT model structure for a given computation size, matched to specific devices.
Our framework guarantees that the identified model meets both the resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network computation for simple instances and a late (and accurate) exit for harder ones.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
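To make the early-exit idea in the last entry concrete, below is a generic confidence-threshold sketch; the per-layer classifier heads and the softmax-confidence exit criterion are illustrative assumptions rather than the exact mechanism of that paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitClassifier(nn.Module):
    """Generic early-exit sketch: a small classifier head after every encoder
    layer; at inference we stop as soon as a head is confident enough."""
    def __init__(self, layers: nn.ModuleList, d_model: int, num_labels: int):
        super().__init__()
        self.layers = layers  # encoder layers mapping hidden states to hidden states
        self.heads = nn.ModuleList(nn.Linear(d_model, num_labels) for _ in layers)

    @torch.no_grad()
    def predict(self, hidden: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
        # hidden: (batch=1, seq_len, d_model); easy inputs exit at early layers,
        # hard ones fall through to the final head.
        logits = None
        for layer, head in zip(self.layers, self.heads):
            hidden = layer(hidden)
            logits = head(hidden[:, 0])                  # classify from the first token
            if F.softmax(logits, dim=-1).max() >= threshold:
                break                                    # confident enough: exit early
        return logits.argmax(dim=-1)
```

Raising the threshold makes the model behave like the full network, while lowering it trades accuracy for speed on easy instances, which is roughly the cost/accuracy knob such early-exit methods expose.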