Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- URL: http://arxiv.org/abs/2305.01645v3
- Date: Wed, 5 Jul 2023 20:41:59 GMT
- Title: Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- Authors: Junmo Kang, Wei Xu, Alan Ritter
- Abstract summary: Fine-tuning large models is highly effective; however, inference can be expensive and produce carbon emissions.
We show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data.
We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
- Score: 19.464992602919015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large models is highly effective; however, inference can be
expensive and produce carbon emissions. Knowledge distillation has been shown
to be a practical solution to reduce inference costs, but the distillation
process itself requires significant computational resources. Rather than buying
or renting GPUs to fine-tune, then distill a large model, an NLP practitioner
might instead choose to allocate the available budget to hire annotators and
manually label additional fine-tuning data. In this paper, we investigate how
to most efficiently use a fixed budget to build a compact model. Through
extensive experiments on six diverse tasks, we show that distilling from T5-XXL
(11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to
annotating more data to directly train a compact model (T5-Small). We further
investigate how the optimal budget allocated towards computation varies across
scenarios. We will make our code, datasets, annotation cost estimates, and
baseline models available as a benchmark to support further work on
cost-efficient training of compact models.
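To make the distillation route concrete, here is a minimal sketch of sequence-level distillation with Hugging Face Transformers, in which a fine-tuned teacher generates pseudo-labels on unlabeled task inputs and the compact student is then fine-tuned on them as if they were gold annotations. The model names, task prefix, and hyperparameters are illustrative stand-ins (the teacher is loaded as t5-small only to keep the snippet cheap to run), not the paper's exact recipe.

```python
# Minimal sketch of sequence-level distillation: the teacher labels unlabeled
# task inputs, and the compact student is fine-tuned on those pseudo-labels.
# Model names, the task prefix, and hyperparameters are illustrative only.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("t5-small")
teacher = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)  # stand-in for a fine-tuned T5-XXL
student = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)  # the compact model being distilled

# Hypothetical task-prefixed, unlabeled inputs (e.g., sentiment classification).
unlabeled_texts = ["sst2 sentence: a gripping, well-acted thriller."]

# 1) Teacher generates pseudo-labels for the unlabeled inputs.
inputs = tokenizer(unlabeled_texts, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    pseudo_ids = teacher.generate(**inputs, max_new_tokens=8)
pseudo_labels = tokenizer.batch_decode(pseudo_ids, skip_special_tokens=True)

# 2) Student is fine-tuned on (input, pseudo-label) pairs like ordinary supervised data.
label_ids = tokenizer(pseudo_labels, return_tensors="pt", padding=True).input_ids.to(device)
label_ids[label_ids == tokenizer.pad_token_id] = -100  # mask padding out of the loss
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
loss = student(**inputs, labels=label_ids).loss
loss.backward()
optimizer.step()
```

Under a fixed budget, the annotation alternative would instead spend the money on human labels and train the student on those directly; the paper's finding is that, for most budgets studied, the distillation route sketched above yields the better compact model.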
Related papers
- Revisiting Cascaded Ensembles for Efficient Inference [32.914852531806]
A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models (a minimal sketch of such a cascade appears after this list).
arXiv Detail & Related papers (2024-07-02T15:14:12Z)
- The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators [11.056579191156498]
Large pretrained models can be used as annotators, helping replace or augment crowdworkers.
This comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls.
We propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels.
arXiv Detail & Related papers (2024-06-25T17:58:26Z)
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS+ achieves an accuracy of 95.64%, while the cost is reduced to only 31.18% of the original.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
The original FLOP count can be reduced by up to 40% with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z)
- DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost while still maintaining 95% of model quality compared to the baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or better model quality under the same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
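As referenced in the "Revisiting Cascaded Ensembles for Efficient Inference" entry above, here is a minimal sketch of cascaded adaptive inference: a cheap ensemble answers when its members agree, and otherwise the input is escalated to a larger, more expressive model. The agreement-based deferral rule and the toy keyword classifiers are illustrative assumptions, not the paper's exact policy.

```python
# Minimal sketch of a cascade: try the cheap ensemble first and only escalate
# to the larger model when the cheap members disagree. All "models" here are
# toy stand-ins; a real cascade would wrap trained classifiers.
from collections import Counter
from typing import Callable, List, Sequence

def cascade_predict(
    x: str,
    tiers: Sequence[List[Callable[[str], str]]],  # tiers[0] = cheapest ensemble, tiers[-1] = most expressive
    min_agreement: float = 1.0,                   # fraction of members that must agree to stop early
) -> str:
    for ensemble in tiers[:-1]:
        votes = Counter(model(x) for model in ensemble)
        label, count = votes.most_common(1)[0]
        if count / len(ensemble) >= min_agreement:
            return label  # early exit: the cheap models agree, no need to pay for the big one
    # Otherwise fall through to the final (largest) tier and take its majority vote.
    votes = Counter(model(x) for model in tiers[-1])
    return votes.most_common(1)[0][0]

# Hypothetical usage with keyword "classifiers" standing in for real models.
small_a = lambda s: "positive" if "good" in s else "negative"
small_b = lambda s: "positive" if "great" in s else "negative"
large = lambda s: "positive" if any(w in s for w in ("good", "great", "fine")) else "negative"
print(cascade_predict("a good, fine movie", tiers=[[small_a, small_b], [large]]))  # -> positive
```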
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.