Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- URL: http://arxiv.org/abs/2305.01645v3
- Date: Wed, 5 Jul 2023 20:41:59 GMT
- Title: Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- Authors: Junmo Kang, Wei Xu, Alan Ritter
- Abstract summary: Fine-tuning large models is highly effective; however, inference is expensive and produces carbon emissions.
We show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data.
We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
- Score: 19.464992602919015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large models is highly effective; however, inference is
expensive and produces carbon emissions. Knowledge distillation has been shown
to be a practical solution to reduce inference costs, but the distillation
process itself requires significant computational resources. Rather than buying
or renting GPUs to fine-tune, then distill a large model, an NLP practitioner
might instead choose to allocate the available budget to hire annotators and
manually label additional fine-tuning data. In this paper, we investigate how
to most efficiently use a fixed budget to build a compact model. Through
extensive experiments on six diverse tasks, we show that distilling from T5-XXL
(11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to
annotating more data to directly train a compact model (T5-Small). We further
investigate how the optimal budget allocated towards computation varies across
scenarios. We will make our code, datasets, annotation cost estimates, and
baseline models available as a benchmark to support further work on
cost-efficient training of compact models.
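To make the distillation route concrete, the sketch below walks through the general workflow the paper weighs against extra annotation: fine-tune a T5-XXL teacher on the available labeled data, use it to pseudo-label unlabeled task inputs, and train T5-Small on those pseudo-labels. This is a minimal sketch assuming sequence-level (hard-label) distillation with Hugging Face transformers; the checkpoints, the example input, and the hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: distill a fine-tuned T5-XXL teacher into T5-Small via
# sequence-level (hard-label) distillation. Checkpoints, the toy input, and
# hyperparameters are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Teacher: a T5-XXL (11B) assumed to be already fine-tuned on the annotated
#    task data. In practice it would be loaded in reduced precision or sharded.
teacher_name = "google/t5-v1_1-xxl"          # placeholder for the fine-tuned teacher
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name).to(device).eval()

# 2) Pseudo-label unlabeled task inputs with the teacher.
unlabeled_inputs = ["sst2 sentence: a charming, quietly moving portrait."]
with torch.no_grad():
    enc = tokenizer(unlabeled_inputs, return_tensors="pt", padding=True).to(device)
    pseudo_ids = teacher.generate(**enc, max_new_tokens=8)
pseudo_labels = tokenizer.batch_decode(pseudo_ids, skip_special_tokens=True)

# 3) Fine-tune the compact student (60M) on the (input, pseudo-label) pairs.
student = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small").to(device).train()
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)

batch = tokenizer(unlabeled_inputs, return_tensors="pt", padding=True).to(device)
labels = tokenizer(pseudo_labels, return_tensors="pt", padding=True).input_ids.to(device)
labels[labels == tokenizer.pad_token_id] = -100   # mask padding out of the loss

loss = student(**batch, labels=labels).loss        # one illustrative training step
loss.backward()
optimizer.step()
```

Whether this pipeline or simply paying annotators to label more data is the better use of a fixed budget is the trade-off the paper quantifies across its six tasks.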
Related papers
- Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning [18.381178799923514]
Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT).
We study how to optimally allocate a fixed training data budget between the two stages.
arXiv Detail & Related papers (2025-02-16T21:57:35Z) - uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes [34.947522647009436]
We show that our best distilled models outperform the teacher model by 5-7 WER points and are on par with or outperform similar supervised data-filtering setups.
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z) - Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS+ achieves an accuracy of 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism for training small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint (a minimal sketch of this initialization appears after this list).
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost, while still maintaining 95% of model quality compared to baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
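As referenced in the sparse upcycling entry above, the following is a minimal PyTorch sketch of that initialization idea: every expert in a sparsely activated MoE layer starts as a copy of an existing dense feed-forward block, and only the router is trained from scratch. Layer sizes, top-1 routing, and the module names are illustrative assumptions, not the exact recipe from that paper.

```python
# Sketch of sparse upcycling: initialize each MoE expert from a dense FFN.
# Sizes and top-1 (switch-style) routing are illustrative assumptions.
import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """A dense feed-forward block, standing in for one from a trained dense checkpoint."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(torch.relu(self.wi(x)))

class UpcycledMoE(nn.Module):
    """MoE layer whose experts all start as copies of the dense FFN; only the router is new."""
    def __init__(self, dense_ffn, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        self.router = nn.Linear(dense_ffn.wi.in_features, num_experts, bias=False)

    def forward(self, x):                        # x: (num_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        top_p, top_e = probs.max(dim=-1)         # route each token to its top-1 expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_e == e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

dense = DenseFFN()                               # pretend this came from a dense checkpoint
moe = UpcycledMoE(dense, num_experts=8)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)                         # torch.Size([16, 512])
```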