Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- URL: http://arxiv.org/abs/2305.01645v3
- Date: Wed, 5 Jul 2023 20:41:59 GMT
- Title: Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
- Authors: Junmo Kang, Wei Xu, Alan Ritter
- Abstract summary: Fine-tuning large models is highly effective; however, inference is expensive and produces carbon emissions.
We show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data.
We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
- Score: 19.464992602919015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large models is highly effective; however, inference is
expensive and produces carbon emissions. Knowledge distillation has been shown
to be a practical solution to reduce inference costs, but the distillation
process itself requires significant computational resources. Rather than buying
or renting GPUs to fine-tune, then distill a large model, an NLP practitioner
might instead choose to allocate the available budget to hire annotators and
manually label additional fine-tuning data. In this paper, we investigate how
to most efficiently use a fixed budget to build a compact model. Through
extensive experiments on six diverse tasks, we show that distilling from T5-XXL
(11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to
annotating more data to directly train a compact model (T5-Small). We further
investigate how the optimal budget allocated towards computation varies across
scenarios. We will make our code, datasets, annotation cost estimates, and
baseline models available as a benchmark to support further work on
cost-efficient training of compact models.
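To make the distillation route concrete, the sketch below walks through the general workflow the paper weighs against extra annotation: fine-tune a T5-XXL teacher on the available labeled data, use it to pseudo-label unlabeled task inputs, and train T5-Small on those pseudo-labels. This is a minimal sketch assuming sequence-level (hard-label) distillation with Hugging Face transformers; the checkpoints, the example input, and the hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: distill a fine-tuned T5-XXL teacher into T5-Small via
# sequence-level (hard-label) distillation. Checkpoints, the toy input, and
# hyperparameters are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Teacher: a T5-XXL (11B) assumed to be already fine-tuned on the annotated
#    task data. In practice it would be loaded in reduced precision or sharded.
teacher_name = "google/t5-v1_1-xxl"          # placeholder for the fine-tuned teacher
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name).to(device).eval()

# 2) Pseudo-label unlabeled task inputs with the teacher.
unlabeled_inputs = ["sst2 sentence: a charming, quietly moving portrait."]
with torch.no_grad():
    enc = tokenizer(unlabeled_inputs, return_tensors="pt", padding=True).to(device)
    pseudo_ids = teacher.generate(**enc, max_new_tokens=8)
pseudo_labels = tokenizer.batch_decode(pseudo_ids, skip_special_tokens=True)

# 3) Fine-tune the compact student (60M) on the (input, pseudo-label) pairs.
student = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small").to(device).train()
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)

batch = tokenizer(unlabeled_inputs, return_tensors="pt", padding=True).to(device)
labels = tokenizer(pseudo_labels, return_tensors="pt", padding=True).input_ids.to(device)
labels[labels == tokenizer.pad_token_id] = -100   # mask padding out of the loss

loss = student(**batch, labels=labels).loss        # one illustrative training step
loss.backward()
optimizer.step()
```

Whether this pipeline or simply paying annotators to label more data is the better use of a fixed budget is the trade-off the paper quantifies across its six tasks.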
Related papers
- Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning [18.381178799923514]
Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT).
We study how to optimally allocate a fixed training data budget between the two stages.
arXiv Detail & Related papers (2025-02-16T21:57:35Z) - uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes [34.947522647009436]
We show that our best distilled models outperform the teacher model by 5-7 WER points and are on par with or outperform similar supervised data-filtering setups.
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z) - Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS+ achieves an accuracy of 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism for training small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint (a minimal sketch of this initialization appears after this list).
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost, while still maintaining 95% of model quality compared to baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
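As referenced in the sparse upcycling entry above, the following is a minimal PyTorch sketch of that initialization idea: every expert in a sparsely activated MoE layer starts as a copy of an existing dense feed-forward block, and only the router is trained from scratch. Layer sizes, top-1 routing, and the module names are illustrative assumptions, not the exact recipe from that paper.

```python
# Sketch of sparse upcycling: initialize each MoE expert from a dense FFN.
# Sizes and top-1 (switch-style) routing are illustrative assumptions.
import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """A dense feed-forward block, standing in for one from a trained dense checkpoint."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(torch.relu(self.wi(x)))

class UpcycledMoE(nn.Module):
    """MoE layer whose experts all start as copies of the dense FFN; only the router is new."""
    def __init__(self, dense_ffn, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        self.router = nn.Linear(dense_ffn.wi.in_features, num_experts, bias=False)

    def forward(self, x):                        # x: (num_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        top_p, top_e = probs.max(dim=-1)         # route each token to its top-1 expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_e == e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

dense = DenseFFN()                               # pretend this came from a dense checkpoint
moe = UpcycledMoE(dense, num_experts=8)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)                         # torch.Size([16, 512])
```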