Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes
- URL: http://arxiv.org/abs/2305.02301v2
- Date: Wed, 5 Jul 2023 16:59:31 GMT
- Title: Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes
- Authors: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa
Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
- Abstract summary: We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
- Score: 91.58845026796149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying large language models (LLMs) is challenging because they are memory
inefficient and compute-intensive for practical applications. In reaction,
researchers train smaller task-specific models by either finetuning with human
labels or distilling using LLM-generated labels. However, finetuning and
distillation require large amounts of training data to achieve comparable
performance to LLMs. We introduce Distilling step-by-step, a new mechanism that
(a) trains smaller models that outperform LLMs, and (b) does so while requiring
less training data than finetuning or distillation. Our method
extracts LLM rationales as additional supervision for training small models
within a multi-task framework. We present three findings across 4 NLP
benchmarks: First, compared to both finetuning and distillation, our mechanism
achieves better performance with far fewer labeled/unlabeled training
examples. Second, compared to few-shot prompted LLMs, we achieve better
performance using substantially smaller model sizes. Third, we reduce both the
model size and the amount of data required to outperform LLMs; our finetuned
770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80%
of the available data on a benchmark, whereas standard finetuning of the same T5
model struggles to match it even when using 100% of the dataset. We release the code at:
https://github.com/google-research/distilling-step-by-step .
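To make the abstract's multi-task setup concrete, here is a minimal sketch of how LLM-extracted rationales can act as a second supervision signal for a small student model. It assumes a T5 student trained with two task prefixes and a weighted sum of label and rationale losses; the prefix strings ("[label]", "[rationale]"), the mixing weight, and the toy data are illustrative assumptions, not the released implementation (see the repository above for the authors' code).
```python
# Illustrative sketch (not the official implementation): train a small T5
# student with two objectives -- predict the task label and reproduce an
# LLM-generated rationale -- combined as a weighted sum.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

RATIONALE_WEIGHT = 0.5  # assumed value for the rationale-loss weight

# Toy batch: each example has an input, a label, and an LLM-extracted rationale.
batch = [{
    "input": "premise: A man plays guitar. hypothesis: A person makes music.",
    "label": "entailment",
    "rationale": "Playing guitar is a way of making music, so the hypothesis follows.",
}]

def seq2seq_loss(prefix: str, inputs, targets) -> torch.Tensor:
    """Cross-entropy loss for one task prefix (standard seq2seq training)."""
    enc = tokenizer([f"{prefix} {x}" for x in inputs],
                    return_tensors="pt", padding=True, truncation=True)
    tgt = tokenizer(targets, return_tensors="pt", padding=True, truncation=True)
    labels = tgt.input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    return model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels).loss

inputs = [ex["input"] for ex in batch]
label_loss = seq2seq_loss("[label]", inputs, [ex["label"] for ex in batch])
rationale_loss = seq2seq_loss("[rationale]", inputs, [ex["rationale"] for ex in batch])

loss = label_loss + RATIONALE_WEIGHT * rationale_loss  # multi-task objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"label loss: {label_loss.item():.3f}  rationale loss: {rationale_loss.item():.3f}")
```
At inference time only the label prefix would be used, so rationales are needed only during training.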
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling [16.655022975392992]
We show that it is possible to distill Whisper models into relatively small models without using any labeled data.
Our models are 25-50% more compute and memory efficient while maintaining performance equal to or better than the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z)
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) for diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
- Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting and employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates into an enhanced output.
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models that complement the uncertainty of the state-of-the-art model.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.