Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes
- URL: http://arxiv.org/abs/2305.02301v2
- Date: Wed, 5 Jul 2023 16:59:31 GMT
- Title: Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes
- Authors: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa
Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
- Abstract summary: We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
- Score: 91.58845026796149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying large language models (LLMs) is challenging because they are memory
inefficient and compute-intensive for practical applications. In reaction,
researchers train smaller task-specific models by either finetuning with human
labels or distilling using LLM-generated labels. However, finetuning and
distillation require large amounts of training data to achieve comparable
performance to LLMs. We introduce Distilling step-by-step, a new mechanism that
(a) trains smaller models that outperform LLMs, and (b) does so while requiring
less training data than finetuning or distillation. Our method
extracts LLM rationales as additional supervision for training small models
within a multi-task framework. We present three findings across 4 NLP
benchmarks: First, compared to both finetuning and distillation, our mechanism
achieves better performance with much fewer labeled/unlabeled training
examples. Second, compared to few-shot prompted LLMs, we achieve better
performance using substantially smaller model sizes. Third, we reduce both the
model size and the amount of data required to outperform LLMs; our finetuned
770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80%
of available data on a benchmark, whereas standard finetuning of the same T5 model
struggles to match it even when using 100% of the dataset. We release the code at:
https://github.com/google-research/distilling-step-by-step .
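The abstract describes the mechanism only at a high level: LLM-generated rationales serve as extra supervision for a small seq2seq student trained in a multi-task setup. The sketch below is a minimal, hedged illustration of such an objective, assuming a T5 student, rationales already extracted from the teacher LLM via chain-of-thought prompting, and an illustrative rationale-loss weight; the task-prefix strings, helper names, and hyperparameters are assumptions for illustration, not taken from the paper or the released code.

import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
student = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
lambda_rationale = 0.5  # illustrative weight on the rationale loss (assumption)

def seq2seq_loss(source_text, target_text):
    # Teacher-forced cross-entropy for one (input, output) pair.
    enc = tokenizer(source_text, return_tensors="pt", truncation=True)
    labels = tokenizer(target_text, return_tensors="pt", truncation=True).input_ids
    return student(**enc, labels=labels).loss

def distill_step(example):
    # `example` is assumed to provide the task input, its label (human- or
    # LLM-generated), and an LLM rationale obtained beforehand via
    # chain-of-thought prompting of the teacher.
    label_loss = seq2seq_loss("[label] " + example["input"], example["label"])
    rationale_loss = seq2seq_loss("[rationale] " + example["input"],
                                  example["rationale"])
    loss = label_loss + lambda_rationale * rationale_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

As the abstract suggests, rationale generation serves only as a training signal; at inference the student runs just the label-prediction task, so the extra supervision adds no deployment cost.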
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With the proper strategy, evaluated across different benchmarks, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes [34.947522647009436]
We show that it is possible to distill large Whisper models into relatively small ones without using any labeled data.
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z)
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
- Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection sampling from multiple models pushes LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output.
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.