Baby Llama: knowledge distillation from an ensemble of teachers trained
on a small dataset with no performance penalty
- URL: http://arxiv.org/abs/2308.02019v2
- Date: Tue, 24 Oct 2023 17:58:42 GMT
- Title: Baby Llama: knowledge distillation from an ensemble of teachers trained
on a small dataset with no performance penalty
- Authors: Inar Timiryasov and Jean-Loup Tastet
- Abstract summary: We trained an ensemble consisting of a GPT-2 and small LLaMA models on a developmentally-plausible, 10M-word BabyLM dataset.
We distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present our submission to the BabyLM challenge, whose goal was to improve
the sample efficiency of language models. We trained an ensemble consisting of
a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word
BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model,
which exceeds in performance both of its teachers as well as a similar model
trained without distillation. This suggests that distillation can not only
retain the full performance of the teacher model when the latter is trained on
a sufficiently small dataset; it can exceed it, and lead to significantly
better performance than direct training.
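The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to distill from an ensemble of teachers: a Hinton-style loss that mixes hard-label cross-entropy with the KL divergence between the averaged teacher distribution and the student distribution. The function names, the `alpha` mixing weight, the `temperature` value, and the choice to average teacher probabilities are all illustrative assumptions, not details confirmed by the source.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list, target,
                               alpha=0.5, temperature=2.0):
    """Hypothetical distillation loss for one token position.

    alpha and temperature are assumed hyperparameters, not values
    taken from the source.
    """
    # Hard-label term: cross-entropy of the student against the true token.
    hard_loss = -math.log(softmax(student_logits)[target])

    # Soft-label term: average the teachers' softened distributions.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n = len(teacher_probs)
    ensemble = [sum(p[i] for p in teacher_probs) / n
                for i in range(len(student_logits))]

    # KL(ensemble || student), with the student softened at the same temperature.
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(q * math.log(q / p)
                    for q, p in zip(ensemble, student_soft) if q > 0)

    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss
```

As a usage example, `ensemble_distillation_loss([1.0, 2.0, 0.5], [[0.8, 2.2, 0.1], [1.5, 1.9, 0.7]], target=1)` scores one token position against two hypothetical teachers. In a real training loop the same quantity would be computed over batches of token positions with an autodiff framework.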
Related papers
- Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
Self-improving ability of large language models has been shown to be absent and difficult to learn for smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve the performance of a LLaMA-7b model on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing [59.58984194238254]
We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization.
Unlike prior works that rely on an extreme-scale teacher model, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs.
By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs.
arXiv Detail & Related papers (2023-05-26T05:19:24Z)
- DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining [23.418419374791107]
DisCo is a semi-supervised learning framework for fine-tuning a cohort of small student models generated from a large PLM.
We show that DisCo can produce student models that are 7.6 times smaller and 4.8 times faster in inference than the baseline PLMs.
arXiv Detail & Related papers (2023-05-20T03:23:16Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models.
Our solution performs distillation by only driving prediction of the student model consistent with that of the teacher model.
We empirically find that such simple distillation settings are extremely effective; for example, the top-1 accuracy on the ImageNet-1k validation set of MobileNetV3-large and ResNet50-D can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model in the whole distillation.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better-performing distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer-based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)