BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers
With Limited Data
- URL: http://arxiv.org/abs/2409.17312v1
- Date: Wed, 25 Sep 2024 19:46:49 GMT
- Title: BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers
With Limited Data
- Authors: Jean-Loup Tastet, Inar Timiryasov
- Abstract summary: We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition.
On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BabyLlama-2, a 345 million parameter model distillation-pretrained
from two teachers on a 10 million word corpus for the BabyLM competition. On
BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on
both 10 and 100 million word datasets with the same data mix, as well as its
teacher models. Through an extensive hyperparameter sweep, we demonstrate that
the advantages of distillation cannot be attributed to suboptimal
hyperparameter selection of the teachers. Our findings underscore the need for
further investigation into distillation techniques, particularly in
data-limited settings.
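To make the training recipe concrete, here is a minimal sketch of ensemble knowledge distillation for a causal language model: the student minimizes a hard-label cross-entropy term plus a KL term against the averaged, temperature-softened distribution of its teachers. The function, the loss weight alpha, and the temperature are illustrative assumptions rather than the authors' exact training code.

```python
# Minimal sketch of ensemble knowledge distillation for causal LM pretraining.
# Hypothetical example: alpha, temperature, and the tensor shapes are assumptions,
# not the exact BabyLlama-2 training configuration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      alpha=0.5, temperature=2.0):
    """Hard-label cross-entropy plus KL to the averaged teacher distribution."""
    # Standard next-token cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))

    # Average the teachers' temperature-softened probabilities (the ensemble target).
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between student and ensemble distributions, scaled by T^2.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```

In practice the teacher forward passes would run under torch.no_grad(), with teacher_logits_list holding the outputs of the two teacher models mentioned in the abstract.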
Related papers
- Baby Llama: knowledge distillation from an ensemble of teachers trained
on a small dataset with no performance penalty [0.0]
We trained an ensemble consisting of a GPT-2 and small LLaMA models on a developmentally-plausible, 10M-word BabyLM dataset.
We distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation.
arXiv Detail & Related papers (2023-08-03T20:20:01Z)
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing [59.58984194238254]
We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization.
Unlike prior works that rely on an extreme-scale teacher model, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs.
By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs.
arXiv Detail & Related papers (2023-05-26T05:19:24Z)
- Accurate Knowledge Distillation with n-best Reranking [2.9526110883017433]
We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016).
We leverage a diverse set of models with different inductive biases, objective functions or architectures, including some publicly-available large language models, to pick the highest-quality hypotheses as labels.
Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model.
arXiv Detail & Related papers (2023-05-20T01:53:03Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z)
- Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
arXiv Detail & Related papers (2022-04-01T16:15:39Z)
- ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression [20.23732233214849]
We propose ERNIE-Tiny, a four-stage progressive distillation framework for compressing pretrained language models (PLMs).
Experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT Base on the GLUE benchmark.
ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT Base by 0.4% accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
arXiv Detail & Related papers (2021-06-04T04:00:16Z)
- Distilling Double Descent [65.85258126760502]
Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model.
We show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset can result in a model that outperforms more "traditional" approaches.
arXiv Detail & Related papers (2021-02-13T02:26:48Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer-based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
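As a rough illustration of the MiniLM entry above, the sketch below distills only the last-layer self-attention distributions by minimizing the KL divergence from teacher to student. It is a simplified, assumption-laden reading of the approach: the value-relation transfer and the handling of mismatched head counts are omitted, and the function and argument names are hypothetical.

```python
# Minimal sketch of self-attention distillation in the spirit of MiniLM.
# Hypothetical: only the attention-distribution term for the last layer is shown.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_q, student_k, teacher_q, teacher_k):
    """KL between teacher and student last-layer self-attention distributions.

    All inputs have shape (batch, heads, seq_len, head_dim); teacher and student
    are assumed to use the same number of attention heads in this sketch.
    """
    # Scaled dot-product attention scores for student and teacher.
    student_scores = student_q @ student_k.transpose(-1, -2) / student_q.size(-1) ** 0.5
    teacher_scores = teacher_q @ teacher_k.transpose(-1, -2) / teacher_q.size(-1) ** 0.5

    # KL(teacher || student), summed over keys and averaged over batch, heads, queries.
    return F.kl_div(F.log_softmax(student_scores, dim=-1),
                    F.softmax(teacher_scores, dim=-1),
                    reduction="none").sum(-1).mean()
```

In a real pipeline the query and key tensors would come from the last Transformer layer of each model, with the teacher side computed under torch.no_grad().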