Lifting the Curse of Capacity Gap in Distilling Language Models
- URL: http://arxiv.org/abs/2305.12129v1
- Date: Sat, 20 May 2023 07:30:55 GMT
- Title: Lifting the Curse of Capacity Gap in Distilling Language Models
- Authors: Chen Zhang, Yang Yang, Jiahao Liu, Jingang Wang, Yunsen Xian, Benyou
Wang, Dawei Song
- Abstract summary: We propose a mixture of minimal experts (MiniMoE) which imposes extra parameters to the student but introduces almost no additional inference compute.
With a compression rate as much as $\sim$50$\times$, MiniMoE preserves $\sim$95\% GLUE score of the teacher.
- Score: 19.370268407987652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (LMs) have shown compelling performance on various
downstream tasks, but unfortunately they require a tremendous amount of
inference compute. Knowledge distillation finds a path to compress LMs to small
ones with a teacher-student paradigm. However, when the capacity gap between
the teacher and the student is large, a curse of capacity gap appears, invoking
a deficiency in distilling LMs. While a few studies have been carried out to
fill the gap, the curse is not yet well tackled. In this paper, we aim at
lifting the curse of capacity gap via enlarging the capacity of the student
without notably increasing the inference compute. Largely motivated by sparse
activation regime of mixture of experts (MoE), we propose a mixture of minimal
experts (MiniMoE), which imposes extra parameters to the student but introduces
almost no additional inference compute. Experimental results on GLUE and CoNLL
demonstrate the curse of capacity gap is lifted by the magic of MiniMoE to a
large extent. MiniMoE also achieves the state-of-the-art performance at small
FLOPs compared with a range of competitive baselines. With a compression rate
as much as $\sim$50$\times$, MiniMoE preserves $\sim$95\% GLUE score of the
teacher.
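As a rough illustration of the sparse-activation idea the abstract describes, the sketch below replaces a student feed-forward block with a handful of small experts and a top-1 router, and trains it against teacher logits with a generic softened-KL distillation loss. The expert size, router design, top-1 routing, and the loss are assumptions for illustration only, not the authors' actual MiniMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MinimalExpert(nn.Module):
    """A deliberately small feed-forward expert (hidden size is an assumption)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(x)


class SparseMiniExpertLayer(nn.Module):
    """Mixture of small experts with top-1 routing.

    Total parameters grow with num_experts, but each token runs through
    only one expert, so per-token inference compute stays close to that
    of a single small FFN: the sparse-activation idea in the abstract.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [MinimalExpert(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        expert_idx = self.router(x).argmax(dim=-1)  # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                  # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Generic softened-KL knowledge-distillation loss (not necessarily the
    exact training objective used in the paper)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In this sketch, enlarging `num_experts` raises the student's parameter count (capacity) while leaving the per-token forward cost roughly constant, which is the trade-off the abstract targets.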
Related papers
- MiniPLM: Knowledge Distillation for Pre-Training Language Models [109.83741809808483]
MiniPLM is a KD framework for pre-training student language models.
For efficiency, MiniPLM performs offline teacher LM inference, allowing KD for multiple student LMs without adding training-time costs.
For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families.
arXiv Detail & Related papers (2024-10-22T17:40:32Z) - PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z) - NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models [2.9449838351181374]
We propose an efficient progressive Numerous-teacher pruning method (NutePrune).
NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules.
In LLaMA-7B experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity.
arXiv Detail & Related papers (2024-02-15T08:03:12Z) - Towards the Law of Capacity Gap in Distilling Language Models [13.630180187069904]
Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM to a small student one.
MiniMA is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
arXiv Detail & Related papers (2023-11-13T03:36:18Z) - Democratizing Reasoning Ability: Tailored Learning from Large Language
Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z) - MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z) - DisCo: Distilled Student Models Co-training for Semi-supervised Text
Mining [23.418419374791107]
DisCo is a semi-supervised learning framework for fine-tuning a cohort of small student models generated from a large PLM.
We show that DisCo can produce student models that are 7.6 times smaller and 4.8 times faster in inference than the baseline PLMs.
arXiv Detail & Related papers (2023-05-20T03:23:16Z) - Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity
Matching [12.833080411053842]
Cross-lingual knowledge distillation can significantly improve the performance of pre-trained models for cross-lingual similarity matching tasks.
We propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model.
Our method can compress the size of XLM-R and MiniLM by more than 50%, while the performance is only reduced by about 1%.
arXiv Detail & Related papers (2022-09-13T10:33:04Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE models are hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z) - Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the gap in confidence between teacher and student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it (sketched below).
arXiv Detail & Related papers (2020-10-15T03:03:36Z)
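The last entry above argues that the student should not be forced to imitate the teacher's confidence magnitude. One illustrative way to decouple magnitude from the distilled signal, which is an assumption about the mechanism rather than the paper's exact "spherical" formulation, is to center and norm-rescale both logit vectors before the softened KL:

```python
import torch
import torch.nn.functional as F


def magnitude_invariant_kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Softened-KL distillation computed on rescaled logits.

    Centering each logit vector and projecting it onto the unit sphere
    discards the confidence-magnitude component that, per the abstract,
    the student need not imitate. Illustrative only; not the paper's
    exact formulation.
    """
    def rescale(z: torch.Tensor) -> torch.Tensor:
        z = z - z.mean(dim=-1, keepdim=True)
        return z / (z.norm(dim=-1, keepdim=True) + 1e-8)

    s = rescale(student_logits)
    t = rescale(teacher_logits)
    tau = temperature
    return F.kl_div(
        F.log_softmax(s / tau, dim=-1),
        F.softmax(t / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
```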