Co-training and Co-distillation for Quality Improvement and Compression
of Language Models
- URL: http://arxiv.org/abs/2311.02849v2
- Date: Tue, 7 Nov 2023 18:41:55 GMT
- Title: Co-training and Co-distillation for Quality Improvement and Compression
of Language Models
- Authors: Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju
Hwang, Alexander Min
- Abstract summary: Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed for inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
- Score: 88.94539115180919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) compresses computationally expensive pre-trained
language models (PLMs) by transferring their knowledge to smaller models,
allowing their use in resource-constrained or real-time settings. However, most
smaller models fail to surpass the performance of the original larger model, so
performance is sacrificed for inference speed. To address
this issue, we propose Co-Training and Co-Distillation (CTCD), a novel
framework that improves performance and inference speed together by co-training
two models while mutually distilling knowledge. The CTCD framework successfully
achieves this based on two significant findings: 1) Distilling knowledge from
the smaller model to the larger model during co-training improves the
performance of the larger model. 2) The enhanced performance of the larger
model further boosts the performance of the smaller model. The CTCD framework
shows promise as it can be combined with existing techniques like architecture
design or data augmentation, replacing one-way KD methods, to achieve further
performance improvement. Extensive ablation studies demonstrate the
effectiveness of CTCD, and the small model distilled by CTCD outperforms the
original larger model by a significant margin of 1.66 on the GLUE benchmark.
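To make the co-training idea concrete, below is a minimal PyTorch-style sketch of one two-way distillation step, assuming two classification models (`large_model`, `small_model`), a shared optimizer, and a labelled batch `(x, y)`; the loss weights `alpha`/`beta` and the temperature `T` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of two-way (mutual) knowledge distillation during co-training.
# Assumptions (not taken from the paper): both models are classifiers that
# return logits; alpha, beta, and T are illustrative hyperparameters.
import torch.nn.functional as F

def soft_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened distributions (teacher side detached)."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def co_distillation_step(large_model, small_model, optimizer, x, y, alpha=0.5, beta=0.5):
    """One co-training step: each model learns from the labels and from the other model."""
    logits_l = large_model(x)
    logits_s = small_model(x)

    # Task losses on the ground-truth labels.
    ce_l = F.cross_entropy(logits_l, y)
    ce_s = F.cross_entropy(logits_s, y)

    # Two-way distillation: small -> large and large -> small.
    kd_l = soft_kd_loss(logits_l, logits_s)  # large model also learns from the small one
    kd_s = soft_kd_loss(logits_s, logits_l)  # small model learns from the large one

    loss = (1 - alpha) * ce_l + alpha * kd_l + (1 - beta) * ce_s + beta * kd_s
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the "teacher" side is detached in each direction, each model's gradients come from its own task loss plus the distillation signal provided by the other model, which is the two-way transfer the abstract credits for improving both the larger and the smaller model.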
Related papers
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- Stable Consistency Tuning: Understanding and Improving Consistency Models [40.2712218203989]
Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising.
Consistency models, a newer generative family, achieve competitive performance with significantly faster sampling.
We propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as value estimation through Temporal Difference (TD) learning.
arXiv Detail & Related papers (2024-10-24T17:55:52Z)
- Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods such as distillation, compression, and quantization help leverage highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs (a hedged sketch of a skewed KL term appears after this list).
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first identify the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- One-stop Training of Multiple Capacity Models [74.87789190840527]
We propose a novel one-stop training framework to jointly train high-capacity and low-capacity models.
Unlike knowledge distillation, where models of different capacities are trained from scratch separately, our approach integrates supervision from models of different capacities simultaneously.
arXiv Detail & Related papers (2023-05-23T13:44:09Z)
- DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings [36.37939188680754]
We propose an enhanced knowledge distillation framework termed Distill-Contrast (DisCo).
DisCo transfers the capability of a large sentence embedding model to a small student model on large unlabelled data.
We also propose Contrastive Knowledge Distillation (CKD) to enhance the consistencies among teacher model training, KD, and student model finetuning.
arXiv Detail & Related papers (2021-12-10T16:11:23Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
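As a side note on the DistiLLM entry above, a skewed KL divergence term can be sketched as follows; the mixing convention and the parameter name `lam` are assumptions made for illustration and may differ from that paper's exact definition.

```python
# Hedged sketch of a skewed KL divergence between two probability distributions
# p (e.g., teacher) and q (e.g., student): q is mixed with p before computing KL,
# which keeps the divergence finite where q assigns near-zero mass.
import torch

def skew_kl(p, q, lam=0.1, eps=1e-8):
    """KL(p || lam * p + (1 - lam) * q) over the last dimension, averaged over the batch."""
    m = lam * p + (1.0 - lam) * q
    return (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1).mean()
```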
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.