Autoregressive Knowledge Distillation through Imitation Learning
- URL: http://arxiv.org/abs/2009.07253v2
- Date: Thu, 29 Oct 2020 00:40:45 GMT
- Title: Autoregressive Knowledge Distillation through Imitation Learning
- Authors: Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei
- Abstract summary: We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain scores 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
- Score: 70.12862707908769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of autoregressive models on natural language generation tasks
has dramatically improved due to the adoption of deep, self-attentive
architectures. However, these gains have come at the cost of hindering
inference speed, making state-of-the-art models cumbersome to deploy in
real-world, time-sensitive settings. We develop a compression technique for
autoregressive models that is driven by an imitation learning perspective on
knowledge distillation. The algorithm is designed to address the exposure bias
problem. On prototypical language generation tasks such as translation and
summarization, our method consistently outperforms other distillation
algorithms, such as sequence-level knowledge distillation. Student models
trained with our method attain scores 1.4 to 4.8 BLEU/ROUGE points higher than those
trained from scratch, while increasing inference speed by up to 14 times in
comparison to the teacher model.
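To make the imitation-learning view concrete, below is a minimal sketch in PyTorch of how token-level distillation on student-generated prefixes can target exposure bias: the student rolls out its own partial outputs and is trained to match the teacher's next-token distributions at those student-visited states. The model interface (`teacher(src, prefix)` and `student(src, prefix)` returning per-position logits), the pure-student sampling policy, and all names are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of imitation-learning-style distillation for an
# autoregressive student (assumed interfaces; not the authors' code).
import torch
import torch.nn.functional as F


def distill_step(teacher, student, src, bos_id, max_len, optimizer):
    """One update: roll out the student, query the teacher on the
    student's own prefixes, and match the teacher's distributions."""
    batch = src.size(0)
    prefix = torch.full((batch, 1), bos_id, dtype=torch.long, device=src.device)

    student_logits, teacher_logits = [], []
    for _ in range(max_len):
        # The student proposes the next token from its *own* prefix;
        # supervising these student-visited states is what addresses
        # exposure bias, unlike purely teacher-forced distillation.
        s_logits = student(src, prefix)[:, -1, :]   # assumed shape (batch, vocab)
        with torch.no_grad():
            t_logits = teacher(src, prefix)[:, -1, :]
        student_logits.append(s_logits)
        teacher_logits.append(t_logits)
        next_tok = torch.multinomial(F.softmax(s_logits.detach(), dim=-1), 1)
        prefix = torch.cat([prefix, next_tok], dim=1)

    s = torch.stack(student_logits, dim=1)          # (batch, max_len, vocab)
    t = torch.stack(teacher_logits, dim=1)
    # Token-level KL between teacher and student on the sampled prefixes.
    loss = F.kl_div(F.log_softmax(s, dim=-1),
                    F.softmax(t, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full implementation along these lines would typically mix teacher-forced and student-generated prefixes on a schedule and batch the rollout more efficiently; the loop above trades efficiency for readability.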
Related papers
- MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation [17.27883003990266]
Vision-and-Language Navigation (VLN) is a core task in Embodied AI.
This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN.
Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model.
arXiv Detail & Related papers (2024-09-27T14:54:54Z)
- Continual Learning of Diffusion Models with Generative Distillation [34.52513912701778]
Diffusion models are powerful generative models that achieve state-of-the-art performance in image synthesis.
In this paper, we propose generative distillation, an approach that distils the entire reverse process of a diffusion model.
arXiv Detail & Related papers (2023-11-23T14:33:03Z)
- Education distillation:getting student models to learn in shcools [15.473668050280304]
This paper introduces dynamic incremental learning into knowledge distillation.
It proposes treating fragmented student models, divided from the complete student model, as lower-grade models.
arXiv Detail & Related papers (2023-11-23T05:20:18Z)
- Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few.
We present BOOT, a novel technique that overcomes the limitations of existing distillation approaches with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation [135.84684279852098]
Non-Autoregressive (NAR) models significantly underperform Autoregressive (AR) models on various language generation tasks.
Among NAR models, BANG is the first large-scale model pre-trained on an unlabeled English raw-text corpus.
We propose a novel self-paced mixed distillation method to further improve the generation quality of BANG.
arXiv Detail & Related papers (2022-05-23T09:54:53Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the whole distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.