Autoregressive Knowledge Distillation through Imitation Learning
- URL: http://arxiv.org/abs/2009.07253v2
- Date: Thu, 29 Oct 2020 00:40:45 GMT
- Title: Autoregressive Knowledge Distillation through Imitation Learning
- Authors: Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei
- Abstract summary: We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain scores 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
- Score: 70.12862707908769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of autoregressive models on natural language generation tasks
has dramatically improved due to the adoption of deep, self-attentive
architectures. However, these gains have come at the cost of hindering
inference speed, making state-of-the-art models cumbersome to deploy in
real-world, time-sensitive settings. We develop a compression technique for
autoregressive models that is driven by an imitation learning perspective on
knowledge distillation. The algorithm is designed to address the exposure bias
problem. On prototypical language generation tasks such as translation and
summarization, our method consistently outperforms other distillation
algorithms, such as sequence-level knowledge distillation. Student models
trained with our method attain scores 1.4 to 4.8 BLEU/ROUGE points higher than those
trained from scratch, while increasing inference speed by up to 14 times in
comparison to the teacher model.
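To make the imitation-learning view concrete, below is a minimal sketch in PyTorch of how token-level distillation on student-generated prefixes can target exposure bias: the student rolls out its own partial outputs and is trained to match the teacher's next-token distributions at those student-visited states. The model interface (`teacher(src, prefix)` and `student(src, prefix)` returning per-position logits), the pure-student sampling policy, and all names are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of imitation-learning-style distillation for an
# autoregressive student (assumed interfaces; not the authors' code).
import torch
import torch.nn.functional as F


def distill_step(teacher, student, src, bos_id, max_len, optimizer):
    """One update: roll out the student, query the teacher on the
    student's own prefixes, and match the teacher's distributions."""
    batch = src.size(0)
    prefix = torch.full((batch, 1), bos_id, dtype=torch.long, device=src.device)

    student_logits, teacher_logits = [], []
    for _ in range(max_len):
        # The student proposes the next token from its *own* prefix;
        # supervising these student-visited states is what addresses
        # exposure bias, unlike purely teacher-forced distillation.
        s_logits = student(src, prefix)[:, -1, :]   # assumed shape (batch, vocab)
        with torch.no_grad():
            t_logits = teacher(src, prefix)[:, -1, :]
        student_logits.append(s_logits)
        teacher_logits.append(t_logits)
        next_tok = torch.multinomial(F.softmax(s_logits.detach(), dim=-1), 1)
        prefix = torch.cat([prefix, next_tok], dim=1)

    s = torch.stack(student_logits, dim=1)          # (batch, max_len, vocab)
    t = torch.stack(teacher_logits, dim=1)
    # Token-level KL between teacher and student on the sampled prefixes.
    loss = F.kl_div(F.log_softmax(s, dim=-1),
                    F.softmax(t, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full implementation along these lines would typically mix teacher-forced and student-generated prefixes on a schedule and batch the rollout more efficiently; the loop above trades efficiency for readability.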
Related papers
- MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation [17.27883003990266]
Vision-and-Language Navigation (VLN) is a core task in Embodied AI.
This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN.
Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model.
arXiv Detail & Related papers (2024-09-27T14:54:54Z)
- Continual Learning of Diffusion Models with Generative Distillation [34.52513912701778]
Diffusion models are powerful generative models that achieve state-of-the-art performance in image synthesis.
In this paper, we propose generative distillation, an approach that distils the entire reverse process of a diffusion model.
arXiv Detail & Related papers (2023-11-23T14:33:03Z)
- Education distillation:getting student models to learn in shcools [15.473668050280304]
This paper introduces dynamic incremental learning into knowledge distillation.
It proposes treating fragmented student models, divided from the complete student model, as lower-grade models.
arXiv Detail & Related papers (2023-11-23T05:20:18Z)
- Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few.
We present BOOT, a novel technique that overcomes the limitations of existing distillation approaches with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation [135.84684279852098]
Non-Autoregressive (NAR) models significantly underperform Autoregressive (AR) models on various language generation tasks.
Among NAR models, BANG is the first large-scale model pre-trained on an unlabeled English raw-text corpus.
We propose a novel self-paced mixed distillation method to further improve the generation quality of BANG.
arXiv Detail & Related papers (2022-05-23T09:54:53Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the whole distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.