A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation
- URL: http://arxiv.org/abs/2205.11162v1
- Date: Mon, 23 May 2022 09:54:53 GMT
- Title: A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation
- Authors: Weizhen Qi, Yeyun Gong, Yelong Shen, Jian Jiao, Yu Yan, Houqiang Li,
Ruofei Zhang, Weizhu Chen, Nan Duan
- Abstract summary: Non-Autoregressive (NAR) models significantly under-perform Auto-regressive (AR) models on various language generation tasks.
Among the NAR models, BANG is the first large-scale pre-training model on an unlabeled English raw text corpus.
We propose a novel self-paced mixed distillation method to further improve the generation quality of BANG.
- Score: 135.84684279852098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-Autoregressive generation is a sequence generation paradigm that
removes the dependency between target tokens. It can substantially reduce text
generation latency by using parallel decoding in place of token-by-token
sequential decoding. However, due to the well-known multi-modality problem,
Non-Autoregressive (NAR) models significantly under-perform Auto-regressive
(AR) models on various language generation tasks. Among NAR models, BANG is
the first model pre-trained at large scale on an unlabeled English raw text corpus.
It treats different generation paradigms as its pre-training tasks, including
Auto-regressive (AR), Non-Autoregressive (NAR), and semi-Non-Autoregressive
(semi-NAR) information flows with a multi-stream strategy, and it achieves
state-of-the-art performance without any distillation techniques. However, AR
distillation has been shown to be a very effective way to improve NAR
performance. In this paper, we propose a novel self-paced mixed distillation
method to further improve the generation quality of BANG. First, we propose
a mixed distillation strategy based on the AR stream knowledge. Second, we use
self-paced learning to encourage the model to focus on samples of the same
modality. The proposed self-paced mixed distillation algorithm improves
generation quality without affecting inference latency. We carry out extensive
experiments on summarization and question generation tasks to validate its
effectiveness. To further illustrate the commercial value of our approach, we
conduct experiments on three generation tasks in real-world advertising
applications. Experimental results on commercial data confirm the effectiveness
of the proposed model: compared with BANG, it achieves significant BLEU score
improvements, and compared with the auto-regressive generation method, it
achieves a speedup of more than 7x.
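The abstract describes the method only at a high level, so the sketch below (plain PyTorch) illustrates how its two ingredients could be combined in a single training step: (1) mixed distillation, which mixes golden targets with targets decoded by the AR teacher, and (2) self-paced learning, which down-weights samples the NAR student cannot yet fit, a rough proxy for samples whose modality differs from the model's current predictions. All names (`mixed_distillation_step`, `lambda_threshold`, `mix_ratio`), the mixing rule, and the hard-threshold weighting are illustrative assumptions, not the authors' BANG implementation.

```python
# A minimal sketch of the self-paced mixed distillation idea, under the
# assumptions stated above; not the authors' actual training code.
import torch
import torch.nn.functional as F


def mixed_distillation_step(student, src, golden_tgt, distilled_tgt,
                            pad_id, lambda_threshold, mix_ratio=0.5):
    """One illustrative training step.

    src:           (B, S) source token ids
    golden_tgt:    (B, T) ground-truth target ids
    distilled_tgt: (B, T) targets decoded by the AR teacher (AR distillation)
    lambda_threshold: self-paced threshold; samples whose loss exceeds it are
        skipped for now and re-enter as the threshold is raised over training.
    """
    batch_size = src.size(0)

    # Mixed distillation: a random subset of the batch is trained on the
    # AR-distilled targets, the rest on the golden targets.
    use_distilled = torch.rand(batch_size, device=src.device) < mix_ratio
    tgt = torch.where(use_distilled.unsqueeze(1), distilled_tgt, golden_tgt)

    # The NAR student predicts every target position in parallel.
    logits = student(src)                                          # (B, T, V)
    token_loss = F.cross_entropy(logits.transpose(1, 2), tgt,
                                 ignore_index=pad_id, reduction="none")  # (B, T)
    lengths = (tgt != pad_id).sum(dim=1).clamp(min=1)
    per_sample_loss = token_loss.sum(dim=1) / lengths

    # Self-paced weighting: keep only samples the student can already fit,
    # i.e. samples whose modality matches the current model.
    weights = (per_sample_loss.detach() <= lambda_threshold).float()
    return (weights * per_sample_loss).sum() / weights.sum().clamp(min=1.0)


if __name__ == "__main__":
    # Toy usage: an embedding + linear head stands in for the BANG student.
    vocab, length = 100, 8
    student = torch.nn.Sequential(torch.nn.Embedding(vocab, 32),
                                  torch.nn.Linear(32, vocab))
    src = torch.randint(1, vocab, (4, length))
    golden = torch.randint(1, vocab, (4, length))
    distilled = torch.randint(1, vocab, (4, length))
    loss = mixed_distillation_step(student, src, golden, distilled,
                                   pad_id=0, lambda_threshold=10.0)
    loss.backward()
```

In a self-paced schedule, `lambda_threshold` would typically be annealed upwards during training so that harder, more multi-modal samples are gradually re-introduced.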
Related papers
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Distilling Autoregressive Models to Obtain High-Performance Non-Autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed [8.184624214651283]
We propose a generic Guided Non-Autoregressive Knowledge Distillation (GNARKD) method to obtain high-performance NAR models having a low inference latency.
We evaluate GNARKD by applying it to three widely adopted AR models to obtain NAR VRP solvers for both synthesized and real-world instances.
arXiv Detail & Related papers (2023-12-19T07:13:32Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes such limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task under the approach of pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables tokens to be generated at different layers, according to their prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
- Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z)
- Autoregressive Knowledge Distillation through Imitation Learning [70.12862707908769]
We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
arXiv Detail & Related papers (2020-09-15T17:43:02Z)
- Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation [39.55651747758391]
We propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models.
RAL also empowers the collaboration between different modules of the VQ-VAE framework.
The proposed method achieves state-of-the-art results on CelebA at 64 $\times$ 64 image resolution.
arXiv Detail & Related papers (2020-07-20T08:10:07Z)
- An EM Approach to Non-autoregressive Conditional Sequence Generation [49.11858479436565]
Autoregressive (AR) models have been the dominant approach to conditional sequence generation.
Non-autoregressive (NAR) models have recently been proposed to reduce the latency by generating all output tokens in parallel.
This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization framework.
arXiv Detail & Related papers (2020-06-29T20:58:57Z)