ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and
Effective Text Generation
- URL: http://arxiv.org/abs/2210.13304v1
- Date: Mon, 24 Oct 2022 14:46:47 GMT
- Title: ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and
Effective Text Generation
- Authors: Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie and Ji-Rong Wen
- Abstract summary: We study the text generation task under the approach of pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables token generation at different layers according to each token's prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
- Score: 97.64625999380425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the text generation task under the approach of pre-trained language
models (PLMs). Typically, an auto-regressive (AR) method is adopted for
generating texts in a token-by-token manner. Despite many advantages of AR
generation, it usually suffers from inefficient inference. Therefore,
non-autoregressive (NAR) models are proposed to generate all target tokens
simultaneously. However, NAR models usually generate texts of lower quality due
to the absence of token dependency in the output text. In this paper, we
propose ELMER: an efficient and effective PLM for NAR text generation to
explicitly model the token dependency during NAR generation. By leveraging the
early exit technique, ELMER enables token generation at different layers
according to each token's prediction confidence (a more confident token will exit at a
lower layer). Besides, we propose a novel pre-training objective, Layer
Permutation Language Modeling, to pre-train ELMER by permuting the exit layer
for each token in sequences. Experiments on three text generation tasks show
that ELMER significantly outperforms NAR models and further narrows the
performance gap with AR PLMs (e.g., ELMER 29.92 vs. BART 30.61 ROUGE-L on
XSUM) while achieving over 10 times inference speedup.
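The page carries no code, so the following is a minimal, hypothetical PyTorch sketch of the confidence-based early-exit decoding the abstract describes: every target position is refined in parallel, and a position commits to a token at the first layer where its prediction confidence crosses a threshold. The class name EarlyExitNARDecoder, the layer sizes, and the 0.9 threshold are illustrative assumptions, not ELMER's released implementation.
```python
# Hypothetical sketch of confidence-based early exit for NAR decoding.
# Names, sizes, and the 0.9 threshold are assumptions, not ELMER's code.
import torch
import torch.nn as nn

class EarlyExitNARDecoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, n_layers=6, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab_size)  # shared exit head for every layer
        self.threshold = threshold

    @torch.no_grad()
    def generate(self, hidden):
        # hidden: (batch, tgt_len, d_model) embeddings of fully masked target
        # positions (source conditioning omitted for brevity); all positions
        # are decoded in parallel, with no left-to-right dependency.
        batch, tgt_len, _ = hidden.shape
        tokens = hidden.new_full((batch, tgt_len), 0, dtype=torch.long)
        exited = hidden.new_zeros((batch, tgt_len), dtype=torch.bool)
        for layer in self.layers:
            hidden = layer(hidden)
            probs = self.lm_head(hidden).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            # a position "exits" at the first layer where its confidence is high enough
            exit_now = (conf >= self.threshold) & ~exited
            tokens[exit_now] = pred[exit_now]
            exited |= exit_now
        # positions that never crossed the threshold exit at the top layer
        tokens[~exited] = pred[~exited]
        return tokens
```
During pre-training, the Layer Permutation Language Modeling objective mentioned in the abstract permutes the exit layer assigned to each token; the thresholded inference loop above is only the decoding-time side of that idea.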
Related papers
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- Attentive Multi-Layer Perceptron for Non-autoregressive Generation [46.14195464583495]
Non-autoregressive (NAR) generation gains increasing popularity for its efficiency and growing efficacy.
In this paper, we propose a novel variant, Attentive Multi-Layer Perceptron (AMLP), to produce a generation model with linear time and space complexity.
arXiv Detail & Related papers (2023-10-14T06:44:24Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
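As a rough illustration of the length-prediction-driven parallel decoding summarized above, here is a hypothetical PyTorch sketch: a length head estimates the number of output tokens from the acoustic encoder output, frames are pooled into that many token-level states, and a single parallel decoder pass emits all tokens at once. The module names, sigmoid length head, and uniform pooling are simplifying assumptions, not Paraformer's actual predictor.
```python
# Hedged sketch of length-prediction-based parallel decoding; names and shapes
# are assumptions, and the pooling is a placeholder for the paper's predictor.
import torch
import torch.nn as nn

class ParallelDecoderSketch(nn.Module):
    def __init__(self, d_model=256, vocab_size=5000):
        super().__init__()
        self.length_head = nn.Linear(d_model, 1)  # per-frame weight, summed to a token count
        self.decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def forward(self, enc_out):
        # enc_out: (1, n_frames, d_model) acoustic encoder output
        weights = torch.sigmoid(self.length_head(enc_out)).squeeze(-1)  # (1, n_frames)
        n_tokens = int(weights.sum().round().clamp(min=1))              # predicted output length
        # pool frames into n_tokens token-level "hidden variables"
        # (uniform pooling here; the paper uses a learned predictor)
        token_states = nn.functional.adaptive_avg_pool1d(
            enc_out.transpose(1, 2), n_tokens).transpose(1, 2)
        token_states = self.decoder(token_states)                       # one parallel pass
        return self.lm_head(token_states).argmax(-1)                    # all tokens at once
```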
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation [135.84684279852098]
Non-Autoregressive (NAR) models significantly underperform Auto-regressive (AR) models on various language generation tasks.
Among the NAR models, BANG is the first model pre-trained at scale on unlabeled raw English text.
We propose a novel self-paced mixed distillation method to further improve the generation quality of BANG.
arXiv Detail & Related papers (2022-05-23T09:54:53Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
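To make the progressive-insertion idea concrete, here is a hypothetical Python sketch of the outer generation loop: starting from constraint keywords, the model predicts one token (or a "no insertion" symbol) for every slot between adjacent tokens in parallel, inserts the non-empty predictions, and repeats until no slot asks for a new token. The function names, the NO_INSERT symbol, and the stand-in predictor are illustrative assumptions, not POINTER's actual code.
```python
# Hedged sketch of an insertion-based, coarse-to-fine generation loop.
# predict_slots is a stand-in for the actual insertion transformer.
from typing import Callable, List

NO_INSERT = "<none>"  # assumed special token meaning "leave this slot empty"

def progressive_insert(keywords: List[str],
                       predict_slots: Callable[[List[str]], List[str]],
                       max_rounds: int = 8) -> List[str]:
    """Start from constraint keywords and repeatedly insert one predicted token
    into every gap between adjacent tokens, all gaps in parallel."""
    tokens = list(keywords)
    for _ in range(max_rounds):
        # one prediction per gap, including before the first and after the last token
        slot_preds = predict_slots(tokens)  # expected length: len(tokens) + 1
        new_tokens, inserted = [], False
        for i, tok in enumerate(tokens):
            if slot_preds[i] != NO_INSERT:
                new_tokens.append(slot_preds[i])
                inserted = True
            new_tokens.append(tok)
        if slot_preds[-1] != NO_INSERT:
            new_tokens.append(slot_preds[-1])
            inserted = True
        tokens = new_tokens
        if not inserted:  # converged: every slot chose <none>
            break
    return tokens
```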
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
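To illustrate the replaced-token-detection objective summarized above, here is a hypothetical helper that builds one training example: a small generator samples plausible replacements at the masked positions, and the discriminator is then trained to label each token as original (0) or replaced (1). The function name and tensor layout are assumptions for this sketch, not ELECTRA's released code.
```python
# Hedged sketch of building a replaced-token-detection training pair.
import torch

def make_rtd_example(input_ids: torch.Tensor,
                     generator_logits: torch.Tensor,
                     mask_positions: torch.Tensor):
    """input_ids: (seq_len,) original token ids
    generator_logits: (seq_len, vocab) generator predictions for the corrupted input
    mask_positions: (seq_len,) bool, True where the input was masked out"""
    # sample replacements from the generator at the masked positions
    sampled = torch.distributions.Categorical(logits=generator_logits).sample()
    corrupted = torch.where(mask_positions, sampled, input_ids)
    # a token counts as "replaced" only if the sample differs from the original,
    # so a lucky correct guess is still labelled as original
    labels = (corrupted != input_ids).long()
    return corrupted, labels  # discriminator input and per-token 0/1 targets
```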
arXiv Detail & Related papers (2020-03-23T21:17:42Z)