Pipelined Decoder for Efficient Context-Aware Text Generation
- URL: http://arxiv.org/abs/2506.23431v2
- Date: Tue, 01 Jul 2025 04:16:14 GMT
- Title: Pipelined Decoder for Efficient Context-Aware Text Generation
- Authors: Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
- Abstract summary: An autoregressive model requires the generation of a new token depending on all the previously generated tokens. We propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks.
- Score: 10.156322390341478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the basis of generative AI, an autoregressive model requires the generation of each new token to depend on all previously generated tokens, which brings high quality but also restricts the model to generating tokens one by one, creating a bottleneck that limits generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously and, at each time-step, generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves generation speed without a significant loss of generation quality or additional memory consumption.
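To make the scheduling idea concrete, here is a minimal, model-free sketch of pipelined generation: several subsequences are decoded at once, and at every time-step each active subsequence receives exactly one new token. The staggered starts, the `predict_next` stub, and the stopping rule are assumptions made for this illustration, not the paper's actual architecture.

```python
# Toy, model-free illustration of pipelined decoding (not the paper's code).
# Several subsequences are generated at once; at each time-step every active
# subsequence receives exactly one new token. The staggered start schedule and
# the random `predict_next` stub are assumptions made for this sketch.

import random

EOS = "<eos>"
VOCAB = ["the", "model", "answers", "quickly", EOS]

def predict_next(context, prefix):
    """Placeholder for a real decoder call: one next token for one prefix."""
    return random.choice(VOCAB)

def pipelined_decode(context, num_subsequences=4, max_steps=12):
    # One prefix per subsequence; subsequence i enters the pipeline at step i.
    prefixes = [[] for _ in range(num_subsequences)]
    finished = [False] * num_subsequences

    for step in range(max_steps):
        for i in range(num_subsequences):
            if step < i or finished[i]:
                continue  # not started yet, or already ended with EOS
            token = predict_next(context, prefixes[i])
            if token == EOS:
                finished[i] = True
            else:
                prefixes[i].append(token)
        if all(finished):
            break
    return [" ".join(p) for p in prefixes]

if __name__ == "__main__":
    random.seed(0)
    for sub in pipelined_decode("Why is the sky blue?"):
        print(sub)
```

In a real implementation, the per-prefix `predict_next` calls of one time-step would presumably be replaced by a single batched decoder forward pass over all active prefixes and the shared context.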
Related papers
- AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism [17.858104076062897]
Large language models (LLMs) are increasingly used for long-content generation.
We propose AdaDecode, which accelerates decoding without requiring auxiliary models or changes to the original model parameters.
AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup.
arXiv Detail & Related papers (2025-06-04T08:32:30Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Hierarchical Skip Decoding for Efficient Autoregressive Text Generation [9.16858904192541]
We propose a novel decoding strategy named Hierarchical Skip Decoding (HSD) for efficient autoregressive text generation.
With almost half of the layers skipped, HSD can sustain 90% of the text quality compared to vanilla autoregressive decoding.
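The snippet above does not spell out which layers HSD skips or when; the sketch below only illustrates the general pattern of running a truncated decoder stack, with a made-up fixed keep ratio standing in for HSD's hierarchical schedule.

```python
# Generic layer-skipping sketch (not HSD's actual schedule): run only the
# bottom fraction of a decoder stack. The fixed `keep_ratio` is an assumption
# standing in for the adaptive, hierarchical policy described in the paper.

from typing import Callable, List

Layer = Callable[[List[float]], List[float]]

def forward_with_skips(layers: List[Layer], hidden: List[float],
                       keep_ratio: float = 0.5) -> List[float]:
    """Apply only the first `keep_ratio` fraction of layers to `hidden`."""
    num_kept = max(1, int(len(layers) * keep_ratio))
    for layer in layers[:num_kept]:
        hidden = layer(hidden)
    return hidden

if __name__ == "__main__":
    # Twelve dummy "layers" that each add a constant; roughly half are skipped.
    layers = [lambda h, c=i: [x + c for x in h] for i in range(12)]
    print(forward_with_skips(layers, [0.0, 1.0]))  # -> [15.0, 16.0]
```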
arXiv Detail & Related papers (2024-03-22T02:44:05Z)
- Self-Infilling Code Generation [60.12883980846781]
We introduce self-infilling code generation, a general framework that incorporates infilling operations into auto-regressive decoding.
We utilize this capability to introduce novel interruption and looping mechanisms in conventional decoding.
Our proposed decoding process is effective in enhancing both regularity and quality across several code generation benchmarks.
arXiv Detail & Related papers (2023-11-29T16:02:06Z)
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs [1.8047694351309207]
FastFit is a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs).
We show that FastFit achieves nearly twice the generation speed of baseline vocoders while maintaining high sound quality.
arXiv Detail & Related papers (2023-05-18T09:05:17Z)
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show good performance on sequence-to-sequence generation in terms of both text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- Towards Generating Real-World Time Series Data [52.51620668470388]
We propose RTSGAN, a novel generative framework for time series data generation.
RTSGAN learns an encoder-decoder module which provides a mapping between a time series instance and a fixed-dimension latent vector.
To generate time series with missing values, we further equip RTSGAN with an observation embedding layer and a decide-and-generate decoder.
arXiv Detail & Related papers (2021-11-16T11:31:37Z)
- Parallel Refinements for Lexically Constrained Text Generation with BART [0.0]
We propose Constrained BART (CBART) for lexically constrained text generation.
CBART transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality.
Experiment results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while significantly accelerating inference.
arXiv Detail & Related papers (2021-09-26T03:56:45Z)
- Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing a competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
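As a rough illustration of that insertion pattern, the toy loop below fills every gap between adjacent tokens once per round until nothing more is proposed; the hard-coded `propose_insertion` table and the stopping rule are placeholders for POINTER's learned insertion model.

```python
# Toy coarse-to-fine insertion loop in the spirit of the snippet above.
# The `propose_insertion` lookup table and the stopping rule are assumptions
# of this sketch; the real POINTER model predicts insertions with a pretrained
# network.

def propose_insertion(left, right):
    """Placeholder for a learned model: token to insert between two tokens."""
    table = {("the", "sat"): "cat", ("cat", "sat"): "quietly",
             ("sat", "mat"): "on"}
    return table.get((left, right))

def progressive_insertion(tokens, max_rounds=5):
    for _ in range(max_rounds):
        # Propose one insertion per gap, all gaps considered in parallel.
        proposals = [propose_insertion(l, r)
                     for l, r in zip(tokens, tokens[1:])]
        if all(p is None for p in proposals):
            break  # nothing left to insert: generation is complete
        new_tokens = []
        for tok, prop in zip(tokens, proposals + [None]):
            new_tokens.append(tok)
            if prop is not None:
                new_tokens.append(prop)
        tokens = new_tokens
    return tokens

if __name__ == "__main__":
    # Hard constraints: these keywords must appear in the final sentence.
    print(progressive_insertion(["the", "sat", "mat"]))
```

Running it on the constraint tokens `["the", "sat", "mat"]` grows the sequence coarse-to-fine into `the cat quietly sat on mat`.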
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments shows that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
arXiv Detail & Related papers (2020-04-14T06:25:36Z)