Tandem Transformers for Inference Efficient LLMs
- URL: http://arxiv.org/abs/2402.08644v4
- Date: Sun, 20 Oct 2024 15:34:16 GMT
- Title: Tandem Transformers for Inference Efficient LLMs
- Authors: Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
- Abstract summary: We introduce a novel architecture, Tandem transformers, to address the limitations of speculative and parallel decoding.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko.
- Score: 49.75726447408795
- License:
- Abstract: The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
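The draft-then-verify loop that the abstract refers to can be illustrated with a minimal sketch. This is a generic greedy variant of speculative decoding, not the Tandem architecture itself: the small model here does not attend to the large model's representations, and the "models" are hypothetical toy functions (`toy_target`, `toy_draft`) that map a token prefix to a next token. In a real system the verification of a block would be a single parallel forward pass of the large model; the inner loop below only emulates that.

```python
from typing import Callable, List, Tuple

# A "model" here is any function from a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], num_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def draft_and_verify(draft: Model, target: Model, prompt: List[int],
                     num_new: int, block_size: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding.

    Returns the decoded tokens and the number of verification rounds, i.e. how
    many block-mode passes of the large model a real system would need.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new
    rounds = 0
    while len(tokens) < goal:
        # 1) The small model drafts a block of tokens autoregressively.
        proposal: List[int] = []
        for _ in range(block_size):
            proposal.append(draft(tokens + proposal))

        # 2) The large model checks the block (emulated position by position).
        rounds += 1
        accepted: List[int] = []
        for i in range(block_size + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)
            # Stop after the bonus token, or at the first drafted token that
            # disagrees with the target (the target's token replaces it).
            if i == block_size or proposal[i] != expected:
                break
        tokens.extend(accepted)
    return tokens[:goal], rounds


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical stand-in for the small model: wrong on every fifth position."""
    t = toy_target(tokens)
    return (t + 1) % 11 if len(tokens) % 5 == 0 else t


if __name__ == "__main__":
    out, rounds = draft_and_verify(toy_draft, toy_target, [1, 2], num_new=20)
    print(out == greedy_decode(toy_target, [1, 2], 20), rounds)
```

Because every appended token is the target's own greedy choice, the output matches plain greedy decoding with the target model. The saving comes from needing fewer large-model rounds than generated tokens whenever the draft is usually right.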
Related papers
- Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs [11.245862832561176]
Training a high-quality draft model is required to enable inference acceleration via speculative decoding.
We train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64% of the original size.
Our results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4$\times$ speed-up.
arXiv Detail & Related papers (2024-02-29T19:55:06Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- PaSS: Parallel Speculative Sampling [29.23180061749074]
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks.
At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory.
We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
arXiv Detail & Related papers (2023-11-22T18:37:27Z)
- PaLM 2 Technical Report [237.84195343548055]
PaLM 2 is a new state-of-the-art language model.
It has better multilingual and reasoning capabilities.
It is more compute-efficient than its predecessor PaLM.
arXiv Detail & Related papers (2023-05-17T17:46:53Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- Video Prediction by Efficient Transformers [14.685237010856953]
We present a new family of Transformer-based models for video prediction.
Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models.
arXiv Detail & Related papers (2022-12-12T16:46:48Z)
- Fast Inference from Transformers via Speculative Decoding [3.950600027250452]
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model.
In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel.
arXiv Detail & Related papers (2022-11-30T17:33:28Z)
- LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning [146.51221523793342]
LightPAFF uses two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model.
LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
arXiv Detail & Related papers (2020-04-27T14:00:09Z)
- LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.