FastDraft: How to Train Your Draft
- URL: http://arxiv.org/abs/2411.11055v1
- Date: Sun, 17 Nov 2024 12:32:44 GMT
- Title: FastDraft: How to Train Your Draft
- Authors: Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh,
- Abstract summary: We introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model.
We demonstrate FastDraft by training two highly parameter-efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models.
Using FastDraft, we were able to produce a draft by training on approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours.
- Abstract: Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models (LLMs). However, Speculative Decoding entirely relies on the availability of efficient draft models, which are often lacking for many existing language models due to the stringent constraint of vocabulary compatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter-efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft by training on approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours. Our results show that the draft model achieves impressive results in the key metrics of acceptance rate and block efficiency, with up to a 3x memory-bound speedup when evaluated on code completion and up to 2x in summarization, text completion and instruction tasks. We validate our theoretical findings through benchmarking on the latest Intel® Core™ Ultra, achieving a wall-clock time speedup of up to 2x, indicating a significant reduction in runtime. Due to its high quality, FastDraft unlocks large language model inference on AI-PCs and other edge devices.
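For readers unfamiliar with the mechanism the abstract assumes, the sketch below shows a minimal greedy-verification speculative decoding step. It is illustrative only, not FastDraft's implementation: `draft_model` and `target_model` are assumed to be HuggingFace-style causal LMs sharing a vocabulary, and production systems typically use rejection sampling rather than exact greedy matching.

```python
import torch

def speculative_decode_step(target_model, draft_model, input_ids, k=5):
    """One speculative decoding step (greedy variant): the draft proposes
    k tokens, the target verifies them all in a single forward pass, and
    the longest agreeing prefix is accepted."""
    prompt_len = input_ids.shape[1]

    # 1. Draft k candidate tokens autoregressively with the cheap model.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, prompt_len:]                      # (1, k)

    # 2. Verify all k proposals with one target forward pass.
    target_logits = target_model(draft_ids).logits            # (1, L+k, V)
    preds = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and target agree,
    #    then take one bonus token from the target itself.
    matches = (preds == proposed).cumprod(dim=-1)
    n_accepted = int(matches.sum())
    accepted = proposed[:, :n_accepted]
    bonus = target_logits[:, prompt_len - 1 + n_accepted, :].argmax(
        dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1), n_accepted
```

In this loop, the acceptance rate is the fraction of drafted tokens the target accepts, and block efficiency is the average number of tokens emitted per target forward pass (here, `n_accepted + 1`); driving both up is what produces the memory-bound speedups reported in the abstract.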
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
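As a rough illustration of the parallel-drafting idea (a generic sketch, not ParallelSpec's actual architecture), a drafter can propose several future tokens from a single forward pass by attaching multiple prediction heads to the backbone:

```python
import torch
import torch.nn as nn

class ParallelDraftHeads(nn.Module):
    """Illustrative parallel drafter: instead of drafting tokens one by
    one, k linear heads predict the next k tokens from a single hidden
    state, so all draft positions are produced in one parallel pass."""
    def __init__(self, hidden_size: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(k)
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) from the backbone's final
        # position; returns logits of shape (batch, k, vocab_size).
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)
```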
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens [4.5888031410244885]
We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT).
The primary objective of this design is to enhance the LLM's ability to generate draft tokens more accurately without compromising its accuracy.
Experiments conducted on the CodeLlama-13B and 7B models have yielded speed increases of over 3.5x and 3.0x, respectively.
arXiv Detail & Related papers (2024-03-27T14:54:27Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose Chimera, a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to 2.8x over speculative decoding and 3.9x over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
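A minimal sketch of this batched verification is shown below, assuming greedy decoding and HuggingFace-style models; the paper's algorithms additionally preserve the target's sampling distribution, which this sketch does not attempt.

```python
import torch

def verify_candidates(target_model, input_ids, candidates):
    """Illustrative multi-candidate verification (greedy variant): n draft
    continuations are stacked into one batch, checked with a single target
    forward pass, and the candidate with the longest accepted prefix wins."""
    n, k = candidates.shape                        # n candidates, k tokens each
    prompt_len = input_ids.shape[1]
    batch = torch.cat([input_ids.expand(n, -1), candidates], dim=-1)

    logits = target_model(batch).logits            # (n, L+k, V)
    preds = logits[:, prompt_len - 1:-1, :].argmax(dim=-1)
    accepted_len = (preds == candidates).cumprod(dim=-1).sum(dim=-1)

    best = int(accepted_len.argmax())              # keep the best candidate
    n_acc = int(accepted_len[best])
    return torch.cat(
        [input_ids, candidates[best, :n_acc].unsqueeze(0)], dim=-1)
```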
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks.
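The alignment objective at the core of this approach can be sketched as a simple distillation loss. The helper below shows a forward-KL variant as an illustration; DistillSpec itself studies several divergences and data-generation schemes.

```python
import torch.nn.functional as F

def distill_loss(draft_logits, target_logits, temperature: float = 1.0):
    """Forward-KL distillation sketch: push the draft's next-token
    distributions toward the target's, position by position."""
    t, v = temperature, draft_logits.size(-1)
    draft_logp = F.log_softmax(draft_logits / t, dim=-1).view(-1, v)
    target_p = F.softmax(target_logits / t, dim=-1).view(-1, v)
    # KL(target || draft), averaged over every token position.
    return F.kl_div(draft_logp, target_p, reduction="batchmean") * (t * t)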
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
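A minimal sketch of one such online update follows, reusing the `distill_loss` helper sketched above; the function and argument names are illustrative, not taken from the paper's code.

```python
import torch

def online_draft_update(draft_model, target_model, optimizer, query_batch):
    """One online update step: run the frozen target on fresh user queries,
    then distill its next-token distributions into the draft so the draft
    tracks the live query distribution."""
    with torch.no_grad():                          # target stays frozen
        target_logits = target_model(query_batch).logits
    draft_logits = draft_model(query_batch).logits
    loss = distill_loss(draft_logits, target_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```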
arXiv Detail & Related papers (2023-10-11T04:03:42Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter, an independent model specially optimized for efficient drafting, and Spec-Verification, a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.