PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
- URL: http://arxiv.org/abs/2504.04104v1
- Date: Sat, 05 Apr 2025 08:31:10 GMT
- Title: PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
- Authors: Haofei Yin, Mengbai Xiao, Rouzhou Lu, Xiao Zhang, Dongxiao Yu, Guanghui Zhang
- Abstract summary: We propose a speculative decoding framework named PipeDec to address the low global resource utilization of single tasks in pipeline deployments. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. Experiments were conducted using LLama3.2 1B as the draft model in conjunction with a 14-stage parallel pipeline to accelerate LLama3.1 70B on six different types of datasets.
- Score: 20.212041940314016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive large language model inference primarily consists of two stages: pre-filling and decoding. Decoding involves sequential computation for each token, which leads to significant latency. Speculative decoding is a technique that leverages a draft model combined with large-model verification to enhance parallelism without sacrificing accuracy. However, existing external prediction methods face challenges in adapting to multi-node serial deployments. While they can maintain their speedup under such conditions, the high latency of multi-node deployments ultimately results in low overall efficiency. We propose a speculative decoding framework named PipeDec to address the low global resource utilization of single tasks in pipeline deployments, thereby reducing decoding latency. We integrate a draft model into the pipeline of the large model and immediately forward each prediction from the draft model to subsequent pipeline stages. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. This approach leverages the draft model's predictions so that all pipeline nodes decode a single task in parallel. Experiments were conducted using LLama3.2 1B as the draft model in conjunction with a 14-stage parallel pipeline to accelerate LLama3.1 70B on six different types of datasets. During the decoding phase of a single task, PipeDec achieved a 4.46x-7.79x speedup over traditional pipeline parallelism and a 2.2x-2.69x speedup over baseline tree-based speculative decoding methods. The code will be released after the review process.
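As a rough, hypothetical illustration of the data structure the abstract describes, the sketch below implements a minimal dynamic prediction tree: draft tokens are appended as children of existing nodes, and once the large model verifies a prefix, every branch that disagrees with the accepted tokens is pruned. All class and method names are invented for this sketch and are not taken from PipeDec's code; the per-stage pipelining itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PredictionNode:
    """One draft token in the speculation tree (names are hypothetical)."""
    token: int
    children: Dict[int, "PredictionNode"] = field(default_factory=dict)

    def add_child(self, token: int) -> "PredictionNode":
        # Reuse an existing branch if the draft model repeats a prediction.
        return self.children.setdefault(token, PredictionNode(token))


class PredictionTree:
    """Minimal dynamic prediction tree: extend with draft tokens, prune on verification."""

    def __init__(self) -> None:
        self.root = PredictionNode(token=-1)  # sentinel node, holds no real token

    def extend(self, path: List[int], new_token: int) -> None:
        """Append a draft token below the node reached by `path` (the root if empty)."""
        self._walk(path).add_child(new_token)

    def prune_to(self, accepted: List[int]) -> None:
        """Keep only the branch matching the tokens the large model accepted."""
        node = self.root
        for tok in accepted:
            child = node.children.get(tok)
            if child is None:                 # speculation diverged here
                node.children.clear()
                return
            node.children = {tok: child}      # drop all sibling branches
            node = child

    def _walk(self, path: List[int]) -> PredictionNode:
        node = self.root
        for tok in path:
            node = node.add_child(tok)
        return node


# Toy usage: the draft model speculates two continuations, verification keeps one.
tree = PredictionTree()
tree.extend([], 11)        # draft proposes token 11
tree.extend([11], 42)      # draft proposes 11 -> 42
tree.extend([11], 43)      # draft proposes 11 -> 43 (alternative branch)
tree.prune_to([11, 42])    # verification accepted 11, 42; the 43 branch is pruned
```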
Related papers
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism scheme for accelerating the training of large models.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to address the efficiency challenges of iterative refinement.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
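To make that contrast concrete, here is a toy sketch comparing an auto-regressive drafter (one forward pass per draft token) with a parallel drafter that proposes k tokens in a single pass; the callables are placeholders, not ParallelSpec's actual interfaces.

```python
from typing import Callable, List

# Placeholder drafter interfaces (not ParallelSpec's real API):
#   next_token(prefix)  -> one draft token        (auto-regressive drafter)
#   next_k(prefix, k)   -> k draft tokens at once (parallel drafter)

def draft_autoregressive(prefix: List[int], k: int,
                         next_token: Callable[[List[int]], int]) -> List[int]:
    """k sequential forward passes: each draft token conditions on the previous one."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(next_token(prefix + draft))
    return draft


def draft_parallel(prefix: List[int], k: int,
                   next_k: Callable[[List[int], int], List[int]]) -> List[int]:
    """A single forward pass proposes all k draft tokens for the verifier to check."""
    return next_k(prefix, k)


# Toy stand-ins so the sketch runs: a "model" that just counts upward.
print(draft_autoregressive([1, 2, 3], 4, lambda p: p[-1] + 1))                          # [4, 5, 6, 7]
print(draft_parallel([1, 2, 3], 4, lambda p, k: [p[-1] + i for i in range(1, k + 1)]))  # [4, 5, 6, 7]
```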
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference [35.730941605490194]
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens. This paper explores the novel integration of speculative decoding with beam sampling.
arXiv Detail & Related papers (2024-09-25T02:20:42Z)
- PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to 4.43x and 1.50x compared to auto-regressive decoding and vanilla speculative decoding, respectively.
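The pre-verify/post-verify idea can be caricatured as overlapping drafting and verification instead of strictly alternating them. The sketch below is a schematic, thread-based rendering with placeholder drafter and verifier callables; it is not PEARL's implementation, which runs the draft and target models in parallel across devices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Placeholder callables standing in for the two models (not PEARL's real interfaces):
#   draft_one(prefix)         -> one draft token
#   verify(prefix, proposal)  -> number of proposed tokens accepted by the target model

def speculative_step(prefix: List[int], k: int,
                     draft_one: Callable[[List[int]], int],
                     verify: Callable[[List[int], List[int]], int]) -> List[int]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = draft_one(prefix)
        # Pre-verify: start checking the first draft token while drafting continues.
        pre = pool.submit(verify, prefix, [first])

        rest: List[int] = []
        for _ in range(k - 1):
            rest.append(draft_one(prefix + [first] + rest))

        if pre.result() == 0:
            return prefix                     # first token rejected; discard the tail
        # Post-verify (schematic): check the remaining draft tokens; in PEARL the
        # drafter would already be producing the next window during this call.
        accepted = verify(prefix + [first], rest)
        return prefix + [first] + rest[:accepted]


# Toy models: the drafter counts upward and the verifier accepts everything.
print(speculative_step([1, 2, 3], 4,
                       draft_one=lambda p: p[-1] + 1,
                       verify=lambda p, prop: len(prop)))   # [1, 2, 3, 4, 5, 6, 7]
```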
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
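Layer freezing itself is a standard trick; a minimal PyTorch-style sketch is below. The helper name, the checkpoint used, and the choice of how many layers to freeze are illustrative assumptions, not Telly's actual selection strategy.

```python
import torch
from transformers import AutoModel

def freeze_bottom_layers(model, num_frozen: int) -> None:
    """Disable gradients for the embeddings and the first `num_frozen` encoder layers."""
    for param in model.embeddings.parameters():
        param.requires_grad = False
    for layer in model.encoder.layer[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

# Illustrative choice of model and depth: freeze the bottom 8 of 12 layers.
model = AutoModel.from_pretrained("microsoft/codebert-base")
freeze_bottom_layers(model, num_frozen=8)

# Only the parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
```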
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
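Token-level pipelining splits the sequence dimension, rather than the batch, into slices that stream through the pipeline stages. The toy, purely sequential sketch below only shows that slicing; TeraPipe additionally overlaps slices across devices and handles the causal-attention dependencies between them.

```python
from typing import Callable, List

def token_level_pipeline(tokens: List[int],
                         stages: List[Callable[[List[int]], List[int]]],
                         slice_len: int) -> List[int]:
    """Cut the sequence into token slices and push each slice through every stage.
    A real token-level pipeline overlaps these iterations across devices and feeds
    each slice the attention keys/values of the slices before it; this loop only
    illustrates the slicing along the token dimension."""
    slices = [tokens[i:i + slice_len] for i in range(0, len(tokens), slice_len)]
    out: List[int] = []
    for sl in slices:                 # in a real pipeline these iterations overlap
        for stage in stages:
            sl = stage(sl)
        out.extend(sl)
    return out

# Toy stages that just transform token ids.
stages = [lambda s: [t + 1 for t in s], lambda s: [t * 2 for t in s]]
print(token_level_pipeline(list(range(8)), stages, slice_len=3))   # [2, 4, 6, 8, 10, 12, 14, 16]
```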
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
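The two ingredients named in the summary, redundant recomputation and out-of-core activation storage, can be approximated with stock PyTorch utilities, as in the generic sketch below; this is not KARMA's capacity-aware scheduling, only the building blocks it combines.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Generic ingredients (not KARMA's own scheduler):
#  (1) recompute activations during backward instead of storing them,
#  (2) keep the activations that are stored in host (CPU) memory.

model_part = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(32, 1024, requires_grad=True)

# (1) Activation recomputation: the forward pass runs twice, memory is saved once.
y = checkpoint(model_part, x, use_reentrant=False)

# (2) Out-of-core storage: tensors saved for backward live on the CPU.  This only
# pays off when the activations would otherwise occupy GPU memory; the sketch
# stays on CPU so it runs anywhere.
with torch.autograd.graph.save_on_cpu():
    z = model_part(y)

z.sum().backward()
```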
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.