Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference
- URL: http://arxiv.org/abs/2407.09722v1
- Date: Fri, 12 Jul 2024 23:29:54 GMT
- Title: Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference
- Authors: Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun,
- Abstract summary: Large language models (LLMs) have demonstrated their power in various tasks, but their inference incurs significant time and energy costs.
speculative decoding uses a smaller model to propose one sequence of tokens, which are subsequently validated in batch by the target large model.
Compared with autoregressive decoding, speculative decoding generates the same number of tokens with fewer runs of the large model.
An algorithm that has better output perplexity and even better efficiency than speculative decoding can be more useful in practice.
- Score: 41.93955876156331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based Large language models (LLMs) have demonstrated their power in various tasks, but their inference incurs significant time and energy costs. To accelerate LLM inference, speculative decoding uses a smaller model to propose one sequence of tokens, which are subsequently validated in batch by the target large model. Compared with autoregressive decoding, speculative decoding generates the same number of tokens with fewer runs of the large model, hence accelerating the overall inference by $1$-$2\times$. However, greedy decoding is not the optimal decoding algorithm in terms of output perplexity, which is a direct measurement of the effectiveness of a decoding algorithm. An algorithm that has better output perplexity and even better efficiency than speculative decoding can be more useful in practice. To achieve this seemingly contradictory goal, we first introduce multi-token joint greedy decoding (MJGD), which greedily generates multiple tokens at each step based on their joint perplexity. We show that it leads to better perplexity for the whole output. But the computation cost of MJGD is infeasible in practice. So we further propose multi-token joint speculative decoding (MJSD), which approximates and accelerates the MJGD from two aspects: it approximates the joint distribution of the large model with that of a small model, and uses a verification step to guarantee the accuracy of approximation; then it uses beam decoding to accelerate the sequence generation from the joint distribution. Compared with vanilla speculative decoding, MJSD has two advantages: (1) it is an approximation of MJGD, thus achieving better output perplexity; (2) verification with joint likelihood allows it to accept the longest prefix sub-sequence of the draft tokens with valid perplexity, leading to better efficiency...
Related papers
- Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs [37.94110023657587]
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models.<n>ExistingDLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation.<n>We introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding inDLLMs.
arXiv Detail & Related papers (2025-07-24T16:51:33Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference.<n>CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences.<n>We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z) - L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [69.1271366892683]
We propose leap multi-token prediction(L-MTP), an innovative token prediction method.<n>Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass.<n>We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z) - Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel antenna system ()-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed.
It consists of multiple waveguides, which equip numerous low-cost antennas, named (PAs)
The positions of PAs can be reconfigured to both spanning large-scale path and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z) - A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is emphSpeculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate.
This paper tackles this gap by conceptualizing the decoding problem via markov chain abstraction and studying the key properties, emphoutput quality and inference acceleration, from a theoretical perspective.
arXiv Detail & Related papers (2024-10-30T01:53:04Z) - SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs)
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z) - MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose MC-MoE, a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs)
MC-MoE leverages the significance of both experts and tokens to achieve an extreme compression.
For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding [14.175444025026508]
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting.
generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference.
We propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
arXiv Detail & Related papers (2024-09-13T06:29:20Z) - Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding [2.642212767247493]
We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens.
ANPD preserves the integrity of the original output while enhancing processing speed.
In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x.
arXiv Detail & Related papers (2024-04-10T16:11:09Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language
Models [37.09385961422664]
Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and extended latency.
We present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
The proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$times$ speedup on the MT-Bench benchmark.
arXiv Detail & Related papers (2024-01-23T06:36:49Z) - MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning [28.12788291168137]
We present a multi-task fine-tuning framework, MFTcoder, that enables simultaneous and parallel fine-tuning on multiple tasks.
Experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks.
arXiv Detail & Related papers (2023-11-04T02:22:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.