SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
- URL: http://arxiv.org/abs/2410.06916v1
- Date: Wed, 9 Oct 2024 14:15:30 GMT
- Title: SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
- Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li
- Abstract summary: Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs).
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
- Score: 10.970637831760136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches necessitate either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
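To make the draft-then-verify mechanism described above concrete, here is a minimal, self-contained sketch in Python (NumPy only). It uses a toy random "transformer" and a fixed skip set in place of SWIFT's on-the-fly layer selection; all names (`next_token`, `self_speculative_generate`, `skip_layers`, `gamma`) are illustrative assumptions rather than the paper's code, and the token-by-token verification loop stands in for the target model's single parallel verification pass.

```python
# Minimal sketch of self-speculative decoding via layer skipping (illustrative
# only; not the SWIFT implementation). A fixed skip set replaces SWIFT's
# adaptive layer selection, and verification is shown token by token instead
# of as one parallel forward pass over the drafted sequence.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_LAYERS = 50, 16, 8

EMBED = rng.standard_normal((VOCAB, DIM))
LAYERS = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]
HEAD = rng.standard_normal((DIM, VOCAB))

def next_token(token, skip=frozenset()):
    """Greedy next token; `skip` holds indices of intermediate layers to drop."""
    h = EMBED[token]
    for i, w in enumerate(LAYERS):
        if i in skip:
            continue  # drafting path: skipped layer -> cheaper, approximate pass
        h = np.tanh(h @ w)
    return int(np.argmax(h @ HEAD))

def self_speculative_generate(prompt_token, n_tokens, skip_layers, gamma=4):
    """Draft `gamma` tokens with the layer-skipped model, then let the full
    (target) model verify them, accepting the longest matching prefix."""
    out = [prompt_token]
    while len(out) < n_tokens + 1:
        # 1) Draft: cheap greedy decoding with intermediate layers skipped.
        draft, tok = [], out[-1]
        for _ in range(gamma):
            tok = next_token(tok, skip=skip_layers)
            draft.append(tok)
        # 2) Verify: the full model checks each drafted token; on the first
        #    mismatch it substitutes its own token and drafting restarts.
        tok = out[-1]
        for d in draft:
            target = next_token(tok)
            if target == d:
                out.append(d)       # accepted draft token
                tok = d
            else:
                out.append(target)  # rejected: keep the target model's token
                break
    return out[1:n_tokens + 1]

print(self_speculative_generate(prompt_token=3, n_tokens=12, skip_layers={2, 3, 5}))
```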
Related papers
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates inference of a target large language model (LLM).
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z) - Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z) - Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models [42.891427362223176]
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities.
We propose a novel framework to fully harness the capabilities of LLMs.
We further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework.
arXiv Detail & Related papers (2024-06-17T17:59:43Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters.
We propose Smart Parallel Auto-Correct Decoding (SPACE).
arXiv Detail & Related papers (2024-02-19T03:39:10Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)