EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models
- URL: http://arxiv.org/abs/2405.07542v2
- Date: Mon, 14 Oct 2024 02:55:33 GMT
- Title: EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models
- Authors: Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
- Abstract summary: The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples.
We propose a novel method that resolves the issue of inconsistent numbers of accepted tokens across samples without increasing memory or compute overhead.
Our method also handles the case where the prediction tokens of different samples are inconsistent, without the need to add padding tokens.
- Score: 40.651650382105636
- License:
- Abstract: Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked due to varying numbers of accepted tokens within a batch in the verification phase. The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples. However, this increases the computational and memory access overhead, thereby reducing the speedup ratio. We propose a novel method that resolves the issue of inconsistent tokens accepted by different samples without increasing memory or computing overhead. Furthermore, our method handles the situation where the prediction tokens of different samples are inconsistent, without the need to add padding tokens. Extensive experiments demonstrate the efficacy of our method. Our code is available at https://github.com/niyunsheng/EMS-SD.
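The contrast between the two update strategies can be shown with a small bookkeeping sketch (a minimal illustration, not the authors' implementation; the helper names and pad id are hypothetical): the vanilla path pads every sample's accepted tokens to the longest run in the batch, while the unpadded path packs the accepted tokens and tracks per-sample KV-cache lengths instead.

```python
# Minimal bookkeeping sketch of one multi-sample verification step (illustrative only).
# `accepted` holds the tokens each sample accepted; the pad id and helper names are
# hypothetical, not taken from the EMS-SD code.
PAD_ID = 0

def vanilla_update(accepted: list[list[int]]) -> list[list[int]]:
    """Pad every sample to the longest accepted run (extra compute and memory access)."""
    longest = max(len(toks) for toks in accepted)
    return [toks + [PAD_ID] * (longest - len(toks)) for toks in accepted]

def unpadded_update(accepted: list[list[int]], kv_lens: list[int]):
    """Pack accepted tokens into one flat stream and track per-sample KV-cache lengths."""
    packed = [tok for toks in accepted for tok in toks]
    new_lens = [n + len(toks) for n, toks in zip(kv_lens, accepted)]
    return packed, new_lens                          # no pad tokens written

if __name__ == "__main__":
    accepted = [[11, 12, 13], [21], [31, 32]]        # samples accept 3, 1, 2 tokens
    print(vanilla_update(accepted))                  # every row padded to length 3
    print(unpadded_update(accepted, [10, 10, 10]))   # packed stream + updated cache lengths
```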
Related papers
- Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens.
We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions.
Our method significantly boosts prediction accuracy and achieves higher inference speedups.
arXiv Detail & Related papers (2025-02-10T09:24:06Z)
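One illustrative reading of the decoupled multi-head MoE drafting idea, offered as a sketch rather than Jakiro's released architecture: independent expert MLPs map the last hidden state to vocabulary logits, a router mixes them, and each expert's argmax yields a diverse draft candidate.

```python
# Illustrative MoE-style drafting head (a sketch under stated assumptions, not Jakiro's code).
import torch
import torch.nn as nn

class MoEDraftHead(nn.Module):
    def __init__(self, d_model: int, vocab: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, vocab))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor):
        gate = torch.softmax(self.router(h), dim=-1)               # [batch, n_experts]
        logits = torch.stack([e(h) for e in self.experts], dim=1)  # [batch, n_experts, vocab]
        mixed = (gate.unsqueeze(-1) * logits).sum(dim=1)           # routed mixture of experts
        diverse = logits.argmax(dim=-1)                            # one draft candidate per expert
        return mixed, diverse

h = torch.randn(2, 768)                                            # hypothetical hidden states
mixed_logits, candidates = MoEDraftHead(768, 32000)(h)
```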
- TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification [2.3183978396999967]
This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noises to the ViT architecture.
The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency.
arXiv Detail & Related papers (2025-01-15T07:14:02Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model [20.979790612689992]
Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs).
Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token.
This paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC).
arXiv Detail & Related papers (2024-06-28T13:20:17Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models [54.72004797421481]
We conduct the first systematic study to explore a decoding strategy specialized in code generation.
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling.
Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategies.
arXiv Detail & Related papers (2023-09-06T06:27:33Z)
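A hedged illustration of adaptive temperature sampling (the paper's exact schedule is not reproduced; the entropy-based rule below is an assumption): the temperature is lowered where the model is confident and raised where it is uncertain.

```python
# Illustrative adaptive-temperature sampling; the entropy-based temperature rule is an
# assumption for illustration, not the AdapT schedule from the paper.
import torch

def adaptive_temperature_sample(logits: torch.Tensor, t_low: float = 0.2, t_high: float = 1.0):
    """Pick a per-token temperature from the model's confidence, then sample."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)        # uncertainty per position
    norm = entropy / torch.log(torch.tensor(float(logits.shape[-1])))   # normalized to [0, 1]
    temp = t_low + (t_high - t_low) * norm                              # confident -> cooler sampling
    scaled = torch.softmax(logits / temp.unsqueeze(-1), dim=-1)
    return torch.multinomial(scaled, num_samples=1)

next_token = adaptive_temperature_sample(torch.randn(1, 32000))         # hypothetical vocab size
```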
- Token Sparsification for Faster Medical Image Segmentation [37.25161294917211]
We reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline.
STP predicts importance scores with a lightweight sub-network and samples the top-K tokens.
MTA restores a full token sequence by assembling both sparse output tokens and pruned multi-layer intermediate ones.
arXiv Detail & Related papers (2023-03-11T23:59:13Z)
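A sketch in the spirit of STP's importance-scored selection (the scorer architecture and the value of K are assumptions): a lightweight sub-network scores every token, only the top-K are kept, and their positions are retained so the full sequence can later be restored.

```python
# Sketch of importance-based token selection; the scorer design and K are assumptions.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Lightweight sub-network that scores each token's importance and keeps the top-K."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, tokens: torch.Tensor, k: int):
        scores = self.score(tokens).squeeze(-1)                     # [batch, n_tokens]
        keep = scores.topk(k, dim=-1).indices                       # indices of kept tokens
        sparse = torch.gather(
            tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )
        return sparse, keep                                          # kept tokens + positions for restoration

tokens = torch.randn(2, 196, 256)                                    # hypothetical patch tokens
sparse_tokens, kept_idx = TokenScorer(256)(tokens, k=49)
```

- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]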
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
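ELECTRA's replaced-token-detection objective can be sketched with toy tensors (model internals elided; only the corruption and label construction are shown):

```python
# Minimal sketch of replaced token detection: a generator fills masked positions and the
# discriminator labels every token as original vs. replaced. Values here are toy data.
import torch

def build_rtd_targets(input_ids, generator_samples, mask_positions):
    """Corrupt masked positions with generator samples and mark which tokens changed."""
    corrupted = input_ids.clone()
    corrupted[mask_positions] = generator_samples               # plausible replacements
    is_replaced = (corrupted != input_ids).float()              # discriminator targets
    return corrupted, is_replaced

input_ids = torch.tensor([[5, 8, 2, 9, 4]])
mask_positions = torch.tensor([[False, True, False, True, False]])
generator_samples = torch.tensor([7, 9])                        # the generator may resample the original
corrupted, labels = build_rtd_targets(input_ids, generator_samples, mask_positions)
# labels are 1 where the token differs from the original and 0 elsewhere; the discriminator
# is trained with binary cross-entropy over every position.
```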