Related papers: SAM Decoding: Speculative Decoding via Suffix Automaton

SAM Decoding: Speculative Decoding via Suffix Automaton

URL: http://arxiv.org/abs/2411.10666v3
Date: Mon, 16 Dec 2024 10:48:28 GMT
Title: SAM Decoding: Speculative Decoding via Suffix Automaton
Authors: Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang,
Abstract summary: This paper presents a novel retrieval-based speculative decoding method.<n>It adapts suffix automaton for efficient and accurate draft generation by utilizing common text corpus and dynamic text sequence.<n>Experiments on Spec-Bench show that our method is $18%+$ faster than other retrieval-based SD methods.
Score: 22.289906743980445
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on incomplete retrieval resources, inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts suffix automaton (SAM) for efficient and accurate draft generation by utilizing common text corpus and dynamic text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step of SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at our \href{https://github.com/hyx1999/SAM-Decoding}{repository}.

Related papers

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. Del achieves overall speedups of $2.16times$$sim$$2.50times$ over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z)
DReSD: Dense Retrieval for Speculative Decoding [8.220217498103315]
Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Dretrieval for Speculative Decoding (DReSD) is a novel framework that uses approximate nearest neighbour search with contextualised token embeddings.
arXiv Detail & Related papers (2025-02-21T16:32:28Z)
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference [9.143856130336783]
SuffixDecoding is a model-free approach to accelerating large language model (LLM) inference through speculative decoding. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. For a proprietary multi-LLM text-to-token application, SuffixDecoding achieves up to $2.9times$ higher output throughput and $3times$ lower latency than speculative decoding.
arXiv Detail & Related papers (2024-11-07T18:49:33Z)
SAMPa: Sharpness-aware Minimization Parallelized [51.668052890249726]
Sharpness-aware (SAM) has been shown to improve the generalization of neural networks. Each SAM update requires emphsequentially computing two gradients, effectively doubling the per-iteration cost. We propose a simple modification of SAM, termed SAMPa, which allows us to fully parallelize the two gradient computations.
arXiv Detail & Related papers (2024-10-14T16:21:23Z)
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm. Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm. It significantly outperforms existing train-free methods by 30% and even a training method by 25%.
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [54.51932175059004]
We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models. The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus. The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
arXiv Detail & Related papers (2024-02-26T18:59:28Z)
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose textbfSmart textbfParallel textbfAuto-textbfCorrect dtextbfEcoding (SPACE)
arXiv Detail & Related papers (2024-02-19T03:39:10Z)
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $textitSpecTr$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z)
ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z)
A Frustratingly Simple Decoding Method for Neural Text Generation [96.10656449120165]
We introduce a frustratingly simple, super efficient and surprisingly effective decoding method, which we call Frustratingly Simple Decoding (FSD) The idea behind FSD is straightforward: we build an anti-LM based on previously generated text and use this anti-LM to penalize future generation of what has been generated. Despite the simplicity, FSD is surprisingly effective; Experiments show that FSD can outperform the canonical methods to date.
arXiv Detail & Related papers (2023-05-22T03:28:47Z)
Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references. It is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously. We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder. Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
arXiv Detail & Related papers (2020-10-27T17:38:51Z)
Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder. Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.