SpecTr: Fast Speculative Decoding via Optimal Transport
- URL: http://arxiv.org/abs/2310.15141v2
- Date: Thu, 18 Jan 2024 04:42:34 GMT
- Title: SpecTr: Fast Speculative Decoding via Optimal Transport
- Authors: Ziteng Sun and Ananda Theertha Suresh and Jae Hun Ro and Ahmad Beirami
and Himanshu Jain and Felix Yu
- Abstract summary: We develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
- Score: 30.18181671899423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive sampling from large language models has led to
state-of-the-art results in several natural language tasks. However,
autoregressive sampling generates tokens one at a time, making it slow and even
prohibitive in certain tasks. One way to speed up sampling is
$\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$
(block or sequence of tokens), and then score all tokens in the draft by the
large language model in parallel. A subset of the tokens in the draft are
accepted (and the rest rejected) based on a statistical method to guarantee
that the final output follows the distribution of the large model. In this
work, we provide a principled understanding of speculative decoding through the
lens of optimal transport (OT) with $\textit{membership cost}$. This framework
can be viewed as an extension of the well-known $\textit{maximal-coupling}$
problem. This new formulation enables us to generalize the speculative decoding
method to allow for a set of $k$ candidates at the token-level, which leads to
an improved optimal membership cost. We show that the optimal draft selection
algorithm (transport plan) can be computed via linear programming, whose
best-known runtime is exponential in $k$. We then propose a valid draft
selection algorithm whose acceptance probability is $(1-1/e)$-optimal
multiplicatively. Moreover, it can be computed in time almost linear in the size
of the domain of a single token. Using this new draft selection algorithm, we
develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which
provides speedup in decoding while ensuring that there is no quality
degradation in the decoded output. We experimentally demonstrate that for
state-of-the-art large language models, the proposed approach achieves a wall
clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on
standard benchmarks.
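To make the acceptance rule concrete, the following Python/NumPy sketch implements single-draft speculative sampling at the token level, plus a naive sequential extension to $k$ i.i.d. drafts. It is an illustration under stated assumptions, not the paper's code: `p` and `q` stand for the large-model and draft-model next-token distributions, and the sequential loop is a simple valid coupling, not the $(1-1/e)$-optimal selection algorithm proposed in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def residual(p, q):
    """Leftover target distribution after a rejected draft from q."""
    r = np.maximum(p - q, 0.0)
    return r / r.sum()

def speculative_token(p, q):
    """Single-draft speculative sampling: output is exactly distributed as p."""
    x = rng.choice(len(q), p=q)                  # draft token from the small model
    if rng.random() < min(1.0, p[x] / q[x]):     # accept w.p. min(1, p(x)/q(x))
        return x
    return rng.choice(len(p), p=residual(p, q))  # else sample the residual

def sequential_k_draft_token(p, q, k):
    """Naive sequential selection over k i.i.d. drafts from q.

    Each stage couples q with the current residual target, so the output
    is still exactly distributed as p; SpecTr's selection rule achieves a
    provably higher acceptance probability than this simple loop.
    """
    target = p.copy()
    for _ in range(k):
        x = rng.choice(len(q), p=q)
        if rng.random() < min(1.0, target[x] / q[x]):
            return x
        target = residual(target, q)
    return rng.choice(len(p), p=target)

# Sanity check: empirical output frequencies should match p for both schemes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
samples = [sequential_k_draft_token(p, q, k=4) for _ in range(50_000)]
print(np.bincount(samples, minlength=3) / len(samples))  # ~ [0.5, 0.3, 0.2]
```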
Related papers
- SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference [9.143856130336783]
SuffixDecoding is a model-free approach to accelerating large language model (LLM) inference through speculative decoding.
Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models.
For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding.
arXiv Detail & Related papers (2024-11-07T18:49:33Z)
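As a rough illustration of the model-free idea above (not the paper's implementation, which uses a proper suffix tree over prior outputs), one can index previously generated token sequences by their recent context and propose the stored continuation as a draft; the class and parameter names below are hypothetical.
```python
from collections import defaultdict

class SuffixDraftIndex:
    """Toy suffix-match drafter: proposes continuations seen after the
    same recent context in earlier outputs."""

    def __init__(self, context_len=4):
        self.context_len = context_len
        self.table = defaultdict(list)  # context n-gram -> continuations

    def observe(self, tokens):
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.table[key].append(tokens[i + n:i + n + 8])  # store a short continuation

    def draft(self, context):
        key = tuple(context[-self.context_len:])
        candidates = self.table.get(key)
        if not candidates:
            return []  # no speculation; fall back to normal decoding
        # Heuristic: propose the longest previously seen continuation.
        return max(candidates, key=len)

# Usage: index a previously generated sequence, then draft for a new prompt.
idx = SuffixDraftIndex()
idx.observe([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(idx.draft([0, 1, 2, 3, 4]))  # -> [5, 6, 7, 8, 9, 10]
```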
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of $1.73\times$ to $1.96\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
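For intuition on why a graph helps: merging drafts that share prefixes deduplicates tokens before the single batched verification pass. The prefix-trie sketch below illustrates that bookkeeping (the paper's DAG is more general, as it can also merge re-converging hypotheses); all names are illustrative.
```python
def merge_drafts(drafts):
    """Merge token drafts into a prefix trie so shared prefixes are
    verified once. Returns the list of unique (prefix) states that a
    single batched forward pass would need to score."""
    root = {}
    for draft in drafts:
        node = root
        for tok in draft:
            node = node.setdefault(tok, {})
    # Flatten the trie into unique verification states.
    states, stack = [], [((), root)]
    while stack:
        prefix, node = stack.pop()
        for tok, child in node.items():
            states.append(prefix + (tok,))
            stack.append((prefix + (tok,), child))
    return states

drafts = [[5, 7, 2], [5, 7, 9], [5, 1, 2]]
states = merge_drafts(drafts)
print(len(states), "unique states instead of", sum(map(len, drafts)))  # 6 vs 9
```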
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
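For intuition on the input-perturbation approach, here is a generic Gaussian-mechanism sketch for releasing pairwise cosine similarities; it is not the paper's refined algorithm, and the L2-sensitivity bound used to calibrate the noise is deliberately conservative.
```python
import numpy as np

def private_cosine_similarities(A, eps, delta, rng=np.random.default_rng(1)):
    """Gaussian mechanism on the pairwise cosine-similarity matrix.

    Generic input-perturbation sketch, not the paper's algorithm: rows of A
    are normalized, and changing one unit-norm row moves row/column i of the
    n x n similarity matrix by at most 2 per entry, giving the conservative
    L2 sensitivity 2 * sqrt(2 * (n - 1)).
    """
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    n = A.shape[0]
    S = A @ A.T
    sensitivity = 2.0 * np.sqrt(2.0 * (n - 1))
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noise = rng.normal(0.0, sigma, size=(n, n))
    noise = np.triu(noise) + np.triu(noise, 1).T  # keep the output symmetric
    return S + noise

A = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(private_cosine_similarities(A, eps=0.5, delta=1e-6).round(2))
```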
- Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass [72.07642648108849]
Superposed Decoding is a new decoding algorithm that generates $k$ drafts at the cost of one autoregressive inference pass.
Superposed Decoding can be combined with other decoding strategies, resulting in universal coverage gains when scaling inference time compute.
arXiv Detail & Related papers (2024-05-28T17:40:48Z)
- Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{\mathrm{KL}}[Q \Vert P] + o(1))\big/\epsilon\big)$.
arXiv Detail & Related papers (2024-05-07T14:44:41Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference [17.947904697850433]
We present SkipDecode, a token-level early exit method for batch inferencing and Key-Value caching.
It overcomes prior constraints by setting up a single exit point for every token in a batch at each sequence position.
It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens.
arXiv Detail & Related papers (2023-07-05T19:59:09Z)
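A sketch of the scheduling idea (illustrative, not the released implementation): give every sequence position an exit layer that never increases with position, so the KV-cache entries a later token attends to were already computed for earlier tokens.
```python
def exit_layer_schedule(num_layers, max_pos, min_exit, max_exit):
    """Monotonically non-increasing exit layer per sequence position.

    Because position t+1 never exits later than position t, the KV-cache
    entries needed by later tokens were already computed for earlier ones,
    so nothing has to be recomputed. A linear decay is one simple choice.
    """
    assert 0 < min_exit <= max_exit <= num_layers
    schedule = []
    for pos in range(max_pos):
        frac = pos / max(max_pos - 1, 1)
        schedule.append(round(max_exit - frac * (max_exit - min_exit)))
    return schedule

# Every token at batch position t exits at the same layer, enabling batching.
print(exit_layer_schedule(num_layers=24, max_pos=8, min_exit=6, max_exit=20))
# -> [20, 18, 16, 14, 12, 10, 8, 6]
```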
- Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
arXiv Detail & Related papers (2022-10-27T05:52:35Z)
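The rule is simple to state: a token survives truncation only if its probability exceeds $\eta = \min(\epsilon, \sqrt{\epsilon}\, e^{-h})$, where $h$ is the entropy of the next-token distribution and $\epsilon$ is a small hyperparameter. A minimal sketch, following our reading of the paper:
```python
import numpy as np

def eta_sample(p, epsilon=2e-3, rng=np.random.default_rng(2)):
    """Eta-sampling: drop tokens whose probability falls below the
    entropy-dependent threshold eta = min(eps, sqrt(eps) * exp(-H(p))),
    then renormalize and sample."""
    entropy = -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))
    truncated = np.where(p > eta, p, 0.0)
    truncated /= truncated.sum()
    return rng.choice(len(p), p=truncated)

# With epsilon=0.02 the three rarest tokens fall below eta and are pruned.
p = np.array([0.90, 0.05, 0.03, 0.01, 0.005, 0.005])
print(eta_sample(p, epsilon=0.02))
```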
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)