SpecTr: Fast Speculative Decoding via Optimal Transport
- URL: http://arxiv.org/abs/2310.15141v2
- Date: Thu, 18 Jan 2024 04:42:34 GMT
- Title: SpecTr: Fast Speculative Decoding via Optimal Transport
- Authors: Ziteng Sun and Ananda Theertha Suresh and Jae Hun Ro and Ahmad Beirami
and Himanshu Jain and Felix Yu
- Abstract summary: We develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
- Score: 30.18181671899423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive sampling from large language models has led to
state-of-the-art results in several natural language tasks. However,
autoregressive sampling generates tokens one at a time, making it slow and even
prohibitive in certain tasks. One way to speed up sampling is
$\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$
(block or sequence of tokens), and then score all tokens in the draft by the
large language model in parallel. A subset of the tokens in the draft are
accepted (and the rest rejected) based on a statistical method to guarantee
that the final output follows the distribution of the large model. In this
work, we provide a principled understanding of speculative decoding through the
lens of optimal transport (OT) with $\textit{membership cost}$. This framework
can be viewed as an extension of the well-known $\textit{maximal-coupling}$
problem. This new formulation enables us to generalize the speculative decoding
method to allow for a set of $k$ candidates at the token-level, which leads to
an improved optimal membership cost. We show that the optimal draft selection
algorithm (transport plan) can be computed via linear programming, whose
best-known runtime is exponential in $k$. We then propose a valid draft
selection algorithm whose acceptance probability is $(1-1/e)$-optimal
multiplicatively. Moreover, it can be computed in time almost linear in the size
of the domain of a single token. Using this new draft selection algorithm, we
develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which
provides speedup in decoding while ensuring that there is no quality
degradation in the decoded output. We experimentally demonstrate that for
state-of-the-art large language models, the proposed approach achieves a wall
clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on
standard benchmarks.
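To make the acceptance rule concrete, the following Python/NumPy sketch implements single-draft speculative sampling at the token level, plus a naive sequential extension to $k$ i.i.d. drafts. It is an illustration under stated assumptions, not the paper's code: `p` and `q` stand for the large-model and draft-model next-token distributions, and the sequential loop is a simple valid coupling, not the $(1-1/e)$-optimal selection algorithm proposed in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def residual(p, q):
    """Leftover target distribution after a rejected draft from q."""
    r = np.maximum(p - q, 0.0)
    return r / r.sum()

def speculative_token(p, q):
    """Single-draft speculative sampling: output is exactly distributed as p."""
    x = rng.choice(len(q), p=q)                  # draft token from the small model
    if rng.random() < min(1.0, p[x] / q[x]):     # accept w.p. min(1, p(x)/q(x))
        return x
    return rng.choice(len(p), p=residual(p, q))  # else sample the residual

def sequential_k_draft_token(p, q, k):
    """Naive sequential selection over k i.i.d. drafts from q.

    Each stage couples q with the current residual target, so the output
    is still exactly distributed as p; SpecTr's selection rule achieves a
    provably higher acceptance probability than this simple loop.
    """
    target = p.copy()
    for _ in range(k):
        x = rng.choice(len(q), p=q)
        if rng.random() < min(1.0, target[x] / q[x]):
            return x
        target = residual(target, q)
    return rng.choice(len(p), p=target)

# Sanity check: empirical output frequencies should match p for both schemes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
samples = [sequential_k_draft_token(p, q, k=4) for _ in range(50_000)]
print(np.bincount(samples, minlength=3) / len(samples))  # ~ [0.5, 0.3, 0.2]
```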
Related papers
- SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference [9.143856130336783]
SuffixDecoding is a model-free approach to accelerating large language model (LLM) inference through speculative decoding.
Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models.
For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding.
arXiv Detail & Related papers (2024-11-07T18:49:33Z)
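As a rough illustration of the model-free idea above (not the paper's implementation, which uses a proper suffix tree over prior outputs), one can index previously generated token sequences by their recent context and propose the stored continuation as a draft; the class and parameter names below are hypothetical.
```python
from collections import defaultdict

class SuffixDraftIndex:
    """Toy suffix-match drafter: proposes continuations seen after the
    same recent context in earlier outputs."""

    def __init__(self, context_len=4):
        self.context_len = context_len
        self.table = defaultdict(list)  # context n-gram -> continuations

    def observe(self, tokens):
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.table[key].append(tokens[i + n:i + n + 8])  # store a short continuation

    def draft(self, context):
        key = tuple(context[-self.context_len:])
        candidates = self.table.get(key)
        if not candidates:
            return []  # no speculation; fall back to normal decoding
        # Heuristic: propose the longest previously seen continuation.
        return max(candidates, key=len)

# Usage: index a previously generated sequence, then draft for a new prompt.
idx = SuffixDraftIndex()
idx.observe([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(idx.draft([0, 1, 2, 3, 4]))  # -> [5, 6, 7, 8, 9, 10]
```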
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of $1.73\times$ to $1.96\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
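For intuition on why a graph helps: merging drafts that share prefixes deduplicates tokens before the single batched verification pass. The prefix-trie sketch below illustrates that bookkeeping (the paper's DAG is more general, as it can also merge re-converging hypotheses); all names are illustrative.
```python
def merge_drafts(drafts):
    """Merge token drafts into a prefix trie so shared prefixes are
    verified once. Returns the list of unique (prefix) states that a
    single batched forward pass would need to score."""
    root = {}
    for draft in drafts:
        node = root
        for tok in draft:
            node = node.setdefault(tok, {})
    # Flatten the trie into unique verification states.
    states, stack = [], [((), root)]
    while stack:
        prefix, node = stack.pop()
        for tok, child in node.items():
            states.append(prefix + (tok,))
            stack.append((prefix + (tok,), child))
    return states

drafts = [[5, 7, 2], [5, 7, 9], [5, 1, 2]]
states = merge_drafts(drafts)
print(len(states), "unique states instead of", sum(map(len, drafts)))  # 6 vs 9
```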
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
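For intuition on the input-perturbation approach, here is a generic Gaussian-mechanism sketch for releasing pairwise cosine similarities; it is not the paper's refined algorithm, and the L2-sensitivity bound used to calibrate the noise is deliberately conservative.
```python
import numpy as np

def private_cosine_similarities(A, eps, delta, rng=np.random.default_rng(1)):
    """Gaussian mechanism on the pairwise cosine-similarity matrix.

    Generic input-perturbation sketch, not the paper's algorithm: rows of A
    are normalized, and changing one unit-norm row moves row/column i of the
    n x n similarity matrix by at most 2 per entry, giving the conservative
    L2 sensitivity 2 * sqrt(2 * (n - 1)).
    """
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    n = A.shape[0]
    S = A @ A.T
    sensitivity = 2.0 * np.sqrt(2.0 * (n - 1))
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noise = rng.normal(0.0, sigma, size=(n, n))
    noise = np.triu(noise) + np.triu(noise, 1).T  # keep the output symmetric
    return S + noise

A = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(private_cosine_similarities(A, eps=0.5, delta=1e-6).round(2))
```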
- Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass [72.07642648108849]
Superposed Decoding is a new decoding algorithm that generates $k$ drafts at the cost of one autoregressive inference pass.
Superposed Decoding can be combined with other decoding strategies, resulting in universal coverage gains when scaling inference time compute.
arXiv Detail & Related papers (2024-05-28T17:40:48Z)
- Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{\mathrm{KL}}[Q \Vert P] + o(1))\big/\epsilon\big)$.
arXiv Detail & Related papers (2024-05-07T14:44:41Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference [17.947904697850433]
We present SkipDecode, a token-level early exit method for batch inferencing and Key-Value caching.
It overcomes prior constraints by setting up a single exit point for every token in a batch at each sequence position.
It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens.
arXiv Detail & Related papers (2023-07-05T19:59:09Z)
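A sketch of the scheduling idea (illustrative, not the released implementation): give every sequence position an exit layer that never increases with position, so the KV-cache entries a later token attends to were already computed for earlier tokens.
```python
def exit_layer_schedule(num_layers, max_pos, min_exit, max_exit):
    """Monotonically non-increasing exit layer per sequence position.

    Because position t+1 never exits later than position t, the KV-cache
    entries needed by later tokens were already computed for earlier ones,
    so nothing has to be recomputed. A linear decay is one simple choice.
    """
    assert 0 < min_exit <= max_exit <= num_layers
    schedule = []
    for pos in range(max_pos):
        frac = pos / max(max_pos - 1, 1)
        schedule.append(round(max_exit - frac * (max_exit - min_exit)))
    return schedule

# Every token at batch position t exits at the same layer, enabling batching.
print(exit_layer_schedule(num_layers=24, max_pos=8, min_exit=6, max_exit=20))
# -> [20, 18, 16, 14, 12, 10, 8, 6]
```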
- Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
arXiv Detail & Related papers (2022-10-27T05:52:35Z)
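The rule is simple to state: a token survives truncation only if its probability exceeds $\eta = \min(\epsilon, \sqrt{\epsilon}\, e^{-h})$, where $h$ is the entropy of the next-token distribution and $\epsilon$ is a small hyperparameter. A minimal sketch, following our reading of the paper:
```python
import numpy as np

def eta_sample(p, epsilon=2e-3, rng=np.random.default_rng(2)):
    """Eta-sampling: drop tokens whose probability falls below the
    entropy-dependent threshold eta = min(eps, sqrt(eps) * exp(-H(p))),
    then renormalize and sample."""
    entropy = -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))
    truncated = np.where(p > eta, p, 0.0)
    truncated /= truncated.sum()
    return rng.choice(len(p), p=truncated)

# With epsilon=0.02 the three rarest tokens fall below eta and are pruned.
p = np.array([0.90, 0.05, 0.03, 0.01, 0.005, 0.005])
print(eta_sample(p, epsilon=0.02))
```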
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)