Coupling without Communication and Drafter-Invariant Speculative Decoding
- URL: http://arxiv.org/abs/2408.07978v2
- Date: Mon, 19 Aug 2024 05:04:38 GMT
- Title: Coupling without Communication and Drafter-Invariant Speculative Decoding
- Authors: Majid Daliri, Christopher Musco, Ananda Theertha Suresh
- Abstract summary: Communication-free protocols yield a variant of speculative decoding that we call Drafter-Invariant Speculative Decoding.
- Score: 21.19028671377106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to generate a sample $a\sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance. What if Alice and Bob must solve this same problem without communicating at all? Perhaps surprisingly, with access to public randomness, they can still achieve $Pr[a=b] \geq \frac{1-D_{TV}(P,Q)}{1+D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$. In fact, this bound can be obtained using a simple protocol based on the Weighted MinHash algorithm. In this work, we explore the communication-free coupling problem in greater depth. First, we show that an equally simple protocol based on Gumbel sampling matches the worst-case guarantees of the Weighted MinHash approach, but tends to perform better in practice. Conversely, we prove that both approaches are actually sharp: no communication-free protocol can achieve $Pr[a=b]>\frac{1-D_{TV}(P,Q)}{1+D_{TV}(P,Q)}$ in the worst case. Finally, we prove that, for distributions over $n$ items, there exists a scheme that uses just $O(\log(n/\epsilon))$ bits of communication to achieve $Pr[a = b] = 1 - D_{TV}(P,Q) - \epsilon$, i.e., to essentially match optimal coupling. Beyond our theoretical results, we demonstrate an application of communication-free coupling to speculative decoding, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols yield a variant of speculative decoding that we call Drafter-Invariant Speculative Decoding, which has the desirable property that the output of the method is fixed given a fixed random seed, regardless of what drafter is used for speculation.
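The two protocols named in the abstract are simple enough to state in a few lines of code. Below is a minimal NumPy sketch, assuming both parties derive identical noise from a shared public seed; the function names and the Monte Carlo check are illustrative additions, not the authors' reference implementation.

```python
import numpy as np

def gumbel_coupled_sample(p, shared_rng):
    """Gumbel-max sample from p using shared (public) randomness.

    If Alice and Bob call this with identically seeded generators but
    different distributions, they output the same index with probability
    at least (1 - D_TV) / (1 + D_TV) in the worst case.
    """
    g = shared_rng.gumbel(size=len(p))        # public Gumbel noise
    return int(np.argmax(np.log(p) + g))      # assumes p > 0 entrywise

def minhash_coupled_sample(p, shared_rng):
    """Weighted-MinHash-style protocol: shared exponential clocks scaled
    by 1/p_i; argmin_i E_i / p_i is also an exact sample from p and gives
    the same worst-case agreement guarantee.
    """
    e = shared_rng.exponential(size=len(p))   # public exponential noise
    return int(np.argmin(e / p))              # assumes p > 0 entrywise

# Monte Carlo check of the agreement bound on two random distributions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(20))
Q = rng.dirichlet(np.ones(20))
d_tv = 0.5 * np.abs(P - Q).sum()

trials, agree = 20_000, 0
for t in range(trials):
    # Alice and Bob independently re-derive the same public randomness
    # from a common per-round seed; no communication takes place.
    a = gumbel_coupled_sample(P, np.random.default_rng(t))
    b = gumbel_coupled_sample(Q, np.random.default_rng(t))
    agree += (a == b)

print(f"D_TV(P,Q)         = {d_tv:.3f}")
print(f"empirical Pr[a=b] = {agree / trials:.3f}")
print(f"lower bound       = {(1 - d_tv) / (1 + d_tv):.3f}")
print(f"optimal coupling  = {1 - d_tv:.3f}")
```

The drafter-invariance claim follows the same pattern applied per token position. The toy step below (our construction, not the paper's full batched algorithm) keys the shared noise to the step index, so the emitted token depends only on the seed and the target model's distribution; the drafter's guess decides only whether a speculative step is accepted.

```python
def drafter_invariant_step(p_target, q_draft, step, seed=1234):
    """One toy position of drafter-invariant speculative decoding."""
    shared = np.random.default_rng([seed, step])   # public noise keyed to the step
    g = shared.gumbel(size=len(p_target))
    guess = int(np.argmax(np.log(q_draft) + g))    # drafter's proposal
    token = int(np.argmax(np.log(p_target) + g))   # token actually emitted
    return token, guess == token                   # (output, was the draft accepted?)
```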
Related papers
- Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{KL}[Q \Vert P] + o(1))\big/\epsilon\big)$.
arXiv Detail & Related papers (2024-05-07T14:44:41Z)
- Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages [63.366380571397]
We study the problem of private vector mean estimation in the shuffle model of privacy, where $n$ users each have a unit vector $v^{(i)} \in \mathbb{R}^d$.
We propose a new multi-message protocol that achieves the optimal error using $\tilde{\mathcal{O}}\left(\min(n\varepsilon^2, d)\right)$ messages per user.
arXiv Detail & Related papers (2024-04-16T00:56:36Z)
- Sample-Optimal Locally Private Hypothesis Selection and the Provable Benefits of Interactivity [8.100854060749212]
We study the problem of hypothesis selection under the constraint of local differential privacy.
We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k \log k}{\alpha^2 \min\{\varepsilon^2, 1\}}\right)$ samples to guarantee that $d_{TV}(h, \hat{f}) \leq \alpha + 9 \min_{f \in \mathcal{F}} d_{TV}(h, f)$.
arXiv Detail & Related papers (2023-12-09T19:22:10Z)
- Replicability in Reinforcement Learning [46.89386344741442]
We focus on the fundamental setting of discounted MDPs with access to a generative model.
Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions.
arXiv Detail & Related papers (2023-05-31T05:16:23Z)
- Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
- Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to that of the best distribution in $\mathcal{Q}$.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
- Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)
- Capacity Approaching Coding for Low Noise Interactive Quantum Communication, Part I: Large Alphabets [15.078027648304115]
We consider the problem of implementing two-party interactive quantum communication over noisy channels.
For a noiseless qudit channel over a $\mathrm{poly}(n)$-size alphabet, our main result is a simulation method that fails with probability less than $2^{-\Theta(n\epsilon)}$.
We conjecture that it is optimal up to a constant factor in the $\sqrt{\epsilon}$ term.
arXiv Detail & Related papers (2020-01-09T02:48:43Z)