Related papers: Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

URL: http://arxiv.org/abs/2410.18234v1
Date: Wed, 23 Oct 2024 19:28:34 GMT
Title: Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits
Authors: Ashish Khisti, M. Reza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos,
Abstract summary: We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. We show that the optimal scheme can be decomposed into a two-step solution.
Score: 26.220189807865548
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motives a new class of token-level selection scheme based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.

Related papers

Efficient Constraint-Aware Flow Matching via Randomized Exploration [10.556484074223258]
We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints.<n>We propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples.<n>We numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions.
arXiv Detail & Related papers (2025-08-18T19:02:02Z)
Gumbel-max List Sampling for Distribution Coupling with Multiple Samples [19.059328123272028]
We develop a new mechanism for speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer.<n>We consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders.
arXiv Detail & Related papers (2025-06-05T23:32:08Z)
Speculative Decoding for Multi-Sample Inference [21.64693536216534]
We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios. Our method exploits the intrinsic consensus of parallel generation paths to synthesize high-quality draft tokens.
arXiv Detail & Related papers (2025-03-07T11:15:36Z)
Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference [35.730941605490194]
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens. This paper explores the novel integration of speculative decoding with beam sampling.
arXiv Detail & Related papers (2024-09-25T02:20:42Z)
An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to model potentially non-monotonic preferences. We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration. Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z)
Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
Domain Generalization via Rationale Invariance [70.32415695574555]
This paper offers a new perspective to ease the challenge of domain generalization, which involves maintaining robust results even in unseen environments. We propose treating the element-wise contributions to the final results as the rationale for making a decision and representing the rationale for each sample as a matrix. Our experiments demonstrate that the proposed approach achieves competitive results across various datasets, despite its simplicity.
arXiv Detail & Related papers (2023-08-22T03:31:40Z)
Optimizing model-agnostic Random Subspace ensembles [5.680512932725364]
We present a model-agnostic ensemble approach for supervised learning. The proposed approach alternates between learning an ensemble of models using a parametric version of the Random Subspace approach. We show the good performance of the proposed approach, both in terms of prediction and feature ranking, on simulated and real-world datasets.
arXiv Detail & Related papers (2021-09-07T13:58:23Z)
Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.