Related papers: Speculative Decoding for Multi-Sample Inference

URL: http://arxiv.org/abs/2503.05330v1
Date: Fri, 07 Mar 2025 11:15:36 GMT
Title: Speculative Decoding for Multi-Sample Inference
Authors: Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Abstract summary: We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios. Our method exploits the intrinsic consensus of parallel generation paths to synthesize high-quality draft tokens.
Score: 21.64693536216534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios, such as self-consistency and Best-of-N sampling. Our method exploits the intrinsic consensus of parallel generation paths to synthesize high-quality draft tokens without requiring auxiliary models or external databases. By dynamically analyzing structural patterns across parallel reasoning paths through a probabilistic aggregation mechanism, it identifies consensus token sequences that align with the decoding distribution. Evaluations on mathematical reasoning benchmarks demonstrate a substantial improvement in draft acceptance rates over baselines, while reducing the latency in draft token construction. This work establishes a paradigm shift for efficient multi-sample inference, enabling seamless integration of speculative decoding with sampling-based reasoning techniques.

Related papers

Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model. We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
KL-geodesics flow matching with a novel sampling scheme [4.347494885647007]
Non-autoregressive language models generate all tokens simultaneously, offering potential speed advantages over traditional autoregressive models. We investigate a conditional flow matching approach for text generation.
arXiv Detail & Related papers (2024-11-25T17:15:41Z)
Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems. We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers. We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art. In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding [43.659680579686544]
We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator.
arXiv Detail & Related papers (2023-10-09T05:53:05Z)
Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates these issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z)
Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
arXiv Detail & Related papers (2022-10-18T22:19:41Z)
BOSS: Bidirectional One-Shot Synthesis of Adversarial Examples [8.359029046999233]
A one-shot synthesis of adversarial examples is proposed in this paper. The inputs are synthesized from scratch to induce arbitrary soft predictions at the output of pre-trained models. We demonstrate the generality and versatility of the framework and approach proposed through applications to the design of targeted adversarial attacks.
arXiv Detail & Related papers (2021-08-05T17:43:36Z)
Parallel and Multi-Objective Falsification with Scenic and VerifAI [11.152087017964584]
We present extensions to the Scenic scenario specification language and VerifAI toolkit. We first present a parallelized framework that is interfaced with both the simulation and sampling capabilities of Scenic. We then present an extension of VerifAI's falsification algorithms to support multi-objective optimization during sampling.
arXiv Detail & Related papers (2021-07-09T01:08:49Z)
Learning Proposals for Probabilistic Programs with Inference Combinators [9.227032708135617]
We develop operators for construction of proposals in probabilistic programs. Proposals in inference samplers can be parameterized using neural networks. We demonstrate the flexibility of this framework by implementing advanced variational methods.
arXiv Detail & Related papers (2021-03-01T00:17:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.