Related papers: Mamba Drafters for Speculative Decoding

Mamba Drafters for Speculative Decoding

URL: http://arxiv.org/abs/2506.01206v1
Date: Sun, 01 Jun 2025 22:52:47 GMT
Title: Mamba Drafters for Speculative Decoding
Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati,
Abstract summary: We introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM)<n>By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods.<n>We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates.
Score: 58.080550222549064
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.

Related papers

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [50.19188692497892]
Traditional alignment methods often require retraining large pretrained models.<n>We propose a novel textitResidual Alignment Model (textitRAM) that formalizes the alignment process as a type of importance sampling.<n>We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z)
Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection.<n>Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball.<n>We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space.<n>LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space.<n>LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging [75.93960998357812]
Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their capabilities across different tasks and domains.<n>Current model merging techniques focus on merging all available models simultaneously, with weight matrices-based methods being the predominant approaches.<n>We propose a training-free projection-based continual merging method that processes models sequentially.
arXiv Detail & Related papers (2025-01-16T13:17:24Z)
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z)
Improving Multi-candidate Speculative Decoding [1.6291177798903276]
Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs)<n>In this work, we introduce a new version of MCSD that includes a target model multi-candidate generation.<n>We also evaluate the effects of using the target model multi-candidate process with different draft models on output quality.
arXiv Detail & Related papers (2024-09-16T18:20:38Z)
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models [32.68002253527712]
We introduce a novel multi-target scenario for the deployment of draft models for faster inference. We present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings.
arXiv Detail & Related papers (2024-07-02T05:14:15Z)
Evolutionary Optimization of Model Merging Recipes [21.41838972039297]
Large language models (LLMs) have become increasingly capable, but their development often requires substantial computational resources.<n>Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models.<n>This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition.
arXiv Detail & Related papers (2024-03-19T22:56:53Z)
Recurrent Drafter for Fast Speculative Decoding in Large Language Models [18.342742904042673]
We present Recurrent Drafter, an advanced speculative decoding approach.<n>It achieves state-of-the-art speedup for large language models (LLMs) inference.
arXiv Detail & Related papers (2024-03-14T23:40:56Z)
Model Selection for Bayesian Autoencoders [25.619565817793422]
We propose to optimize the distributional sliced-Wasserstein distance between the output of the autoencoder and the empirical data distribution. We turn our BAE into a generative model by fitting a flexible Dirichlet mixture model in the latent space. We evaluate our approach qualitatively and quantitatively using a vast experimental campaign on a number of unsupervised learning tasks and show that, in small-data regimes where priors matter, our approach provides state-of-the-art results.
arXiv Detail & Related papers (2021-06-11T08:55:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.