STree: Speculative Tree Decoding for Hybrid State-Space Models
- URL: http://arxiv.org/abs/2505.14969v2
- Date: Mon, 27 Oct 2025 21:48:48 GMT
- Title: STree: Speculative Tree Decoding for Hybrid State-Space Models
- Authors: Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto
- Abstract summary: We propose a scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. Along with the algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs.
- Score: 41.65137016153309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in a sliding-window context. However, their state can also comprise thousands of tokens, so speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs, even with a baseline drafting model and tree structure, on three different benchmarks, opening up opportunities for further speed-ups in SSM and hybrid model inference. Code can be found at: https://github.com/wyc1997/stree.
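To make the tree-verification idea concrete, below is a minimal sketch of how per-node states for a drafted token tree could be computed from accumulated state transitions. This is not the authors' hardware-aware STree implementation: it assumes a Mamba-style SSM layer with diagonal transitions (so transition-matrix products reduce to elementwise products), processes nodes with a Python loop rather than a fused kernel, and the helper name tree_ssm_states and all tensor shapes are illustrative.

```python
# Minimal sketch (assumption-laden, not the authors' STree kernel): computing
# hidden states for every node of a drafted token tree, for an SSM layer with
# a DIAGONAL transition (Mamba-style), so products of transition matrices
# reduce to elementwise products.
# Recurrence assumed:  h_t = a_t * h_{t-1} + b_t * x_t  (per channel).
# Each node's state depends only on its root-to-node path, so it can be written
# as  h_node = (product of a's along the path) * h0 + (accumulated input term),
# which is the "accumulated state transition" structure the abstract refers to.
import torch

def tree_ssm_states(h0, a, bx, parent):
    """Hypothetical helper.
    h0:     (d,)   state after the last accepted token
    a:      (n, d) per-node diagonal transition coefficients
    bx:     (n, d) per-node input contributions b_t * x_t
    parent: length-n list, parent[i] < i (parent-first order), -1 for roots
    Returns an (n, d) tensor with the SSM state at every drafted node.
    """
    n, d = a.shape
    acc = torch.empty(n, d)  # accumulated transition product along the path
    res = torch.empty(n, d)  # accumulated input contribution along the path
    for i in range(n):
        if parent[i] == -1:                 # child of the last accepted token
            acc[i], res[i] = a[i], bx[i]
        else:
            acc[i] = acc[parent[i]] * a[i]
            res[i] = a[i] * res[parent[i]] + bx[i]
    return acc * h0 + res                   # broadcasts h0 over the n nodes

# Toy usage: a tree with two children of the root token and one grandchild.
h0 = torch.zeros(8)
a, bx = torch.rand(4, 8), torch.randn(4, 8)
parent = [-1, 0, 0, 1]
print(tree_ssm_states(h0, a, bx, parent).shape)  # torch.Size([4, 8])
```

The per-node states could then be fed to the output head to score each drafted token and run standard tree verification; a hardware-aware version would presumably batch the accumulation across nodes rather than looping in Python.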
Related papers
- Towards Understanding What State Space Models Learn About Code [5.605881212882263]
State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. We present the first systematic analysis of what SSM-based code models actually learn and perform the first comparative analysis of SSM- and Transformer-based code models.
arXiv Detail & Related papers (2026-02-06T15:29:46Z)
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
- Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models [68.31088463716269]
We propose a structured sparse parametrization of transition matrices in state-space models (SSMs). Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). The model significantly outperforms a wide collection of modern SSM variants on various FSA state-tracking tasks.
arXiv Detail & Related papers (2025-09-26T12:46:30Z)
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models [0.8474310104568011]
Structured State Space models (SSMs) have emerged as a new class of deep learning models. Quantization-aware training (QAT) can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We show that QAT enhances robustness to analog noise and enables structural pruning.
arXiv Detail & Related papers (2025-07-08T15:19:14Z)
- Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation [0.0]
We propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from Abstract Syntax Trees (ASTs). These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall.
arXiv Detail & Related papers (2025-07-05T11:07:47Z)
- SparseSSM: Efficient Selective Structured State Space Models Can Be Pruned in One-Shot [8.080568103779893]
State-space language models such as Mamba match Transformer quality while permitting linear-complexity inference. Existing one-shot pruning methods are tailored to attention blocks and fail to account for the time-shared and discretized state-transition matrix. We introduce SparseSSM, the first training-free pruning framework that extends the classic optimal brain surgeon (OBS) framework to state-space architectures.
arXiv Detail & Related papers (2025-06-11T11:14:57Z)
- Sparsified State-Space Models are Efficient Highway Networks [52.29954079160793]
State-space models (SSMs) offer an alternative to Transformers by replacing expensive self-attention with linear recurrences. We propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Simba is a hierarchical sparsification method for SSMs based on token pruning.
arXiv Detail & Related papers (2025-05-27T04:07:23Z)
- CodeSSM: Towards State Space Models for Code Understanding [1.8838588087156363]
State Space Models (SSMs) are a potential alternative to transformers for code understanding tasks. SSMs are more compute-efficient than transformers. We show that SSMs are also more sample-efficient and can effectively extrapolate to longer contexts.
arXiv Detail & Related papers (2025-05-02T14:27:49Z)
- Enhanced Structured State Space Models via Grouped FIR Filtering and Attention Sink Mechanisms [0.6718184400443239]
We propose an advanced architecture that mitigates key challenges by decomposing the A-matrix multiplications into multiple groups.
Inspired by the "attention sink" phenomenon identified in streaming language models, we incorporate a similar mechanism to enhance the stability and performance of our model.
arXiv Detail & Related papers (2024-08-01T02:49:58Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Differentiable Tree Operations Promote Compositional Generalization [106.59434079287661]
The Differentiable Tree Machine (DTM) architecture integrates an interpreter with external memory and an agent that learns to sequentially select tree operations.
DTM achieves 100% accuracy while existing baselines such as Transformer, Tree Transformer, LSTM, and Tree2Tree LSTM achieve less than 30%.
arXiv Detail & Related papers (2023-06-01T14:46:34Z)
- Single MCMC Chain Parallelisation on Decision Trees [0.9137554315375919]
We propose a method to parallelise a single MCMC decision tree chain on an average laptop or personal computer.
Experiments showed an 18-fold reduction in running time, with the serial and parallel implementations remaining statistically identical.
arXiv Detail & Related papers (2022-07-26T07:07:51Z)
- SIT3: Code Summarization with Structure-Induced Transformer [48.000063280183376]
We propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly effective structure modeling.
Our newly-proposed model achieves new state-of-the-art results on popular benchmarks.
arXiv Detail & Related papers (2020-12-29T11:37:43Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)