Related papers: TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

URL: http://arxiv.org/abs/2601.07353v1
Date: Mon, 12 Jan 2026 09:26:45 GMT
Title: TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees
Authors: Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun,
Abstract summary: Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. We introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
Score: 18.53532655905144
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to early stop on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
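The budget-driven idea described in the abstract can be illustrated with a minimal sketch. This is not TALON's hybrid expansion strategy, which the abstract does not spell out; it is a generic best-first expansion that always adds the frontier node with the highest joint draft probability until a node budget is spent. The names `build_draft_tree`, `top`, `confident`, `uncertain`, and the parameter `top_k` are hypothetical, and the "draft model" is a toy function returning a fixed distribution.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]
# Hypothetical draft-model interface: token prefix -> {token: probability}.
DraftFn = Callable[[Path], Dict[str, float]]


def top(dist: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k most probable (token, probability) pairs."""
    return sorted(dist.items(), key=lambda kv: -kv[1])[:k]


def build_draft_tree(draft: DraftFn, budget: int, top_k: int = 3) -> List[Tuple[Path, float]]:
    """Grow a draft tree best-first until `budget` nodes have been added.

    Returns (path, joint_probability) for each node, in the order added.
    A node only enters the frontier after its parent is added, so the
    result is always a connected tree.
    """
    nodes: List[Tuple[Path, float]] = []
    # heapq is a min-heap, so store negated joint probabilities.
    frontier = [(-p, (tok,)) for tok, p in top(draft(()), top_k)]
    heapq.heapify(frontier)
    while frontier and len(nodes) < budget:
        neg_p, path = heapq.heappop(frontier)
        nodes.append((path, -neg_p))
        for tok, p in top(draft(path), top_k):
            heapq.heappush(frontier, (neg_p * p, path + (tok,)))
    return nodes


# Toy draft models (assumptions for illustration only).
def confident(prefix: Path) -> Dict[str, float]:
    return {"a": 0.9, "b": 0.05, "c": 0.05}


def uncertain(prefix: Path) -> Dict[str, float]:
    return {"a": 0.34, "b": 0.33, "c": 0.33}
```

With a budget of 4 nodes, `build_draft_tree(confident, 4)` yields a single chain of depth 4, while `build_draft_tree(uncertain, 4)` spends three nodes on the first layer and reaches only depth 2. This mirrors the "deep-and-narrow" versus "shallow-and-wide" shapes the abstract describes, under the assumptions stated above.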

Related papers

Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees [50.230925890958936]
We propose an adjacency-adaptive dynamic draft tree that adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree achieves speedups of 3.13x and 3.05x, respectively, and integrates seamlessly with relaxed sampling methods such as LANTERN.
arXiv Detail & Related papers (2025-12-26T04:45:49Z)
ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search [77.55575655986252]
ProtInvTree is a reward-guided tree-search framework for protein inverse folding. It reformulates sequence generation as a deliberate, step-wise decision-making process. It supports flexible test-time scaling by expanding the search depth and breadth without retraining.
arXiv Detail & Related papers (2025-06-01T09:34:20Z)
Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization [68.07464514094299]
Existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. We introduce Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality.
arXiv Detail & Related papers (2025-04-03T17:57:52Z)
Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z)
RASD: Retrieval-Augmented Speculative Decoding [5.3926068062773895]
Speculative decoding accelerates inference in large language models (LLMs). This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding.
arXiv Detail & Related papers (2025-03-05T12:10:14Z)
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure [40.9990864658776]
Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step. Existing methods mainly adopt fixed draft structures, which fail to adapt to different situations. We propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees.
arXiv Detail & Related papers (2024-06-25T04:45:53Z)
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement [11.91629418177851]
Speculative decoding is an inference acceleration method for large language models. Recent works have advanced this method by establishing a draft-token tree. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement.
arXiv Detail & Related papers (2024-02-21T22:57:49Z)
RLET: A Reinforcement Learning Based Approach for Explainable QA with Entailment Trees [47.745218107037786]
We propose RLET, a Reinforcement Learning based Entailment Tree generation framework. RLET iteratively performs single-step reasoning with sentence selection and deduction generation modules. Experiments on three settings of the EntailmentBank dataset demonstrate the strength of using an RL framework.
arXiv Detail & Related papers (2022-10-31T06:45:05Z)
Unsupervised Inference of Data-Driven Discourse Structures using a Tree Auto-Encoder [30.615883375573432]
We propose a new strategy to generate tree structures in a task-agnostic, unsupervised fashion by extending a latent tree induction framework with an auto-encoding objective. The proposed approach can be applied to any tree-structured objective, such as syntactic parsing, discourse parsing and others.
arXiv Detail & Related papers (2022-10-18T03:28:39Z)
Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
