Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees
- URL: http://arxiv.org/abs/2512.21857v1
- Date: Fri, 26 Dec 2025 04:45:49 GMT
- Title: Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees
- Authors: Haodong Lei, Hongsong Wang, Xin Geng, Liang Wang, Pan Zhou
- Abstract summary: We propose an adjacency-adaptive dynamic draft tree that adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree achieves speedups of 3.13x and 3.05x, respectively, and integrates seamlessly with relaxed sampling methods such as LANTERN.
- Score: 50.230925890958936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative decoding to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive draft tree that dynamically adjusts its depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth and width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones. Empirical evaluations on MS-COCO 2017 and PartiPrompts demonstrate that ADT-Tree achieves speedups of 3.13x and 3.05x, respectively. Moreover, it integrates seamlessly with relaxed sampling methods such as LANTERN, enabling further acceleration. Code is available at https://github.com/Haodong-Lei-Ray/ADT-Tree.
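The core idea in the abstract, growing the draft tree deeper where tokens are easy and wider where they are hard, can be sketched as a small heuristic. This is an illustrative simplification only: the function name, thresholds, and defaults below are hypothetical and are not taken from the paper, which uses horizontal-adjacency initialization and bisectional adaptation rather than fixed cutoffs.

```python
def adapt_tree_shape(prev_accept_rate, base_depth=5, base_width=4,
                     min_size=2, max_size=8):
    """Pick draft-tree depth and width from the prior acceptance rate.

    High acceptance (an 'easy' image region) favors a deeper, narrower
    tree; low acceptance (a 'hard' region) favors a shallower, wider
    tree. All thresholds and defaults here are illustrative only.
    """
    if prev_accept_rate >= 0.8:    # easy region: speculate further ahead
        depth = min(max_size, base_depth + 2)
        width = max(min_size, base_width - 1)
    elif prev_accept_rate <= 0.4:  # hard region: hedge with more branches
        depth = max(min_size, base_depth - 2)
        width = min(max_size, base_width + 2)
    else:                          # middling region: keep the defaults
        depth, width = base_depth, base_width
    return depth, width
```

For example, a region accepting 90% of drafted tokens would get a 7-deep, 3-wide tree, while a region accepting only 30% would get a 3-deep, 6-wide tree.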
Related papers
- SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding [15.734450444255787]
Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models. Existing methods rely on static tree structures that remain fixed throughout the decoding process. We propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty.
arXiv Detail & Related papers (2026-01-31T05:35:40Z) - TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees [18.53532655905144]
Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. We introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. TALON consistently outperforms the state-of-the-art Eagle-3, achieving up to a 5.16x end-to-end speedup over autoregressive decoding.
arXiv Detail & Related papers (2026-01-12T09:26:45Z) - VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree [21.721087343852158]
Video anomaly detection (VAD) focuses on identifying anomalies in videos. We propose VADTree, which utilizes a Hierarchical Granularity Tree structure for flexible sampling in VAD. VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments.
arXiv Detail & Related papers (2025-10-26T14:36:15Z) - Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization [68.07464514094299]
Existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. We introduce Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality.
arXiv Detail & Related papers (2025-04-03T17:57:52Z) - LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models [31.1717739483817]
We introduce LANTERN++, a framework that integrates static tree drafting with a tailored relaxed acceptance condition. Experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to a 2.56x speedup over standard AR decoding.
arXiv Detail & Related papers (2025-02-10T11:05:18Z) - Autoregressive Generation of Static and Growing Trees [49.93294993975928]
We propose a transformer architecture and training strategy for tree generation. The architecture processes data at multiple resolutions and has an hourglass shape, with middle layers processing fewer tokens than outer layers. We extend this approach to perform image-to-tree and point-cloud-to-tree conditional generation and to simulate tree growth processes, generating 4D trees.
arXiv Detail & Related papers (2025-02-07T08:51:14Z) - OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure [40.9990864658776]
Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step. Existing methods mainly adopt fixed draft structures, which fail to adapt to different situations. We propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees.
arXiv Detail & Related papers (2024-06-25T04:45:53Z) - ViTree: Single-path Neural Tree for Step-wise Interpretable Fine-grained Visual Categorization [56.37520969273242]
We introduce ViTree, a novel approach for fine-grained visual categorization.
By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions.
This patch and path selectivity enhances model interpretability of ViTree, enabling better insights into the model's inner workings.
arXiv Detail & Related papers (2024-01-30T14:32:25Z) - Social Interpretable Tree for Pedestrian Trajectory Prediction [75.81745697967608]
We propose a tree-based method, termed as Social Interpretable Tree (SIT), to address this multi-modal prediction task.
A path in the tree from the root to leaf represents an individual possible future trajectory.
Despite the hand-crafted tree, the experimental results on ETH-UCY and Stanford Drone datasets demonstrate that our method is capable of matching or exceeding the performance of state-of-the-art methods.
arXiv Detail & Related papers (2022-05-26T12:18:44Z) - Growing Deep Forests Efficiently with Soft Routing and Learned Connectivity [79.83903179393164]
This paper further extends the deep forest idea in several important aspects.
We employ a probabilistic tree whose nodes make probabilistic routing decisions, a.k.a., soft routing, rather than hard binary decisions.
Experiments on the MNIST dataset demonstrate that our empowered deep forests can achieve performance better than or comparable to [1] and [3].
arXiv Detail & Related papers (2020-12-29T18:05:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.