Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding
- URL: http://arxiv.org/abs/2509.11961v1
- Date: Mon, 15 Sep 2025 14:16:51 GMT
- Title: Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding
- Authors: Mingxiao Huo, Jiayi Zhang, Hewei Wang, Jinfeng Xu, Zheyu Chen, Huilin Tai, Yijun Chen,
- Abstract summary: Spec-LLaVA is a system that applies speculative decoding to accelerate Vision-Language Models without sacrificing output quality.<n>On MS out-of-domain images, Spec-LLaVA achieves up to 3.28$times$ faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality.
- Score: 14.571291239004225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer from slow autoregressive inference, limiting their deployment in real-time applications. We introduce Spec-LLaVA, a system that applies speculative decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA pairs a lightweight draft VLM with a large target model: the draft speculates future tokens, which the target verifies in parallel, allowing multiple tokens to be generated per step. To maximize efficiency, we design a dynamic tree-based verification algorithm that adaptively expands and prunes speculative branches using draft model confidence. On MS COCO out-of-domain images, Spec-LLaVA achieves up to 3.28$\times$ faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality. This work presents a lossless acceleration framework for VLMs using dynamic tree-structured speculative decoding, opening a path toward practical real-time multimodal assistants. Importantly, the lightweight draft model design makes the framework amenable to resource-constrained or on-device deployment settings.
Related papers
- CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding [71.88471147281406]
We propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings.<n>By incorporating iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact and well-structured representations.<n>Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count.
arXiv Detail & Related papers (2026-01-29T04:47:27Z) - Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z) - SpecVLM: Fast Speculative Decoding in Vision-Language Models [14.243294546325714]
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs)<n>We study speculative decoding for vision-language models (VLMs)<n>We introduce SpecVLM, a practical system that delivers 1.5--2.3x end-to-end speedups over full autoregressive inference.
arXiv Detail & Related papers (2025-09-15T11:53:56Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z) - DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding [11.946177537665402]
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs)<n>We introduce DREAM, a novel speculative decoding framework tailored for vision-language models (VLMs)
arXiv Detail & Related papers (2025-05-25T15:56:50Z) - FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks [41.04727840852988]
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds.<n>This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text.<n>We propose textbfFLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs.
arXiv Detail & Related papers (2025-05-19T05:35:30Z) - MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models [0.09895793818721334]
We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV)<n>MASSV transforms existing small language models into effective multimodal drafters through a two-phase approach.<n>Experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks.
arXiv Detail & Related papers (2025-05-15T17:37:00Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs)<n>Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.<n>Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity.<n>We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively.<n>Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
arXiv Detail & Related papers (2025-03-02T08:27:48Z) - EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.<n>We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.<n>We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z) - Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.