VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
- URL: http://arxiv.org/abs/2511.13587v1
- Date: Mon, 17 Nov 2025 16:50:58 GMT
- Title: VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
- Authors: Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang,
- Abstract summary: speculative decoding (SD) has been proven effective for accelerating visual AR models.<n>We propose a novel framework VVS to accelerate visual AR generation via partial verification skipping.
- Score: 52.58270801983525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors to retain generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
Related papers
- Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis [50.48301331112126]
We propose NOVA, a training-free token reduction acceleration framework for Visual AutoRegressive modeling.<n>NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth.<n>Experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.
arXiv Detail & Related papers (2026-02-01T17:29:42Z) - TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification [63.65902785448346]
Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism.<n>We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost.<n>Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
arXiv Detail & Related papers (2026-01-30T17:04:18Z) - StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models [69.07782637329315]
Visual Autoregressive ( VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction.<n>Existing acceleration methods reduce runtime for large-scale steps, but rely on manual step selection and overlook the varying importance of different stages in the generation process.<n>We present Stage VAR, a systematic study and stage-aware acceleration framework for VAR models.
arXiv Detail & Related papers (2025-12-18T12:51:19Z) - USV: Unified Sparsification for Accelerating Video Diffusion Models [11.011602744993942]
Unified Sparsification for Video diffusion models is an end-to-end trainable framework.<n>It orchestrates sparsification across both the model's internal computation and its sampling process.<n>It achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity.
arXiv Detail & Related papers (2025-12-05T14:40:06Z) - MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration [14.582793306013615]
We propose MC-SJD, a training-free parallel decoding framework designed to accelerate AR visual generation.<n>We show that MC-SJD substantially accelerates standard Speculative Jacobi Decoding (SJD) by maximizing the probability of sampling identical draft tokens.<n>Remarkably, this method requires only a single-line modification to the existing algorithm, yet substantial performance gains.
arXiv Detail & Related papers (2025-10-28T09:26:27Z) - ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive( VAR) models.<n>The proposed Reference Attention module discards the unnecessary attention from image$rightarrow$condition, reducing computational cost.<n>Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities.<n>Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation.<n>We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping [30.85025293160079]
High-frequency components, or later steps, in the generation process contribute disproportionately to inference latency.<n>We identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy.<n>We propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency.
arXiv Detail & Related papers (2025-06-10T15:35:29Z) - SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification [74.36139886192495]
We propose a novel generative framework named SD-ReID for AG-ReID.<n>We first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions.<n>We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions.
arXiv Detail & Related papers (2025-04-13T12:44:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.