Scale-Wise VAR is Secretly Discrete Diffusion
- URL: http://arxiv.org/abs/2509.22636v1
- Date: Fri, 26 Sep 2025 17:58:04 GMT
- Title: Scale-Wise VAR is Secretly Discrete Diffusion
- Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
- Abstract summary: Next-scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion process. We show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction.
- Score: 48.994983608261286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency, and unified architecture across language and vision. Among them, next-scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion process. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation quality.
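To make the Markovian attention mask concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' released code) of a block mask in which the tokens of each scale attend only to the tokens of the immediately preceding scale, the restriction under which the paper equates VAR with a discrete diffusion process. The per-scale token counts and the within-scale attention are assumptions made for the example.

```python
import torch

def markovian_scale_mask(tokens_per_scale):
    """Boolean attention mask (True = may attend) where every token of
    scale k attends only to its own scale and to scale k-1, instead of
    to the full prefix of all earlier scales as in vanilla VAR.
    Illustrative assumption, not the paper's exact implementation."""
    total = sum(tokens_per_scale)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start, prev_start, prev_len = 0, 0, 0
    for k, n in enumerate(tokens_per_scale):
        q = slice(start, start + n)
        mask[q, q] = True                                    # within-scale
        if k > 0:
            mask[q, prev_start:prev_start + prev_len] = True # scale k-1 only
        prev_start, prev_len = start, n
        start += n
    return mask

# e.g. three scales with 1x1, 2x2, 3x3 token maps
print(markovian_scale_mask([1, 4, 9]).int())
```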
Related papers
- Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation [81.40978077888693]
The visual representations of Contrastive Language-Image Pre-training (CLIP) have become a key bottleneck for downstream performance. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations.
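As a rough illustration of coupling a diffusion reconstruction objective with a contrastive signal, here is a hedged sketch; the weighting `lam`, the temperature, and the exact pairing of terms are our assumptions, not the paper's stated objective.

```python
import torch
import torch.nn.functional as F

def combined_loss(eps_pred, eps, img_emb, txt_emb, temperature=0.07, lam=0.5):
    """Illustrative combination: a standard diffusion denoising loss plus
    an InfoNCE-style contrastive term keeping image and text embeddings
    aligned. Not the paper's exact formulation."""
    # diffusion reconstruction: predict the noise added at this timestep
    recon = F.mse_loss(eps_pred, eps)
    # InfoNCE over a batch: matching (image, text) pairs are positives
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0))
    contrastive = F.cross_entropy(logits, labels)
    return recon + lam * contrastive
```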
arXiv Detail & Related papers (2026-03-05T04:45:49Z)
- Diversity Has Always Been There in Your Visual Autoregressive Models [78.27363151940996]
Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm. Despite their efficiency, VAR models often suffer from diversity collapse, analogous to that observed in few-step distilled diffusion models. We introduce Diverse VAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training.
arXiv Detail & Related papers (2025-11-21T09:24:09Z)
- Your VAR Model is Secretly an Efficient and Explainable Generative Classifier [19.629406299980463]
We propose a novel generative classifier built on recent advances in visual autoregressive modeling. We show that the VAR-based method has fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via tokenwise mutual information.
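The classifier construction the summary alludes to can be sketched via Bayes' rule: a tractable conditional likelihood p(x|y) yields class posteriors directly. The tensor layout and uniform prior below are illustrative assumptions, not the paper's code.

```python
import torch

def generative_classify(token_logps_per_class, log_prior=None):
    """Minimal sketch of a generative classifier: given, for each class y,
    the model's per-token log-likelihoods of the image tokens under class
    conditioning, classify via p(y | x) proportional to p(x | y) p(y).
    token_logps_per_class: tensor of shape (num_classes, num_tokens)."""
    log_px_given_y = token_logps_per_class.sum(dim=-1)  # log p(x | y)
    if log_prior is None:
        log_prior = torch.zeros_like(log_px_given_y)    # uniform prior
    # each token's individual log-likelihood gap across classes is what
    # makes tokenwise attribution possible in this framing
    return torch.softmax(log_px_given_y + log_prior, dim=-1)
```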
arXiv Detail & Related papers (2025-10-14T01:59:01Z)
- SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation [62.14510717860079]
We propose a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation.
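One plausible reading of blockwise diffusion decoding is sketched below (our interpretation, not SDAR's released code): blocks are emitted left-to-right, i.e. autoregressively, while tokens inside each block are denoised in parallel over a few confidence-based unmasking steps. The `model` interface, `mask_id`, and the unmasking schedule are assumptions.

```python
import torch

@torch.no_grad()
def blockwise_generate(model, prompt, block_len=16, num_blocks=4,
                       steps=4, mask_id=0):
    """Blocks are generated sequentially; within a block, all positions
    start masked and are refined in parallel, re-masking the least
    confident tokens at each step. Illustrative sketch only."""
    seq = prompt  # (batch, prompt_len) token ids
    for _ in range(num_blocks):
        block = torch.full((seq.size(0), block_len), mask_id,
                           dtype=seq.dtype)
        for s in range(steps):
            logits = model(torch.cat([seq, block], dim=1))[:, -block_len:]
            conf, pred = logits.softmax(-1).max(-1)
            # keep the most confident tokens; re-mask the rest,
            # unmasking a larger fraction at every step
            k = int(block_len * (1 - (s + 1) / steps))
            if k > 0:
                idx = conf.topk(k, largest=False).indices
                pred.scatter_(1, idx, mask_id)
            block = pred
        seq = torch.cat([seq, block], dim=1)
    return seq
```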
arXiv Detail & Related papers (2025-10-07T17:29:28Z)
- Multi-scale Autoregressive Models are Laplacian, Discrete, and Latent Diffusion Models in Disguise [0.6875312133832079]
We revisit Visual Autoregressive models through the lens of an iterative-refinement framework. We formalise it as a deterministic forward process that constructs a Laplacian-style latent pyramid, paired with a learned backward process that reconstructs it in a small number of coarse-to-fine steps.
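The deterministic forward process described here is essentially a Laplacian pyramid. The NumPy sketch below (our illustration, using simple average-pool downsampling and nearest-neighbour upsampling as assumptions) constructs the coarse latent plus the per-scale residuals that a learned backward process would reconstruct coarse-to-fine.

```python
import numpy as np

def laplacian_pyramid(x, levels=3):
    """Deterministic 'forward process' sketch: repeatedly downsample an
    image and keep the residual detail lost at each scale."""
    def down(a):  # 2x average-pool downsample
        return a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2).mean((1, 3))
    def up(a):    # nearest-neighbour upsample
        return a.repeat(2, axis=0).repeat(2, axis=1)
    residuals = []
    for _ in range(levels):
        coarse = down(x)
        residuals.append(x - up(coarse))  # detail lost by downsampling
        x = coarse
    return x, residuals  # coarsest image + per-scale details

coarse, details = laplacian_pyramid(np.random.rand(32, 32), levels=3)
print(coarse.shape, [d.shape for d in details])
```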
arXiv Detail & Related papers (2025-10-03T09:05:38Z)
- Diffusion Beats Autoregressive in Data-Constrained Settings [46.06809870740238]
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored.
arXiv Detail & Related papers (2025-07-21T17:59:57Z)
- RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration [27.307331773270676]
Latent diffusion models (LDMs) have significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. We propose a novel generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $10\times$ faster inference.
arXiv Detail & Related papers (2025-05-23T15:52:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.