Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
- URL: http://arxiv.org/abs/2510.16751v2
- Date: Fri, 24 Oct 2025 20:55:17 GMT
- Title: Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
- Authors: Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
- Abstract summary: We show that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks.
- Score: 3.558452956820138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
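To make the search mechanism concrete, the following is a minimal, hypothetical sketch of verifier-guided beam search over a discrete visual-token sequence. The `model.next_token_logits` and `verifier.score_partial` interfaces are assumptions for illustration, not the paper's published API; the point is that discrete tokens let weak partial candidates be pruned early and shared prefixes be reused.

```python
# Minimal, hypothetical sketch of verifier-guided beam search over discrete
# image tokens. `model.next_token_logits` and `verifier.score_partial` are
# assumed helpers for illustration only.
import torch

@torch.no_grad()
def beam_search_image_tokens(model, verifier, prompt, seq_len,
                             beam_width=4, expand_k=8):
    """Keep `beam_width` partial token sequences; expand each with its
    `expand_k` most likely next tokens, then prune by verifier score."""
    beams = [([], 0.0)]  # (token sequence so far, verifier score)
    for _ in range(seq_len):
        candidates = []
        for tokens, _ in beams:
            logits = model.next_token_logits(prompt, tokens)        # assumed helper
            top_ids = logits.topk(expand_k).indices                 # most likely next tokens
            for tok in top_ids.tolist():
                new_tokens = tokens + [tok]
                # Scoring a *partial* image is only possible because the
                # sequence is discrete and decoded token by token.
                score = verifier.score_partial(prompt, new_tokens)  # assumed helper
                candidates.append((new_tokens, score))
        # Early pruning: drop all but the best `beam_width` candidates, so no
        # further compute is spent extending weak partial images.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])[0]
```

A continuous diffusion sampler has no comparable intermediate discrete state to score and prune mid-generation, which is the architectural asymmetry the abstract points to.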
Related papers
- Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation [87.00172597953228]
Speculative decoding has shown promise in accelerating text generation without compromising quality. We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
arXiv Detail & Related papers (2025-10-29T17:43:31Z)
- Vision-Enhanced Time Series Forecasting via Latent Diffusion Models [12.54316645614762]
LDM4TS is a novel framework that leverages the powerful image reconstruction capabilities of latent diffusion models for vision-enhanced time series forecasting. We are the first to use complementary transformation techniques to convert time series into multi-view visual representations.
arXiv Detail & Related papers (2025-02-16T14:15:06Z)
- Accelerated Diffusion Models via Speculative Sampling [89.43940130493233]
Speculative sampling is a popular technique for accelerating inference in Large Language Models. We extend speculative sampling to diffusion models, which generate samples via continuous, vector-valued Markov chains. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model.
arXiv Detail & Related papers (2025-01-09T16:50:16Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
The Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step. Our framework offers a 1.3$\times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
- Sequential Posterior Sampling with Diffusion Models [15.028061496012924]
We propose a novel approach that models the transition dynamics to improve the efficiency of sequential diffusion posterior sampling in conditional image synthesis.
We demonstrate the effectiveness of our approach on a real-world dataset of high frame rate cardiac ultrasound images.
Our method opens up new possibilities for real-time applications of diffusion models in imaging and other domains requiring real-time inference.
arXiv Detail & Related papers (2024-09-09T07:55:59Z)
- RL for Consistency Models: Faster Reward Guided Text-to-Image Generation [15.238373471473645]
We propose a framework for fine-tuning consistency models via Reinforcement Learning (RL).
Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure.
Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of generation as measured by the reward objectives, and speeds up the inference procedure by generating high-quality images in as few as two inference steps.
arXiv Detail & Related papers (2024-03-25T15:40:22Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Fast Sampling Through The Reuse Of Attention Maps In Diffusion Models [11.257468339231362]
Text-to-image diffusion models have demonstrated unprecedented capabilities for flexible and realistic image synthesis. These models rely on a time-consuming sampling procedure, which has motivated attempts to reduce their latency. Our approach seeks to reduce latency directly, without any retraining, fine-tuning, or knowledge distillation. We empirically compare our reuse strategies with few-step sampling procedures of comparable latency, finding that reuse generates images that are closer to those produced by the original high-latency diffusion model. (A rough caching sketch of this reuse idea appears after this list.)
arXiv Detail & Related papers (2023-12-13T17:05:37Z)
- IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
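As a rough illustration of the attention-map reuse idea from the Fast Sampling entry above, the sketch below caches attention maps on a full denoising step and injects them on the following steps. The `return_attn` and `attn_override` arguments on the `unet` callable are assumed, hypothetical hooks, not that paper's actual implementation.

```python
# Hypothetical sketch of attention-map reuse across diffusion sampling steps.
# `unet` is an assumed callable with `return_attn` / `attn_override` hooks;
# the real mechanism and interfaces may differ.
class AttentionReuseCache:
    def __init__(self, reuse_every=2):
        self.reuse_every = reuse_every   # recompute attention every N steps
        self.cached_maps = None          # attention maps from the last full pass

    def should_recompute(self, step):
        return self.cached_maps is None or step % self.reuse_every == 0

    def denoise_step(self, unet, latents, t, step, cond):
        if self.should_recompute(step):
            # Full pass: run attention normally and keep the maps around.
            noise_pred, self.cached_maps = unet(latents, t, cond, return_attn=True)
        else:
            # Cheap pass: inject cached maps instead of recomputing attention.
            noise_pred, _ = unet(latents, t, cond, attn_override=self.cached_maps)
        return noise_pred
```

The reuse cadence (`reuse_every`) trades fidelity for latency: larger values skip more attention recomputation but drift further from the full-latency sampler.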
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.