Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- URL: http://arxiv.org/abs/2404.02905v2
- Date: Mon, 10 Jun 2024 17:59:07 GMT
- Title: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
- Abstract summary: We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and code to promote the exploration of AR/token models for visual generation and unified learning.
- Score: 33.57820997288788
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, bringing the Frechet inception distance (FID) from 18.65 down to 1.73 and the inception score (IS) from 80.4 up to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
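To make the paradigm concrete, the sketch below illustrates the coarse-to-fine generation loop the abstract describes: at each step the transformer predicts the entire token map at the next resolution, conditioned on all coarser scales, and the decoded residuals are accumulated into one feature map. This is a minimal sketch under stated assumptions, not the released implementation: model, vqvae, and their members (class_embedding, input_proj, codebook, dim, decode) are hypothetical placeholders, and details such as block-wise causal attention, KV caching, and classifier-free guidance are omitted.

```python
# Minimal sketch of VAR-style "next-scale prediction" at inference time.
# All objects here are hypothetical placeholders, not the released VAR API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_next_scale(model, vqvae, label,
                        scales=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16)):
    seq = model.class_embedding(label).unsqueeze(1)    # [B, 1, D] start token
    f_hat = None                                       # accumulated feature map
    for i, hk in enumerate(scales):
        # Predict the whole hk x hk token map in ONE forward pass,
        # conditioned on every coarser scale generated so far.
        logits = model(seq)[:, -hk * hk:]              # [B, hk*hk, V]
        probs = F.softmax(logits, dim=-1)
        idx = torch.multinomial(probs.flatten(0, 1), 1).view(-1, hk * hk)
        # Look up codebook vectors, upsample to the final resolution, and
        # accumulate -- loosely mirroring multi-scale residual quantization.
        z = vqvae.codebook(idx).transpose(1, 2).reshape(-1, vqvae.dim, hk, hk)
        up = F.interpolate(z, size=scales[-1], mode='bicubic')
        f_hat = up if f_hat is None else f_hat + up
        if i + 1 < len(scales):
            nxt = scales[i + 1]
            # Next step's input: current estimate resized to the next scale
            # (a simplification of how VAR builds the next input tokens).
            ctx = F.interpolate(f_hat, size=nxt, mode='area')  # [B, D, n, n]
            ctx = model.input_proj(ctx.flatten(2).transpose(1, 2))
            seq = torch.cat([seq, ctx], dim=1)
    return vqvae.decode(f_hat)                         # feature map -> image
```

Because each scale is predicted in parallel rather than token by token, the number of sequential decoding steps grows with the number of scales (here 10) instead of with the number of tokens, which is the source of the inference speedup the abstract reports.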
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL [112.92522479863054]
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications.
We demonstrate that our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on text-to-image benchmarks.
By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation.
arXiv Detail & Related papers (2025-04-15T17:59:46Z) - FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning [66.5214586624095]
Existing Visual Autoregressive (VAR) paradigms process the entire token map at each scale step, causing complexity and runtime to scale dramatically with image resolution.
We propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs.
Experiments show FastVAR can further speed up FlashAttention-accelerated VAR by 2.7x with a negligible performance drop of 1%.
arXiv Detail & Related papers (2025-03-30T08:51:19Z) - Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation [34.112157859384645]
Autoregressive (AR) modeling underpins state-of-the-art language and visual generative models.
Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision.
We propose xAR, a framework that extends the notion of a token to an entity X.
arXiv Detail & Related papers (2025-02-27T18:59:08Z) - FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
This work challenges the residual prediction paradigm in visual autoregressive modeling.
It presents a new Flexible Visual AutoRegressive image generation paradigm.
This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable.
arXiv Detail & Related papers (2025-02-27T17:39:17Z) - FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching [34.112157859384645]
We introduce FlowAR, a next scale prediction method featuring a streamlined scale design.
This eliminates the need for VAR's intricate multi-scale residual tokenizer.
We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
arXiv Detail & Related papers (2024-12-19T18:59:31Z) - RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders.
Our design enables random order by inserting a "position instruction token" before each image token to be predicted.
RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
arXiv Detail & Related papers (2024-12-02T18:59:53Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling and inter-scale modeling.
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [74.15447383432262]
The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer.
We provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks.
We produce a family of auto-regressive image generation models ranging from 300M to 1.5B.
arXiv Detail & Related papers (2024-09-06T17:14:53Z) - Scalable Autoregressive Image Generation with Mamba [23.027439743155192]
We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture.
Mamba is a novel state-space model characterized by its exceptional performance in long-sequence modeling with linear time complexity.
We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B.
arXiv Detail & Related papers (2024-08-22T09:27:49Z) - Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance if scaled properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z) - Sparse then Prune: Toward Efficient Vision Transformers [2.191505742658975]
Vision Transformer is a deep learning model inspired by the success of the Transformer model in Natural Language Processing.
Applying Sparse Regularization to Vision Transformers can increase accuracy by 0.12%.
Applying pruning to models with Sparse Regularization yields even better results.
arXiv Detail & Related papers (2023-07-22T05:43:33Z) - Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Fast-Slow Transformer for Visually Grounding Speech [15.68151998164009]
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images.
arXiv Detail & Related papers (2021-09-16T18:45:45Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)