Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- URL: http://arxiv.org/abs/2404.02905v2
- Date: Mon, 10 Jun 2024 17:59:07 GMT
- Title: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
- Abstract summary: We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and code to promote the exploration of AR/token models for visual generation and unified learning.
- Score: 33.57820997288788
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions quickly and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, improving the Fréchet inception distance (FID) from 18.65 to 1.73 and the inception score (IS) from 80.4 to 350.2, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has begun to emulate two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
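As a rough illustration of the paradigm, the loop below sketches next-scale generation. The `transformer` and `decode_tokens` interfaces and the scale schedule are hypothetical stand-ins, not the released VAR code, and the sketch omits VAR's residual multi-scale tokenization.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def next_scale_generation(transformer, decode_tokens,
                          scales=(1, 2, 3, 4, 6, 9, 13, 18, 24, 32)):
    # Coarse-to-fine generation: each step predicts the ENTIRE token map
    # of the next (higher) resolution, conditioned on all coarser maps,
    # instead of predicting one raster-ordered token at a time.
    context = []                                   # token maps generated so far
    for side in scales:
        logits = transformer(context, target_hw=(side, side))  # (side*side, vocab)
        probs = F.softmax(logits, dim=-1)
        tokens = torch.multinomial(probs, num_samples=1).view(side, side)
        context.append(tokens)                     # conditions the next, finer scale
    return decode_tokens(context)                  # VQ decoder: token maps -> image
```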
Related papers
- One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR method based on flow matching models.
First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model.
Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
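The TV-LPIPS loss is not specified in this summary; the sketch below shows one plausible reading, pairing the standard `lpips` package with a total-variation term, and may differ from FluxSR's actual formulation.

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # standard LPIPS; expects inputs in [-1, 1]

def total_variation(x):
    # Mean absolute difference between neighbouring pixels (standard TV).
    tv_h = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    tv_w = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return tv_h + tv_w

def tv_lpips(pred, target, tv_weight=0.1):
    # One plausible reading of "TV-LPIPS": LPIPS plus a total-variation
    # penalty on high-frequency artifacts. The exact formulation and
    # weighting in the paper may differ.
    return lpips_fn(pred, target).mean() + tv_weight * total_variation(pred)
```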
arXiv Detail & Related papers (2025-02-04T04:11:29Z) - FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching [34.112157859384645]
We introduce FlowAR, a next scale prediction method featuring a streamlined scale design.
This eliminates the need for VAR's intricate multi-scale residual tokenizer.
We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
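A minimal sketch of what one scale-wise flow matching step could look like: integrate a learned velocity field from noise to the next-resolution latent, conditioned on the upsampled coarser latent. The `velocity_net` interface and Euler schedule are assumptions, not FlowAR's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flowar_scale_step(velocity_net, coarse, out_hw, n_steps=20):
    # Condition on the previous (coarser) latent, upsampled to the target size.
    cond = F.interpolate(coarse, size=out_hw, mode='bilinear', align_corners=False)
    x = torch.randn_like(cond)                  # the flow starts from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):                    # simple Euler integration of the ODE
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, cond)   # dx/dt = v(x, t | coarse)
    return x                                    # next-resolution latent
```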
arXiv Detail & Related papers (2024-12-19T18:59:31Z) - RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders.
Our design enables random order by inserting a "position instruction token" before each image token to be predicted.
RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
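The position-instruction mechanism can be sketched as follows: before each image token, insert an embedding that tells the decoder which location to predict next, so any token order becomes a valid sequence. The shapes and interleaving below are a minimal illustration, not the paper's implementation.

```python
import torch

def interleave_position_tokens(image_token_embs, pos_embs, order=None):
    # image_token_embs, pos_embs: (n, d) assumed embedding tables.
    n = image_token_embs.shape[0]
    if order is None:
        order = torch.randperm(n)            # any generation order is allowed
    seq = []
    for idx in order:
        seq.append(pos_embs[idx])            # instruction: predict THIS position next
        seq.append(image_token_embs[idx])    # the image token at that position
    return torch.stack(seq), order           # (2n, d) decoder input
```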
arXiv Detail & Related papers (2024-12-02T18:59:53Z) - Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework.
CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales.
CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
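A minimal sketch of the drafter-refiner split this implies: a large model handles the coarse scales, a small one the fine scales. The `generate_scale` interface and handoff point are hypothetical; CoDe's actual mechanism differs in detail.

```python
def collaborative_decode(large_model, small_model, scales, draft_scales=6):
    # Observation exploited by CoDe: later (larger) scales need far less
    # capacity, so the big model "drafts" the coarse scales and the small
    # model finishes the fine scales, cutting latency and memory.
    context = []
    for i, side in enumerate(scales):
        model = large_model if i < draft_scales else small_model
        context.append(model.generate_scale(context, side))  # hypothetical API
    return context
```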
arXiv Detail & Related papers (2024-11-26T15:13:15Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling and inter-scale modeling.
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
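The decoupling can be sketched as below: full attention within each (small) scale, and a linear-complexity sequence model across scales. An `nn.GRU` stands in for Mamba, and equal token counts per scale are assumed for simplicity; neither matches the paper's actual architecture.

```python
import torch.nn as nn

class DecoupledScaleBlock(nn.Module):
    # Intra-scale: attention over tokens WITHIN one scale (cheap, scales are small).
    # Inter-scale: linear-time recurrence ACROSS scales (Mamba in the paper;
    # a GRU is used here purely as a runnable stand-in).
    def __init__(self, dim, heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                # x: (batch, n_scales, tokens, dim)
        b, s, t, d = x.shape
        x = x.reshape(b * s, t, d)
        x = x + self.intra(x, x, x, need_weights=False)[0]   # within each scale
        x = x.reshape(b, s, t, d).transpose(1, 2).reshape(b * t, s, d)
        x = x + self.inter(x)[0]                             # across scales
        return x.reshape(b, t, s, d).transpose(1, 2)
```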
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
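One way to read "randomized autoregressive" training that stays LM-compatible: permute the target token order early in training and anneal back to raster order. The linear schedule below is an assumption, not RAR's published recipe.

```python
import torch

def training_order(seq_len, step, total_steps, anneal_end=0.75):
    # Probability of using a random permutation decays linearly to zero,
    # so the converged model decodes like an ordinary language model.
    p_random = max(0.0, 1.0 - step / (anneal_end * total_steps))
    if torch.rand(()) < p_random:
        return torch.randperm(seq_len)   # randomized token order
    return torch.arange(seq_len)         # standard raster order
```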
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [74.15447383432262]
The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer.
We provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks.
We produce a family of auto-regressive image generation models ranging from 300M to 1.5B parameters.
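MAGVIT-v2's tokenizer is built on lookup-free quantization (LFQ); the sketch below follows the published idea rather than the Open-MAGVIT2 code.

```python
import torch

def lookup_free_quantize(z):
    # LFQ: each latent channel is binarized by its sign, so a D-dim latent
    # maps to one of 2**D codes with no codebook lookup at all.
    bits = (z > 0).long()                                    # (..., D) in {0, 1}
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    indices = (bits * weights).sum(dim=-1)                   # integer code ids
    quantized = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    return quantized, indices
```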
arXiv Detail & Related papers (2024-09-06T17:14:53Z) - Scalable Autoregressive Image Generation with Mamba [23.027439743155192]
We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture.
Mamba is a novel state-space model characterized by its exceptional long-sequence modeling performance with linear time complexity.
We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B.
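Why linear time: a state-space model updates a fixed-size state per step. The toy scan below illustrates the recurrence (scalar inputs, no input-dependent selectivity), not Mamba's fused selective-scan kernel.

```python
import torch

def ssm_scan(A, B, C, x):
    # A: (d, d), B: (d,), C: (d,), x: (L,) scalar input sequence.
    # Each step touches only a constant-size state h, so a length-L
    # sequence costs O(L) time, versus O(L^2) for full attention.
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t      # constant-size state update
        ys.append(C @ h)         # per-step readout
    return torch.stack(ys)       # (L,) outputs
```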
arXiv Detail & Related papers (2024-08-22T09:27:49Z) - Sparse then Prune: Toward Efficient Vision Transformers [2.191505742658975]
The Vision Transformer is a deep learning model inspired by the success of the Transformer in natural language processing.
Applying Sparse Regularization to Vision Transformers can increase accuracy by 0.12%.
Applying pruning to models with Sparse Regularization yields even better results.
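A generic sketch of the two stages, assuming plain L1 regularization and per-tensor magnitude pruning; the paper's exact sparse regularizer and pruning schedule may differ.

```python
import torch

def l1_penalty(model, weight=1e-5):
    # Sparse regularization: add this term to the task loss so that
    # weights are pushed toward zero during training.
    return weight * sum(p.abs().sum() for p in model.parameters())

@torch.no_grad()
def magnitude_prune(model, sparsity=0.5):
    # After sparse training, zero the smallest-magnitude weights;
    # sparsity-trained models tolerate this with less accuracy loss.
    for p in model.parameters():
        k = int(p.numel() * sparsity)
        if k == 0:
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        p[p.abs() <= threshold] = 0.0
```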
arXiv Detail & Related papers (2023-07-22T05:43:33Z) - Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
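The two-stage pipeline can be sketched as follows, with hypothetical `encoder`, `quantizer`, and `transformer` interfaces standing in for the paper's models.

```python
import torch
import torch.nn.functional as F

def vq_token_ar_loss(encoder, quantizer, transformer, image):
    # Stage 1: a VQGAN-style encoder + quantizer turns the image into a
    # grid of discrete codebook indices.
    latents = encoder(image)            # (B, C, H, W) continuous features
    tokens = quantizer(latents)         # (B, H, W) codebook ids
    # Stage 2: flatten in raster order and train with ordinary
    # next-token prediction, exactly as in language modeling.
    seq = tokens.flatten(1)             # (B, H*W)
    logits = transformer(seq[:, :-1])   # (B, H*W - 1, vocab)
    return F.cross_entropy(logits.transpose(1, 2), seq[:, 1:])
```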
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
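Characterizing such scaling relationships typically means fitting a power law in log-log space. The helper below is illustrative data handling, not either paper's published fit; the correlation coefficient it returns is the linearity check behind figures like VAR's reported -0.998.

```python
import numpy as np

def fit_power_law(compute, error):
    # Fit error ~ a * compute**b by linear regression in log-log space.
    log_c, log_e = np.log(compute), np.log(error)
    b, log_a = np.polyfit(log_c, log_e, deg=1)   # slope, intercept
    corr = np.corrcoef(log_c, log_e)[0, 1]       # near -1 => clean power law
    return np.exp(log_a), b, corr
```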
arXiv Detail & Related papers (2021-06-08T17:47:39Z)