FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
- URL: http://arxiv.org/abs/2412.15205v1
- Date: Thu, 19 Dec 2024 18:59:31 GMT
- Title: FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
- Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
- Abstract summary: We introduce FlowAR, a next scale prediction method featuring a streamlined scale design. This eliminates the need for VAR's intricate multi-scale residual tokenizer. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark.
- Score: 34.112157859384645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Code will be available at https://github.com/OliverRensu/FlowAR.
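As a rough illustration of the "each subsequent scale is simply double the previous one" design mentioned in the abstract, the sketch below builds such a schedule of side lengths. The function name `doubling_scales`, the 16x16 final latent size, and the choice of 5 scales are assumptions made for this example only; they are not taken from the paper or its repository.

```python
def doubling_scales(final_size: int, num_scales: int) -> list[int]:
    """Return side lengths in which each scale is double the previous one.

    Hypothetical helper for illustration; not part of the FlowAR codebase.
    """
    if num_scales < 1:
        raise ValueError("num_scales must be at least 1")
    factor = 2 ** (num_scales - 1)
    if final_size % factor != 0:
        raise ValueError("final_size must be divisible by 2 ** (num_scales - 1)")
    # Smallest scale, then repeated doubling up to final_size.
    first = final_size // factor
    return [first * 2 ** i for i in range(num_scales)]


# Assumed example: a 16x16 latent grid reached in 5 doubling steps.
scales = doubling_scales(16, 5)
tokens_per_scale = [s * s for s in scales]
print(scales)            # [1, 2, 4, 8, 16]
print(tokens_per_scale)  # [1, 4, 16, 64, 256]
```

Because the schedule depends only on the final resolution and the number of scales, it places no special requirements on the tokenizer, which is consistent with the abstract's claim that an off-the-shelf VAE can be used.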
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation [62.77721499671665]
We introduce GigaTok, the first approach to improve image reconstruction, generation, and representation learning when scaling visual tokenizers.
We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma.
By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
arXiv Detail & Related papers (2025-04-11T17:59:58Z) - Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.
We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism.
Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
arXiv Detail & Related papers (2025-03-20T17:59:59Z) - Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z) - Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR) in the form of next-scale prediction.
We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z) - High-Resolution Image Synthesis via Next-Token Prediction [13.131691198804127]
We introduce D-JEPA·T2I, an extension of D-JEPA incorporating flow matching loss to enable continuous resolution learning.
For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
arXiv Detail & Related papers (2024-11-22T09:08:58Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly.
A simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model in this work.
It is shown to achieve a state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z) - Scale Attention for Learning Deep Face Representation: A Study Against Visual Scale Variation [69.45176408639483]
We reform the conv layer by resorting to the scale-space theory.
We build a novel style named SCale AttentioN Conv Neural Network (SCAN-CNN).
As a single-shot scheme, the inference is more efficient than multi-shot fusion.
arXiv Detail & Related papers (2022-09-19T06:35:04Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z) - Normalizing Flows with Multi-Scale Autoregressive Priors [131.895570212956]
We introduce channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR).
Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data.
We show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
arXiv Detail & Related papers (2020-04-08T09:07:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.