SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
- URL: http://arxiv.org/abs/2504.11455v1
- Date: Tue, 15 Apr 2025 17:59:46 GMT
- Title: SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
- Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. We demonstrate that our model can generate 1024x1024 resolution images with high fidelity and achieve competitive results on text-to-image benchmarks. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation.
- Score: 112.92522479863054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
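As a hedged illustration of the "vanilla autoregressive" pipeline the abstract describes, the sketch below decodes discrete visual tokens one at a time with a toy decoder-only transformer. The model size, codebook, and prompt handling are illustrative assumptions, not SimpleAR's actual configuration or its vLLM-served inference path.

```python
# Minimal sketch of a vanilla autoregressive text-to-image loop, assuming a
# decoder-only transformer over a discrete visual codebook. All sizes are
# toy placeholders, not SimpleAR's implementation.
import torch
import torch.nn as nn

VOCAB_SIZE = 16384   # assumed visual codebook size
IMG_TOKENS = 64      # tokens per image in this toy (e.g. an 8x8 latent grid)

class ToyARModel(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def sample_image_tokens(model, prompt_tokens, temperature=1.0, top_k=50):
    """Top-k next-token decoding of visual tokens, conditioned on the prompt."""
    tokens = prompt_tokens
    for _ in range(IMG_TOKENS):
        logits = model(tokens)[:, -1] / temperature
        topk = torch.topk(logits, top_k)
        probs = torch.softmax(topk.values, dim=-1)
        next_tok = topk.indices.gather(-1, torch.multinomial(probs, 1))
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, prompt_tokens.size(1):]  # visual tokens only

model = ToyARModel()
prompt = torch.randint(0, VOCAB_SIZE, (1, 8))   # stand-in for text tokens
visual_tokens = sample_image_tokens(model, prompt)
print(visual_tokens.shape)  # (1, 64) -> decoded to pixels by a VQ detokenizer
```

This inner decoding loop is the part that serving engines such as vLLM accelerate, which is how the abstract's roughly 14-second figure for a 1024x1024 image is reached.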
Related papers
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning [89.19449553099747]
We study the problem of Text-to-Image In-Context Learning (T2I-ICL). We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation. We then fine-tune MLLMs on the resulting ImageGen-CoT dataset to enhance their contextual reasoning capabilities.
arXiv Detail & Related papers (2025-03-25T03:18:46Z)
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection [21.677178476653385]
We introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities.
We show that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model.
It achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80.
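For context, below is a minimal sketch of the naive best-of-N baseline that Reflect-DiT improves on: draw N independent samples and keep the one a verifier scores highest. The `generate` and `verifier_score` callables are hypothetical placeholders, not the paper's components.

```python
# Hedged sketch of naive best-of-N sampling with an external verifier.
from typing import Callable, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], object],
              verifier_score: Callable[[str, object], float],
              n: int = 20) -> Tuple[object, float]:
    """Sample n images independently and keep the highest-scoring one."""
    best_img, best_score = None, float("-inf")
    for _ in range(n):
        img = generate(prompt)               # one independent diffusion sample
        score = verifier_score(prompt, img)  # e.g. a GenEval-style checker
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```

Reflect-DiT's contribution is to replace this blind resampling with in-context reflection on earlier attempts, which is why it matches the best-of-N budget of 20 samples with a higher score.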
arXiv Detail & Related papers (2025-03-15T21:58:12Z)
- FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
This work challenges the residual prediction paradigm in visual autoregressive modeling. It presents a new Flexible Visual AutoRegressive (FlexVAR) image generation paradigm. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable.
arXiv Detail & Related papers (2025-02-27T17:39:17Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
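To make the verification idea concrete, here is a loosely hedged sketch of step-wise reward scoring: a small model scores partially generated token sequences so that low-potential candidates can be pruned mid-generation. The `PotentialRewardModel` below is a toy stand-in, not the paper's actual PARM formulation.

```python
# Hypothetical step-wise verifier: score partial generations, prune the rest.
import torch
import torch.nn as nn

class PotentialRewardModel(nn.Module):
    """Toy scorer: embeds tokens and regresses a scalar potential."""
    def __init__(self, vocab_size=16384, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then map each sequence to one score.
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

rm = PotentialRewardModel()
candidates = torch.randint(0, 16384, (4, 32))  # 4 partial generations
scores = rm(candidates)
keep = candidates[scores.topk(2).indices]      # prune to the 2 most promising
print(keep.shape)  # torch.Size([2, 32])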
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images and thus can be encoded with very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024x1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction".
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and codes to promote the exploration of AR/token models for visual generation and unified learning.
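A minimal sketch of the coarse-to-fine "next-scale prediction" idea: rather than emitting one token at a time, the model predicts the entire token map at the next resolution, conditioned on all coarser maps. The scale schedule and the `predict_scale` stub below are illustrative assumptions, not VAR's actual design.

```python
# Hedged sketch of next-scale prediction: one forward pass per resolution,
# each conditioned on all previously generated (coarser) token maps.
import torch

SCALES = [1, 2, 4, 8, 16]   # assumed token-map side lengths, coarse to fine

def predict_scale(context: torch.Tensor, side: int) -> torch.Tensor:
    """Placeholder for the transformer: returns a (side*side,) token map."""
    return torch.randint(0, 16384, (side * side,))

def generate_by_scales() -> torch.Tensor:
    context = torch.empty(0, dtype=torch.long)    # start with no tokens
    for side in SCALES:
        tokens = predict_scale(context, side)     # whole map in one step
        context = torch.cat([context, tokens])    # condition on coarser maps
    return context  # all scales; the finest map is decoded to the image

maps = generate_by_scales()
print(maps.shape)  # 1 + 4 + 16 + 64 + 256 = 341 tokens across scales
```

Predicting a whole map per step is what lets this GPT-like formulation scale to high resolutions with far fewer decoding steps than token-by-token generation.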
arXiv Detail & Related papers (2024-04-03T17:59:53Z)
- Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly [114.81028176850404]
Training generative adversarial networks (GANs) with limited data generally results in deteriorated performance and collapsed models.
We decompose the data-hungry GAN training into two sequential sub-problems.
Such a coordinated framework enables us to focus on lower-complexity and more data-efficient sub-problems.
arXiv Detail & Related papers (2021-02-28T05:20:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.