Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
- URL: http://arxiv.org/abs/2509.15185v1
- Date: Thu, 18 Sep 2025 17:47:40 GMT
- Title: Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
- Authors: Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou
- Abstract summary: We present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics. We show that these issues can be effectively addressed by introducing self-supervised objectives during training.
- Score: 110.03631978640298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
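The abstract describes ST-AR's core idea: keep the standard autoregressive next-token prediction loss, but add self-supervised objectives that encourage high-level semantics. The paper's actual objectives are not specified here, so the following is a minimal illustrative sketch in plain Python, assuming a per-position cross-entropy term plus a hypothetical feature-consistency term (the function names and the weighting scheme are illustrative, not the authors' implementation):

```python
import math

def next_token_loss(logits, target):
    # Standard autoregressive cross-entropy at one position:
    # -log softmax(logits)[target], computed with the log-sum-exp trick.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def consistency_loss(feat_a, feat_b):
    # Toy self-supervised term: penalize disagreement between two feature
    # "views" of the same image (e.g., under different augmentations),
    # addressing issues like inter-step semantic inconsistency.
    # Mean squared difference between the two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)

def st_ar_style_loss(logits, target, feat_a, feat_b, lam=0.5):
    # Combined objective in the spirit of ST-AR: autoregressive
    # generation loss plus a weighted self-supervised auxiliary loss.
    # lam is a hypothetical trade-off weight.
    return next_token_loss(logits, target) + lam * consistency_loss(feat_a, feat_b)
```

The key design point is that the self-supervised term is an auxiliary training signal only: sampling at inference time is unchanged, which is consistent with the abstract's note that the FID gains come "while maintaining the same sampling strategy."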
Related papers
- How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices [31.845000971936923]
Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? We present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality.
arXiv Detail & Related papers (2026-03-05T09:57:45Z) - DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation [108.71044040025374]
We present a novel framework for subject-driven image synthesis built upon a Visual Autoregressive model that employs next-scale prediction. We show that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
arXiv Detail & Related papers (2026-01-30T03:32:29Z) - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z) - GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features. A residual learning module is proposed to recover the fine-grained details to minimize the reconstruction error caused by quantization. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z) - IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving an FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z) - Foundations and Models in Modern Computer Vision: Key Building Blocks in Landmark Architectures [34.542592986038265]
This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem. We examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches.
arXiv Detail & Related papers (2025-07-31T09:08:11Z) - Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z) - Enhancing Variational Autoencoders with Smooth Robust Latent Encoding [54.74721202894622]
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models. We introduce Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks.
arXiv Detail & Related papers (2025-04-24T03:17:57Z) - Autoregressive Image Generation with Vision Full-view Prompt [18.569610688433745]
Inspired by prompt engineering from the field of NLP, we propose Vision Full-view prompt (VF prompt) to enhance autoregressive image generation.
arXiv Detail & Related papers (2025-02-24T08:44:01Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - High-Resolution Image Synthesis via Next-Token Prediction [19.97037318862443]
We introduce D-JEPA·T2I, an autoregressive model based on continuous tokens to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
arXiv Detail & Related papers (2024-11-22T09:08:58Z) - CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling.
CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process.
Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.