FuguReport

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Authors: Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang
Affiliations: Tsinghua University / Nanyang Technological University / The Hong Kong University of Science and Technology / Baidu / National University of Singapore / Fudan University / The University of Hong Kong / MiroMind / University of Waterloo / StepFun
Categories: Task / Visual Generation / Classification of five generative stages; Method / Model Improvement / Enhancements in visual representation and generation; Evaluation / Sampling Optimization / Techniques for sampling acceleration and data curation
License: CC BY 4.0

Overview

This paper presents a roadmap for modern visual generation, arguing that the field must progress beyond one-shot appearance synthesis toward intelligent, interactive, and causally grounded world-modeling systems. It introduces a five-level capability taxonomy (Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation) that organizes research by the type of competence a system exhibits rather than by architecture alone. The survey covers foundational generative paradigms (GANs, diffusion, flow matching, autoregressive, and hybrid models), architectural components, training and post-training methods (including SFT, DPO, GRPO, and reward modeling), data curation, evaluation practices, infrastructure, and application domains. It complements benchmark-based evaluation with in-the-wild stress tests across eight dimensions: spatial structuring, physical reasoning, visual-textual integration, multi-turn editing drift, human-centric editing, low-level vision, cross-disciplinary applications, and high-level vision tasks. These tests are used to locate where current systems succeed and where they remain limited.

Novelty

The paper's primary novelty is a capability-centered five-level taxonomy that reframes visual generation as a nested progression from passive single-pass rendering to closed-loop agentic interaction and causally grounded world modeling. It further distinguishes itself by pairing this conceptual framework with structured stress-test case studies across eight evaluation dimensions that map concrete failure modes to specific taxonomy levels, revealing gaps that standard benchmarks obscure.

Results

The survey finds that recent visual generation systems have made substantial progress in photorealism, typography, instruction following, reference-based editing, and multimodal integration, driven by flow matching, unified understanding-generation architectures, and post-training alignment. However, its stress tests across eight dimensions reveal persistent weaknesses in spatial precision, long-horizon consistency, identity preservation, physical and causal reasoning, multi-turn editing stability, and domain-knowledge generation. The survey concludes that high perceptual quality does not imply mastery of structural, temporal, or causal coherence.

Key Points

  1. The work proposes a five-stage taxonomy (Atomic, Conditional, In-Context, Agentic, and World-Modeling Generation) that defines capability growth as a nested expansion from single-pass rendering to causally grounded world simulation, with each level adding a qualitatively new competence.
  2. It synthesizes the major technical drivers behind recent progress, including the diffusion-to-flow-matching transition, unified understanding-generation architectures, improved visual representations, post-training alignment (SFT, DPO, GRPO), reward modeling, large-scale data curation with VLM relabeling, and inference acceleration via distillation.
  3. Its evaluation methodology pairs benchmark review with in-the-wild stress tests across eight dimensions (spatial structuring, physical reasoning, visual-textual integration, multi-turn editing drift, human-centric editing, low-level vision, cross-disciplinary applications, and high-level vision tasks), showing that high perceptual quality can mask failures in structural, temporal, and causal reasoning.
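
The nested taxonomy and the stress-test dimensions described above can be sketched as a small data structure. This is only an illustrative sketch, not anything taken from the paper: the names `GenerationLevel`, `STRESS_TEST_DIMENSIONS`, and `subsumed_levels` are hypothetical, and the numeric ordering simply assumes that each level includes the competences of the levels below it, as the summary's "nested expansion" wording suggests.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """Five capability levels from the summary, ordered lowest to highest.

    The integer values are an assumption used only to express the ordering.
    """
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5


# The eight stress-test dimensions as named in the summary.
STRESS_TEST_DIMENSIONS = (
    "spatial structuring",
    "physical reasoning",
    "visual-textual integration",
    "multi-turn editing drift",
    "human-centric editing",
    "low-level vision",
    "cross-disciplinary applications",
    "high-level vision tasks",
)


def subsumed_levels(level):
    """Return every level up to and including `level`.

    Encodes the assumed nesting: a system at a given level is expected
    to also exhibit the competences of all lower levels.
    """
    return [lv for lv in GenerationLevel if lv <= level]
```

For example, `subsumed_levels(GenerationLevel.IN_CONTEXT)` returns the Atomic, Conditional, and In-Context levels in that order. How the paper assigns each stress-test dimension to a taxonomy level is not stated in this summary, so the sketch deliberately leaves that mapping out.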

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.