Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Abstract Overview
This paper presents a roadmap for modern visual generation, arguing that the field must progress beyond one-shot appearance synthesis toward intelligent, interactive, and causally grounded world-modeling systems. It introduces a five-level capability taxonomy—Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation—to organize research by the type of competence exhibited rather than by architecture alone. The survey synthesizes foundational generative paradigms (GANs, diffusion, flow matching, autoregressive, and hybrid models), architectural components, training and post-training methods (including SFT, DPO, GRPO, and reward modeling), data curation, evaluation practices, infrastructure, and application domains. It complements benchmark-based evaluation with in-the-wild stress tests across eight dimensions (spatial structuring, physical reasoning, visual-textual integration, multi-turn editing drift, human-centric editing, low-level vision, cross-disciplinary applications, and high-level vision tasks) to locate where current systems succeed and where they remain limited.
Novelty
The paper's primary novelty is a capability-centered five-level taxonomy that reframes visual generation as a nested progression from passive single-pass rendering to closed-loop agentic interaction and causally grounded world modeling. It further distinguishes itself by pairing this conceptual framework with structured stress-test case studies across eight evaluation dimensions that map concrete failure modes to specific taxonomy levels, revealing gaps that standard benchmarks obscure.
Results
The survey finds that recent visual generation systems have achieved substantial progress in photorealism, typography, instruction following, reference-based editing, and multimodal integration, driven by flow matching, unified understanding-generation architectures, and post-training alignment. However, its stress tests across eight dimensions reveal persistent weaknesses in spatial precision, long-horizon consistency, identity preservation, physical and causal reasoning, multi-turn editing stability, and domain-knowledge generation—demonstrating that high perceptual quality does not imply mastery of structural, temporal, or causal coherence.
Key Points
- The work proposes a five-level taxonomy (Atomic, Conditional, In-Context, Agentic, and World-Modeling Generation) that defines capability growth as a nested expansion from single-pass rendering to causally grounded world simulation, with each level adding a qualitatively new competence.
- It synthesizes the major technical drivers behind recent progress, including the diffusion-to-flow-matching transition, unified understanding-generation architectures, improved visual representations, post-training alignment (SFT, DPO, GRPO), reward modeling, large-scale data curation with VLM relabeling, and inference acceleration via distillation.
- Its evaluation methodology pairs benchmark review with in-the-wild stress tests across eight dimensions (spatial structuring, physical reasoning, visual-textual integration, multi-turn editing drift, human-centric editing, low-level vision, cross-disciplinary applications, and high-level vision tasks), showing that high perceptual quality can mask failures in structural, temporal, and causal reasoning.
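One way to read the "nested" structure of the taxonomy is as an ordered scale in which reaching a level implies the competences of every level below it. The sketch below is illustrative only: the names `GenerationLevel`, `implied_levels`, and `STRESS_TEST_DIMENSIONS`, and the integer values 1 through 5, are assumptions made for this example and are not taken from the paper. It also does not assign stress-test dimensions to levels, since this summary does not state that mapping.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """Five capability levels named in the summary.

    The integer values are an assumed ordering for illustration.
    """
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5


def implied_levels(level: GenerationLevel) -> list[GenerationLevel]:
    """Nested reading: a system at `level` also covers all lower levels."""
    return [lvl for lvl in GenerationLevel if lvl <= level]


# The eight stress-test dimensions listed in the summary (hypothetical constant name).
STRESS_TEST_DIMENSIONS = (
    "spatial structuring",
    "physical reasoning",
    "visual-textual integration",
    "multi-turn editing drift",
    "human-centric editing",
    "low-level vision",
    "cross-disciplinary applications",
    "high-level vision tasks",
)


if __name__ == "__main__":
    # List the competences implied by an agentic system under the nested reading.
    for lvl in implied_levels(GenerationLevel.AGENTIC):
        print(lvl.value, lvl.name)
```

Using an ordered enum keeps comparisons such as `GenerationLevel.AGENTIC > GenerationLevel.CONDITIONAL` meaningful, which matches the summary's description of each level adding a competence on top of the previous ones.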
References
- arXiv: https://arxiv.org/abs/2604.28185v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.28185v1
- Hugging Face Papers: https://huggingface.co/papers/2604.28185