Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
- URL: http://arxiv.org/abs/2512.08645v1
- Date: Tue, 09 Dec 2025 14:35:12 GMT
- Title: Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
- Authors: Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro
- Abstract summary: The Chain-of-Image Generation (CoIG) framework reframes image generation as a sequential, semantic process analogous to how humans create art. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive robustness compared to established baseline models.
- Score: 7.987662261007762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
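To make the workflow described in the abstract concrete, the following is a minimal Python sketch of the decompose-then-edit loop. The helper callables (decompose_prompt, generate_image, edit_image) are hypothetical placeholders rather than the authors' implementation, since the abstract presents CoIG as model-agnostic and usable with any LLM and image generation backend.

```python
# Minimal sketch of the CoIG loop described in the abstract, assuming generic
# backends. decompose_prompt, generate_image, and edit_image are hypothetical
# placeholders, not the authors' actual API.
from typing import Any, Callable, List, Tuple

def chain_of_image_generation(
    prompt: str,
    decompose_prompt: Callable[[str], List[str]],  # LLM: complex prompt -> simple step-by-step instructions
    generate_image: Callable[[str], Any],          # text-to-image model for the first step
    edit_image: Callable[[Any, str], Any],         # instruction-guided editing for later steps
) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Execute the plan step by step, keeping every intermediate image so each
    semantic step can be monitored (and, if needed, intervened on)."""
    steps = decompose_prompt(prompt)  # e.g. ["draw a red cube", "add a blue sphere to its left", ...]
    image, intermediates = None, []
    for i, step in enumerate(steps):
        image = generate_image(step) if i == 0 else edit_image(image, step)
        intermediates.append((step, image))  # one semantic entity per step -> directly monitorable
    return image, intermediates
```

The per-step record returned here is what the paper's monitorability metrics operate on: CoIG Readability scores how clearly each intermediate output reflects its instruction, and Causal Relevance quantifies how much each step influences the final image.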
Related papers
- HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm. It decomposes complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
arXiv Detail & Related papers (2025-11-25T06:24:25Z) - Conditional Panoramic Image Generation via Masked Autoregressive Modeling [35.624070746282186]
We propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence. Experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks.
arXiv Detail & Related papers (2025-05-22T16:20:12Z) - Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z) - Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models [52.84391764467939]
Unified generative models have shown remarkable performance in text and image generation. We introduce Chain of Thought (CoT) into unified generative models to address the challenges of complex image generation. Experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks.
arXiv Detail & Related papers (2025-03-03T08:36:16Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks.
Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation.
We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z) - CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation [12.211795836214112]
We propose a unique image generation process premised on the perspective of converting images into a set of point clouds.
Our methodology leverages a simple clustering method named Context Clustering (CoC) to generate images from unordered point sets.
We introduce this model, with its novel structure, as the Context Clustering Generative Adversarial Network (CoC-GAN).
arXiv Detail & Related papers (2023-08-23T01:19:58Z) - TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation [11.207512995742999]
One-shot image generation (OSG) with generative adversarial networks that learn from the internal patches of a given image has attracted worldwide attention.
We propose TcGAN, a novel structure-preserving method with an individual vision transformer, to overcome the shortcomings of existing one-shot image generation methods.
arXiv Detail & Related papers (2023-02-16T03:05:59Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Self-supervised Correlation Mining Network for Person Image Generation [9.505343361614928]
Person image generation aims to perform non-rigid deformation on source images.
We propose a Self-supervised Correlation Mining Network (SCM-Net) to rearrange the source images in the feature space.
To improve the fidelity of cross-scale pose transformation, we propose a graph-based Body Structure Retaining Loss.
arXiv Detail & Related papers (2021-11-26T03:57:46Z)