Related papers: Iterative Refinement Improves Compositional Image Generation

Iterative Refinement Improves Compositional Image Generation

URL: http://arxiv.org/abs/2601.15286v1
Date: Wed, 21 Jan 2026 18:59:40 GMT
Title: Iterative Refinement Improves Compositional Image Generation
Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak,
Abstract summary: Text-to-image (T2I) models struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes.<n>We propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps.<n>Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models.
Score: 47.116050084875106
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/

Related papers

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [77.13582457917418]
We train a generative model solely on grid images comprising subsampled frames.<n>We learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames.<n>Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.
arXiv Detail & Related papers (2025-12-24T16:46:04Z)
ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion [18.25085327318649]
We develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions.<n>We develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation.
arXiv Detail & Related papers (2025-11-24T04:10:53Z)
CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback [58.27353205269664]
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts.<n>However, they struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations.<n>We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships.
arXiv Detail & Related papers (2025-05-16T12:23:58Z)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.<n>We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.<n>We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency [41.87857129429512]
We introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations.<n>This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length.<n>In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework.
arXiv Detail & Related papers (2024-10-10T04:14:52Z)
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply original next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z)
DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years.<n> layout is employed as an intermedium to bridge large language models and layout-based diffusion models.<n>We introduce a divide-and-conquer approach which decouples the generation task into multiple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies.<n>AdaDiff is optimized using a policy method to maximize a carefully designed reward function.<n>We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation. The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.