AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models
- URL: http://arxiv.org/abs/2512.09081v1
- Date: Tue, 09 Dec 2025 19:54:06 GMT
- Title: AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models
- Authors: Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang
- Abstract summary: A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images. We propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations. We apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples.
- Score: 46.329647221942395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.
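The abstract outlines a two-stage recipe: an LLM agent uses generation, editing, and VQA tools to construct contrastive compositional data, which then drives preference optimization of the text-to-image model. Below is a minimal Python sketch of what such a data-construction loop could look like; the tool functions (`generate_image`, `edit_image`, `vqa_check`) and the pairing logic are hypothetical stand-ins, not AgentComp's actual interfaces.

```python
# Hypothetical sketch of an agentic loop that builds compositional
# preference pairs (prompt, preferred image, rejected image). Every
# tool below is a placeholder, not AgentComp's actual tooling.

def generate_image(prompt: str) -> str:
    """Stand-in for a text-to-image generation tool; returns an image handle."""
    return f"img({prompt})"

def edit_image(image: str, instruction: str) -> str:
    """Stand-in for an image-editing tool (e.g., swap one attribute)."""
    return f"edit({image}, {instruction})"

def vqa_check(image: str, question: str) -> bool:
    """Stand-in for a VQA tool that verifies a fine-grained detail."""
    return "red" in image  # toy heuristic for illustration only

def build_preference_pair(prompt: str, detail_question: str, perturbation: str):
    """Create a (chosen, rejected) pair differing in one compositional detail."""
    preferred = generate_image(prompt)
    if not vqa_check(preferred, detail_question):
        return None  # a full agent would retry or repair here
    # Rejected sample: compositionally similar, but one detail altered.
    rejected = edit_image(preferred, perturbation)
    return {"prompt": prompt, "chosen": preferred, "rejected": rejected}

pair = build_preference_pair(
    "a red cube on a blue sphere",
    "Is the cube red?",
    "make the cube blue",
)
print(pair)
```

Pairs of this shape are exactly what a DPO-style preference objective consumes, which is presumably how the fine-tuning stage distinguishes compositionally similar samples.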
Related papers
- CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback [58.27353205269664]
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships.
arXiv Detail & Related papers (2025-05-16T12:23:58Z)
- VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better human-evaluated image quality and greater robustness as the number of binding pairs in the prompt grows.
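A rough sketch of the pipeline this summary describes: decompose the prompt, generate an image per sub-prompt, average their embeddings into a visual prototype, and fuse it with the text embedding. The helpers and the weighted-average fusion rule below are illustrative assumptions, not VSC's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(sub_prompt: str) -> str:        # stand-in T2I call
    return f"img({sub_prompt})"

def embed_image(image: str) -> np.ndarray:   # stand-in image encoder
    return rng.standard_normal(512)

def embed_text(prompt: str) -> np.ndarray:   # stand-in text encoder
    return rng.standard_normal(512)

def fused_conditioning(prompt: str, sub_prompts: list[str], alpha: float = 0.3):
    """Fuse a visual prototype (mean image embedding over sub-prompt
    generations) with the text embedding of the full prompt."""
    prototype = np.mean([embed_image(generate(p)) for p in sub_prompts], axis=0)
    text = embed_text(prompt)
    return (1 - alpha) * text + alpha * prototype  # toy fusion rule

cond = fused_conditioning(
    "a red cube next to a blue sphere",
    ["a red cube", "a blue sphere"],
)
print(cond.shape)  # (512,)
```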
arXiv Detail & Related papers (2025-05-02T08:31:43Z)
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models. We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
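A skeleton of the closed loop the summary sketches: sample a gallery of models, collect composition-aware rankings, update a reward model, and fine-tune the base model each round. All functions here are hypothetical placeholders, not IterComp's implementation.

```python
# Skeleton of a closed-loop, composition-aware feedback cycle.
# Every function body is a toy stub for illustration.

def sample_gallery(prompt, models):
    """Generate one image per gallery model for the same prompt."""
    return [(m, f"img({m}, {prompt})") for m in models]

def rank_by_composition(samples):
    """Stand-in for composition-aware preference collection."""
    return sorted(samples, key=lambda s: len(s[1]))  # toy ranking

def update_reward_model(reward, ranked):
    return reward + 1  # placeholder: fit reward model to new rankings

def finetune(base, reward):
    return f"{base}+r{reward}"  # placeholder: reward-guided fine-tune

base, reward = "base-t2i", 0
for round_idx in range(3):  # iterative refinement loop
    ranked = rank_by_composition(
        sample_gallery("two cats on a sofa", [base, "model-a", "model-b"]))
    reward = update_reward_model(reward, ranked)
    base = finetune(base, reward)  # improved model rejoins the gallery
print(base)
```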
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network on top of existing visual and language encoders using a small dataset.
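As a toy illustration of a lightweight head trained over frozen encoders, one could project per-component features from both modalities into a shared space and apply a contrastive objective. The architecture below is an assumption for illustration, not ComAlign's actual network.

```python
import torch
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Hypothetical lightweight head over frozen encoder features."""
    def __init__(self, dim: int = 512, proj: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(dim, proj)
        self.txt_proj = nn.Linear(dim, proj)

    def forward(self, img_feats, txt_feats):
        # Project per-component features (objects, attributes, relations)
        # from both modalities into a shared space and L2-normalize.
        i = nn.functional.normalize(self.img_proj(img_feats), dim=-1)
        t = nn.functional.normalize(self.txt_proj(txt_feats), dim=-1)
        return i @ t.transpose(-2, -1)  # component-level similarity matrix

head = AlignmentHead()
img = torch.randn(4, 512)  # e.g., 4 image-side components from a frozen encoder
txt = torch.randn(4, 512)  # e.g., 4 text-side components from a frozen encoder
sim = head(img, txt)
# A contrastive loss pulls matched components (the diagonal) together.
loss = nn.functional.cross_entropy(sim, torch.arange(4))
print(sim.shape, loss.item())
```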
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings [46.723653095494896]
We investigate compositional attribute binding failures in text-to-image generative models. We show that imperfect text conditioning with the CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. Our main finding is that significant compositional improvements can be achieved without harming the model's FID score.
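One way to act on such a finding is to refine the prompt embedding at test time before it conditions the model. The differentiable `binding_score` below is a made-up stand-in used purely to illustrate the idea, not the paper's actual method or objective.

```python
import torch

# Toy sketch of test-time refinement of a prompt embedding, assuming
# access to some differentiable scorer of attribute binding.

def binding_score(embedding: torch.Tensor) -> torch.Tensor:
    """Stand-in for a differentiable measure of attribute binding."""
    target = torch.ones_like(embedding)  # hypothetical 'well-bound' direction
    return -torch.nn.functional.mse_loss(embedding, target)

emb = torch.zeros(1, 77, 16, requires_grad=True)  # stand-in CLIP-like tokens
opt = torch.optim.Adam([emb], lr=0.1)
for _ in range(50):  # refine the embedding before conditioning the model
    opt.zero_grad()
    loss = -binding_score(emb)
    loss.backward()
    opt.step()
print(binding_score(emb).item())  # closer to 0 = better toy score
```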
arXiv Detail & Related papers (2024-06-12T03:21:34Z)
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
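A toy plan-generate-verify-correct loop in the spirit of a training-free LLM-agent pipeline; the planner, generator, verifier, and correction step are all hypothetical stand-ins, not CompAgent's components.

```python
# Toy divide-and-conquer loop: plan sub-scenes, compose, verify, correct.

def plan(prompt):            # LLM decomposes the scene into parts
    return [p.strip() for p in prompt.split(" and ")]

def generate(parts):         # compose the image from planned parts
    return "img[" + "; ".join(parts) + "]"

def verify(image, prompt):   # VQA/critic check of the composition
    return all(part in image for part in prompt.split(" and "))

def correct(parts, prompt):  # agent revises the plan on failure
    return parts + [prompt]  # fallback: add the full prompt as a constraint

prompt = "a dog wearing glasses and a cat on a skateboard"
parts = plan(prompt)
for attempt in range(3):     # bounded self-correction loop
    image = generate(parts)
    if verify(image, prompt):
        break
    parts = correct(parts, prompt)
print(image)
```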
arXiv Detail & Related papers (2024-01-28T16:18:39Z)
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
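A minimal sketch of how a benchmark organized into these four categories might be consumed by an evaluation harness; the prompts and the scorer below are placeholders, not the actual T2I-CompBench++ data or metrics.

```python
# Hypothetical harness over the four prompt categories named above.

BENCH = {
    "attribute_binding":   ["a red book and a yellow vase"],
    "object_relationship": ["a cup on top of a laptop"],
    "generative_numeracy": ["three apples on a plate"],
    "complex_composition": ["two black cats beside a small green chair"],
}

def generate(prompt):                # stand-in T2I model
    return f"img({prompt})"

def score(image, prompt, category):  # stand-in metric (e.g., VQA-based)
    return 1.0 if prompt in image else 0.0

results = {}
for category, prompts in BENCH.items():
    scores = [score(generate(p), p, category) for p in prompts]
    results[category] = sum(scores) / len(scores)  # per-category average
print(results)
```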
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [78.28620571530706]
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks.
We improve the compositional skills of T2I models, specifically enabling more accurate attribute binding and better image compositions.
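As a toy illustration of training-free structured conditioning, one could encode each parsed noun phrase separately so attributes stay bound to their objects, rather than encoding the prompt once. The parser and encoder below are stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def parse_noun_phrases(prompt):  # stand-in for a constituency parse
    return [p.strip() for p in prompt.split(" and ")]

def encode(text):                # stand-in text encoder
    return rng.standard_normal(8)

def structured_conditioning(prompt):
    """Average per-phrase encodings so each concept keeps its attributes."""
    phrases = parse_noun_phrases(prompt)
    return np.mean([encode(p) for p in phrases], axis=0), phrases

cond, phrases = structured_conditioning("a red car and a white sheep")
print(phrases, cond.shape)
```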
arXiv Detail & Related papers (2022-12-09T18:30:24Z)