Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help
- URL: http://arxiv.org/abs/2503.06884v1
- Date: Mon, 10 Mar 2025 03:28:18 GMT
- Title: Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help
- Authors: Yuefan Cao, Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang,
- Abstract summary: We introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models.<n>Our evaluations reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.
- Score: 18.70937620674227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
Related papers
- D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity.
In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator.
In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z) - IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models [52.73820275861131]
Text-to-image(T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation.<n>Recent models such as FLUX.1 and Ideogram2.0 have demonstrated exceptional performance across various complex tasks.<n>This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability.
arXiv Detail & Related papers (2025-01-23T18:58:33Z) - Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z) - Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
arXiv Detail & Related papers (2024-08-21T15:51:46Z) - Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z) - A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series [17.08674819906415]
We introduce HILAD, a novel framework designed to foster a dynamic and bidirectional collaboration between humans and AI.
Through our visual interface, HILAD empowers domain experts to detect, interpret, and correct unexpected model behaviors at scale.
arXiv Detail & Related papers (2024-05-06T07:44:07Z) - Counterfactual Image Editing [54.21104691749547]
Counterfactual image editing is an important task in generative AI, which asks how an image would look if certain features were different.
We formalize the counterfactual image editing task using formal language, modeling the causal relationships between latent generative factors and images.
We develop an efficient algorithm to generate counterfactual images by leveraging neural causal models.
arXiv Detail & Related papers (2024-02-07T20:55:39Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion
Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - Not Just Pretty Pictures: Toward Interventional Data Augmentation Using Text-to-Image Generators [12.053125079460234]
We show how modern T2I generators can be used to simulate arbitrary interventions over such environmental factors.
Our empirical findings demonstrate that modern T2I generators like Stable Diffusion can indeed be used as a powerful interventional data augmentation mechanism.
arXiv Detail & Related papers (2022-12-21T18:07:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.