Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
- URL: http://arxiv.org/abs/2507.08039v1
- Date: Wed, 09 Jul 2025 18:40:17 GMT
- Title: Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
- Authors: Sujith Vemishetty, Advitiya Arora, Anupama Sharma
- Abstract summary: This paper aims to establish a comprehensive evaluation framework for Text-to-Image models. We created a novel dataset designed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset designed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images; these descriptions are then passed to the Text-to-Image models to generate artificial images. We then pass these generated images through gpt-4o again, using the same system prompt, and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. We also show, using VAEs pre-trained on our dataset, that these models fail to generate images that follow our input dataset's distribution.
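The description-regeneration loop the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' released code: `describe_image` and `generate_image` are hypothetical wrappers around gpt-4o and a Text-to-Image model, and the Jaccard word-overlap score is a simple stand-in for whatever comparison the paper actually uses between the two descriptions.

```python
def token_overlap(desc_a: str, desc_b: str) -> float:
    """Jaccard similarity over lowercase word sets: a crude stand-in
    for the paper's comparison between the two gpt-4o descriptions."""
    a, b = set(desc_a.lower().split()), set(desc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def evaluate_prompt_adherence(ground_truth_images, describe_image, generate_image):
    """For each ground-truth image: describe it (gpt-4o), regenerate an
    image from that description (T2I model, e.g. SD3 or Janus Pro),
    re-describe the result with the same system prompt, and score how
    closely the second description matches the first."""
    scores = []
    for img in ground_truth_images:
        desc_gt = describe_image(img)            # gpt-4o on ground truth
        regenerated = generate_image(desc_gt)    # Text-to-Image model
        desc_gen = describe_image(regenerated)   # gpt-4o, same system prompt
        scores.append(token_overlap(desc_gt, desc_gen))
    return sum(scores) / len(scores)             # mean adherence score
```

A model that perfectly reproduces the factors of variation (shape and location, in the paper's binary dataset) would score near 1.0; the paper's finding is that even on such simple two-factor prompts the tested models fall well short.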
Related papers
- STRICT: Stress Test of Rendering Images Containing Text [11.236527918747925]
STRICT is a benchmark designed to stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities.
arXiv Detail & Related papers (2025-05-25T05:37:08Z)
- DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation [0.13124513975412253]
We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models. Our approach begins by translating images into detailed textual descriptions using a captioning model. These descriptions are then used to produce new test images through a text-to-image diffusion process.
arXiv Detail & Related papers (2025-02-05T16:35:42Z)
- Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion [3.399289369740637]
This paper presents a pioneering study on post-training pruning of Stable Diffusion 2.
It addresses the critical need for model compression in the text-to-image domain.
We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%.
arXiv Detail & Related papers (2024-11-22T18:29:37Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.