FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
- URL: http://arxiv.org/abs/2512.02161v1
- Date: Mon, 01 Dec 2025 19:46:03 GMT
- Title: FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
- Authors: Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein
- Abstract summary: We propose a structured methodology for evaluating text-to-image (T2I) models and vision language models (VLMs). We test whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our findings suggest that current metrics are insufficient to capture these nuanced errors.
- Score: 85.24983823102262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.
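To make the evaluation protocol concrete, below is a minimal sketch of how such a joint T2I/VLM evaluation loop could be wired together. The wrappers `generate_image`, `ask_vlm`, and `grade_with_llm`, as well as the example prompt and failure mode, are hypothetical placeholders standing in for calls to a T2I model (e.g. SD3.5-Large), a VLM judge (e.g. Pixtral), and an LLM grader (e.g. Llama3); this is not the paper's released code.

```python
# Minimal sketch of a FineGRAIN-style joint evaluation loop.
# `generate_image`, `ask_vlm`, and `grade_with_llm` are hypothetical stand-ins
# for a T2I model, a VLM judge, and an LLM grader; they are NOT the paper's code.
from dataclasses import dataclass

@dataclass
class PromptCase:
    prompt: str            # challenging T2I prompt
    failure_mode: str      # one of the 27 failure modes this prompt probes

def evaluate_case(case: PromptCase, generate_image, ask_vlm, grade_with_llm) -> dict:
    """Generate an image, ask the VLM about the targeted failure mode,
    and let an LLM grade whether the VLM identified it correctly."""
    image = generate_image(case.prompt)
    question = (
        f"The image was generated from the prompt: '{case.prompt}'. "
        f"Does it exhibit the failure mode '{case.failure_mode}'? Explain briefly."
    )
    vlm_answer = ask_vlm(image, question)
    verdict = grade_with_llm(case.prompt, case.failure_mode, vlm_answer)  # e.g. "correct"/"incorrect"
    return {"prompt": case.prompt, "failure_mode": case.failure_mode,
            "vlm_answer": vlm_answer, "verdict": verdict}

# Example usage with stub callables (replace the lambdas with real model calls):
if __name__ == "__main__":
    case = PromptCase("three red cubes on a blue table", "incorrect object count")
    result = evaluate_case(
        case,
        generate_image=lambda p: f"<image for: {p}>",
        ask_vlm=lambda img, q: "There are only two cubes, so the count is wrong.",
        grade_with_llm=lambda p, fm, a: "correct",
    )
    print(result)
```

In practice, the per-case verdicts would be aggregated per failure mode and per model pair to compare both T2I prompt adherence and VLM judging accuracy.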
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning.
We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment.
We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z)
- AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs [2.357397994148727]
Multimodal large language models (MLLMs) and image generation models (IGMs) are investigated.
We create a novel benchmark to compare failure modes across image-to-text and text-to-image tasks.
Our results show that failure modes are often shared between models and modalities, but certain failures are model-specific and modality-specific.
arXiv Detail & Related papers (2026-01-20T00:06:58Z)
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error.
Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation.
We evaluate 20 advanced Vision-Language Models; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z)
- PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions [55.95282725491425]
PoSh is a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge.
PoSh is replicable, interpretable, and a better proxy for human raters than existing metrics.
We show that PoSh achieves stronger correlations with human judgments in DOCENT than the best open-weight alternatives.
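As a rough illustration of the scene-graph-as-rubric idea (not PoSh's actual rubric format), the snippet below turns a tiny scene graph into yes/no rubric questions that an LLM judge could score a candidate description against; the graph schema and question templates are assumptions made for this example.

```python
# Illustrative sketch only: converting a small scene graph into rubric-style
# yes/no questions for an LLM-as-a-judge. The schema and templates are
# assumptions for the example, not PoSh's released rubric format.
scene_graph = {
    "objects": {
        "dog": {"attributes": ["brown", "small"]},
        "ball": {"attributes": ["red"]},
    },
    "relations": [("dog", "chasing", "ball")],
}

def rubric_questions(graph: dict) -> list:
    questions = []
    for obj, info in graph["objects"].items():
        questions.append(f"Does the description mention a {obj}?")
        for attr in info["attributes"]:
            questions.append(f"Does the description say the {obj} is {attr}?")
    for subj, rel, tgt in graph["relations"]:
        questions.append(f"Does the description state that the {subj} is {rel} the {tgt}?")
    return questions

for q in rubric_questions(scene_graph):
    print(q)  # each question would be posed to the LLM judge alongside the candidate description
```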
arXiv Detail & Related papers (2025-10-21T20:30:20Z)
- Test-time Prompt Refinement for Text-to-Image Models [14.505841027491114]
We introduce TIR, a test-time prompt refinement framework that requires no additional training of the underlying T2I model.
In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt.
We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
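A minimal sketch of such a closed-loop refinement step follows; the callables `t2i`, `mllm_critique`, and `mllm_rewrite` are hypothetical wrappers around a black-box T2I model and a pretrained MLLM, not TIR's implementation.

```python
# Sketch of closed-loop, test-time prompt refinement of the kind described above.
# `t2i`, `mllm_critique`, and `mllm_rewrite` are hypothetical callables, not TIR's code.
def refine_and_generate(user_prompt: str, t2i, mllm_critique, mllm_rewrite,
                        max_rounds: int = 3):
    prompt = user_prompt
    image = t2i(prompt)
    for _ in range(max_rounds):
        critique = mllm_critique(image, user_prompt)          # e.g. "the cat is missing"
        if critique.strip().lower() == "ok":                  # MLLM judges the image faithful
            break
        prompt = mllm_rewrite(user_prompt, prompt, critique)  # repair the prompt from the critique
        image = t2i(prompt)                                   # regenerate with the refined prompt
    return image, prompt
```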
arXiv Detail & Related papers (2025-07-22T20:30:13Z)
- BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics.
We use procedural generation of synthetic images to obtain control over visual attributes.
This diagnostic allows systematic stress testing and fine-grained failure analysis.
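The toy sketch below shows the general flavor of procedurally generating an image whose attributes (object count, colors, positions) are known exactly, so diagnostic questions can be checked against ground truth; the choice of shapes and colors here is arbitrary and is not BYO-Eval's actual generator.

```python
# Toy sketch of procedural image generation with fully controlled attributes,
# so ground truth (count, color, position) is known exactly. Not BYO-Eval's code.
import random
from PIL import Image, ImageDraw

def make_sample(n_circles: int, colors=("red", "green", "blue"), size=256, seed=0):
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    ground_truth = []
    for _ in range(n_circles):
        color = rng.choice(colors)
        x, y, r = rng.randint(20, size - 20), rng.randint(20, size - 20), rng.randint(8, 20)
        draw.ellipse([x - r, y - r, x + r, y + r], fill=color)
        ground_truth.append({"shape": "circle", "color": color, "center": (x, y)})
    # Ask the VLM e.g. "How many red circles are there?" and check against ground_truth.
    return img, ground_truth

img, gt = make_sample(n_circles=5)
img.save("synthetic_sample.png")
print(gt)
```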
arXiv Detail & Related papers (2025-06-05T12:43:10Z)
- Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model [0.5497663232622965]
We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products.
For new products, our method employs In-Context Learning, which allows the model to perform inspections given an example non-defective or defective image.
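As an illustration of assembling such an in-context inspection prompt, the sketch below interleaves a few labeled example images with the query image; the message schema and the `vlm_chat` call are hypothetical and would need to be adapted to whichever VLM API is actually used.

```python
# Sketch of building an in-context inspection prompt: labeled example images of a
# new product followed by the query image. The message schema and `vlm_chat` call
# are hypothetical placeholders, not a specific VLM API.
def build_inspection_messages(examples, query_image_path):
    """`examples` is a list of (image_path, label) pairs, label in {"defective", "non-defective"}."""
    messages = [{"role": "system",
                 "content": "You are a visual inspector. Answer 'defective' or 'non-defective'."}]
    for image_path, label in examples:
        messages.append({"role": "user", "content": [{"type": "image", "path": image_path}]})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": [{"type": "image", "path": query_image_path}]})
    return messages

# messages = build_inspection_messages(
#     [("ok_1.png", "non-defective"), ("scratch_1.png", "defective")], "query.png")
# answer = vlm_chat(messages)   # hypothetical VLM call
```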
arXiv Detail & Related papers (2025-02-13T08:11:10Z)
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps.
Our approach derives its strength from being modular in nature and training-free, and it can be applied over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
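A minimal sketch of the regenerate-and-compare idea follows; `describe_image`, `t2i`, and `embed_image` are hypothetical wrappers (e.g. around GPT4V, the T2I model under test, and an image encoder such as CLIP), and cosine similarity is used here only as one possible comparison score.

```python
# Sketch of the regeneration-and-compare idea: describe a reference image with an
# MLLM, regenerate it with the T2I model under test, then score image similarity.
# `describe_image`, `t2i`, and `embed_image` are hypothetical wrappers.
import numpy as np

def regeneration_score(reference_image, describe_image, t2i, embed_image) -> float:
    caption = describe_image(reference_image)   # MLLM bridges image -> text
    regenerated = t2i(caption)                  # T2I model under evaluation
    a, b = embed_image(reference_image), embed_image(regenerated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```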
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) [62.44395685571094]
We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images.
These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count.
We find that the state-of-the-art VLM-based metrics fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore.
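The ordering test can be illustrated as follows: within one semantic error graph, a faithful metric's scores should decrease as the objective error count grows, which can be checked with a rank correlation. The numbers below are made-up placeholders, not data from the paper.

```python
# Illustrative ordering check for a prompt-faithfulness metric: within one
# semantic error graph, scores should fall as the error count rises, i.e.
# correlate negatively with it. The values below are made-up placeholders.
from scipy.stats import spearmanr

error_counts  = [0, 1, 2, 3, 4]                  # objective number of errors per image
metric_scores = [0.92, 0.88, 0.90, 0.71, 0.65]   # candidate metric, higher = judged more faithful

rho, p_value = spearmanr(metric_scores, error_counts)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")  # strongly negative rho => metric orders images correctly
```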
arXiv Detail & Related papers (2024-04-05T17:57:16Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress by leveraging advanced large vision-language (VL) models for the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)