MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts
- URL: http://arxiv.org/abs/2510.00796v1
- Date: Wed, 01 Oct 2025 11:51:13 GMT
- Title: MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts
- Authors: Yifan Shen, Yangyang Shu, Hye-young Paik, Yulei Sui
- Abstract summary: Text-to-image (T2I) models struggle with maintaining semantic consistency when input prompts undergo linguistic variations. We propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images.
- Score: 13.010772460971374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model's logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even SOTA text-to-image models such as Flux.dev and DALLE-3 exhibit misalignment rates of 59 percent and 71 percent, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
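A minimal sketch of the metamorphic-testing idea described in the abstract, not the paper's actual implementation: generate one image from a prompt and one from a logically equivalent rewrite, then flag the pair when the two images diverge. Here `generate_image` and `rewrite_equivalently` are hypothetical stand-ins for a T2I model call and a prompt rewriter, and the CLIP image-embedding comparison with a 0.85 threshold is an assumed placeholder for the framework's own alignment check.

```python
# Sketch of metamorphic testing for T2I robustness (assumptions noted in comments).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP used only as an illustrative image-pair comparator.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of a generated pair."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float((emb[0] @ emb[1]).item())

def metamorphic_check(prompt, generate_image, rewrite_equivalently, threshold=0.85):
    """Compare images generated from a prompt and a logically equivalent rewrite.

    `generate_image(prompt) -> PIL.Image` and `rewrite_equivalently(prompt) -> str`
    are hypothetical callables supplied by the caller; the threshold is arbitrary.
    """
    variant = rewrite_equivalently(prompt)  # e.g. "a cat left of a dog" -> "a dog right of a cat"
    img_a, img_b = generate_image(prompt), generate_image(variant)
    score = image_similarity(img_a, img_b)
    return {"prompt": prompt, "variant": variant,
            "similarity": score, "flag_misaligned": score < threshold}
```

Pairs flagged this way would still need the kind of error categorisation the paper describes (entity omission, duplication, positional misalignment) before they can serve as debugging counterexamples.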
Related papers
- Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? [61.533560295383786]
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. We observe that U-MLLMs fail to maintain semantic equivalence when required to render the same results in the image modality. We introduce VGUBench, a framework to decouple reasoning logic from generation fidelity.
arXiv Detail & Related papers (2026-02-27T06:23:56Z) - Asymmetric Idiosyncrasies in Multimodal Models [22.359102255231004]
We study idiosyncrasies in the caption models and their downstream impact on text-to-image models. Our results show that text classification yields very high accuracy (99.70%). Our framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
arXiv Detail & Related papers (2026-02-26T08:16:47Z) - A Theorem-Proving-Based Evaluation of Neural Semantic Parsing [4.422349568747053]
We reassess evaluation by pairing graph-matching with automated theorem proving. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness.
arXiv Detail & Related papers (2025-10-13T10:09:38Z) - Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z) - Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP. We revamp the gradient-based attribution computation method, enabling the negative gradients of individual text tokens to indicate misalignment. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
arXiv Detail & Related papers (2024-12-24T12:51:05Z) - HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE). HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z) - Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z) - Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z) - Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) [62.44395685571094]
We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images.
These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count (a minimal sketch of this ordering check follows the list below).
We find that the state-of-the-art VLM-based metrics fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore.
arXiv Detail & Related papers (2024-04-05T17:57:16Z) - Semantic-Syntactic Discrepancy in Images (SSDI): Learning Meaning and Order of Features from Natural Images [7.148054923510877]
We propose the concept of "image grammar", comprising "image semantics" and "image syntax". We present a semi-supervised two-stage method for learning the image grammar of visual elements and environments solely from natural images. The efficacy of the proposed approach is then demonstrated by achieving detection rates ranging from 70% to 90% on corruptions generated from CelebA and SUN-RGBD datasets.
arXiv Detail & Related papers (2024-01-31T00:16:02Z) - NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
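As a rough illustration of the ordering check referenced in the T2IScoreScore entry above, the sketch below correlates a faithfulness metric's scores with each image's objective error count. The data is made up, and Spearman rank correlation is one plausible way to quantify ordering agreement, not necessarily the benchmark's exact statistic.

```python
# Illustrative ordering check in the spirit of T2IScoreScore: a good faithfulness
# metric should rank images consistently with their objective error counts.
from scipy.stats import spearmanr

# Hypothetical data for one error graph: error counts are objective annotations,
# metric scores are made-up outputs of a CLIPScore-like metric (higher = more faithful).
error_counts = [0, 1, 2, 3, 4]
metric_scores = [0.91, 0.86, 0.84, 0.71, 0.65]

rho, p_value = spearmanr(metric_scores, error_counts)
# Ideal behaviour is rho close to -1.0: scores fall monotonically as errors rise.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```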