Related papers: Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

URL: http://arxiv.org/abs/2601.04946v2
Date: Sat, 10 Jan 2026 09:28:13 GMT
Title: Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Authors: Subhadeep Roy, Gagan Bhatia, Steffen Eger
Abstract summary: We study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias, spanning Animals, Objects, and Demography images. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs. We propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking.
Score: 25.374192139098284
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running orders of magnitude faster than GPT-5 at inference time and approaching the robustness of much larger closed-source judges.
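The directional evaluation described in the abstract can be illustrated with a minimal sketch. It assumes a metric is any function that maps a prompt and an image to a score where higher means a better match. The names ContrastivePair, misranking_rate, mean_margin, and toy_metric, the example pairs, and the toy scores are all hypothetical; this is not the authors' code, the ProtoBias data format, or ProtoScore. Counting ties as failures is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ContrastivePair:
    """One benchmark item (hypothetical schema, not the ProtoBias format)."""
    prompt: str
    correct_image: str      # semantically correct but non-prototypical
    adversarial_image: str  # subtly incorrect but prototypical


# A metric maps (prompt, image identifier) to a score; higher means a better match.
Metric = Callable[[str, str], float]


def misranking_rate(pairs: Sequence[ContrastivePair], metric: Metric) -> float:
    """Fraction of pairs where the correct image does not score strictly higher.

    Assumption: ties count as failures, because the metric did not prefer
    the semantically correct image.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(
        1
        for p in pairs
        if metric(p.prompt, p.correct_image) <= metric(p.prompt, p.adversarial_image)
    )
    return failures / len(pairs)


def mean_margin(pairs: Sequence[ContrastivePair], metric: Metric) -> float:
    """Average of score(correct) - score(adversarial); negative favors prototypes."""
    if not pairs:
        raise ValueError("need at least one pair")
    total = sum(
        metric(p.prompt, p.correct_image) - metric(p.prompt, p.adversarial_image)
        for p in pairs
    )
    return total / len(pairs)


# Made-up scores standing in for a real metric; they are not measured values.
TOY_SCORES = {
    "black_swan.png": 0.4, "white_swan.png": 0.6,
    "stool_three_legs.png": 0.7, "stool_four_legs.png": 0.5,
    "green_apple.png": 0.5, "red_apple.png": 0.5,
    "blue_banana.png": 0.9, "yellow_banana.png": 0.6,
}


def toy_metric(prompt: str, image: str) -> float:
    """Looks up a fixed score per image and ignores the prompt (toy only)."""
    return TOY_SCORES[image]


PAIRS = [
    ContrastivePair("a black swan", "black_swan.png", "white_swan.png"),
    ContrastivePair("a three-legged stool", "stool_three_legs.png", "stool_four_legs.png"),
    ContrastivePair("a green apple", "green_apple.png", "red_apple.png"),
    ContrastivePair("a blue banana", "blue_banana.png", "yellow_banana.png"),
]

if __name__ == "__main__":
    print("misranking rate:", misranking_rate(PAIRS, toy_metric))
    print("mean margin:", mean_margin(PAIRS, toy_metric))
```

To evaluate a real metric, one would replace toy_metric with a wrapper around that metric's own scoring call and load image paths from the benchmark being studied. The mean margin complements the failure rate by showing how decisively a metric separates the two images, which is the quantity the abstract compares against human decision margins.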

Related papers

How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation [0.38991526486631006]
We show that when preference signal is diffuse across prompts, proportional allocation is minimax-optimal. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence.
arXiv Detail & Related papers (2026-01-14T02:34:58Z)
Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation [12.030059666003972]
We introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings.
arXiv Detail & Related papers (2025-12-10T09:19:17Z)
Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation [13.460909458745379]
We present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges. Results show that no single metric performs consistently across tasks.
arXiv Detail & Related papers (2025-09-25T14:31:09Z)
Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation [116.86965910589775]
We show that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores. This suggests that current bias evaluations reflect model responses to spurious features rather than gender bias.
arXiv Detail & Related papers (2025-09-09T11:14:11Z)
A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
arXiv Detail & Related papers (2025-04-09T11:46:41Z)
Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs. We propose a new evaluation methodology that accounts for the groundedness of predictions. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z)
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) [62.44395685571094]
We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count. We find that the state-of-the-art VLM-based metrics fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore.
arXiv Detail & Related papers (2024-04-05T17:57:16Z)
Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. In this paper, we study if there are any deficiencies in reference-free metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
Rethinking FID: Towards a Better Evaluation Metric for Image Generation [43.66036053597747]
The Fréchet Inception Distance (FID) estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel.
arXiv Detail & Related papers (2023-11-30T19:11:01Z)
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods. We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets [52.77024349608834]
Vision-language models can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. COCO Captions is the most commonly used dataset for evaluating bias between background context and the gender of people in-situ. We propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets.
arXiv Detail & Related papers (2023-05-24T17:59:18Z)
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video. Recent studies have found that current benchmark datasets may have obvious moment annotation biases. We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
Evaluating and Mitigating Bias in Image Classifiers: A Causal Perspective Using Counterfactuals [27.539001365348906]
We present a method for generating counterfactuals by incorporating a structural causal model (SCM) in an improved variant of Adversarially Learned Inference (ALI). We show how to explain a pre-trained machine learning classifier, evaluate its bias, and mitigate the bias using a counterfactual regularizer.
arXiv Detail & Related papers (2020-09-17T13:19:31Z)
