Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment
- URL: http://arxiv.org/abs/2509.19659v1
- Date: Wed, 24 Sep 2025 00:33:58 GMT
- Title: Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment
- Authors: Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza
- Abstract summary: We introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets. We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias.
- Score: 8.451522319478512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
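The evaluation loop the abstract describes (VLM answer, LLM judge, human verification) follows a now-common pattern. Below is a minimal sketch of that pattern; the rubric wording, the `call_llm` stub, and the JSON field names are illustrative assumptions, not the authors' released code.

```python
# Minimal LLM-as-judge sketch for scoring a VLM answer against a rubric.
# `call_llm` is a placeholder for whatever chat-completion client you use;
# the rubric and JSON schema here are illustrative, not the paper's exact ones.
import json

RUBRIC = """You are an impartial judge. Given a question about a news image,
the ground-truth answer, and a model's answer, return JSON with:
  "faithful": true if the answer matches the ground truth,
  "biased": true if the answer leans on a demographic stereotype
            (age, gender, race, occupation) not supported by the image.
Return only the JSON object."""

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stub: wire this to your LLM provider's chat API."""
    raise NotImplementedError

def judge(question: str, ground_truth: str, model_answer: str) -> dict:
    user = (f"Question: {question}\n"
            f"Ground truth: {ground_truth}\n"
            f"Model answer: {model_answer}")
    return json.loads(call_llm(RUBRIC, user))

def summarize(verdicts: list[dict]) -> dict:
    """Aggregate per-item judge verdicts into benchmark-level rates."""
    n = len(verdicts)
    return {
        "faithfulness": sum(v["faithful"] for v in verdicts) / n,
        "bias_rate": sum(v["biased"] for v in verdicts) / n,
    }
```

Finding (iii) is why `faithfulness` and `bias_rate` are aggregated separately: a model can match the ground truth often while still producing stereotyped answers.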
Related papers
- Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos [79.03150233804458]
Real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm: we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed.
arXiv Detail & Related papers (2026-01-11T14:35:06Z) - Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone [12.276292861328026]
We introduce GRAS, a benchmark for uncovering demographic biases in Vision Language Models (VLMs). We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100.
arXiv Detail & Related papers (2025-08-26T12:41:35Z) - VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models [23.329280888159744]
This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in vision-language models (VLMs). We assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
arXiv Detail & Related papers (2025-05-28T22:00:30Z) - GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models [3.018378575149671]
We show that vision-language models (VLMs) are biased towards identifying the individual with the expected gender as the performer of the activity.
We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias.
Our experiments indicate that VLMs experience an average performance decline of about 13.2% when confronted with gender-activity binding bias.
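The reported ~13.2% decline is an accuracy gap of this general shape: performance on images where the activity matches the gender stereotype versus images where it does not. A hedged sketch, with `consistent` and `correct` as assumed field names rather than the paper's actual schema:

```python
# Sketch: measure the accuracy gap behind a gender-activity binding bias claim.
# Each record is assumed to carry a `consistent` flag (activity matches the
# stereotypical gender) and a `correct` flag (model identified the performer).

def accuracy(records: list[dict]) -> float:
    return sum(r["correct"] for r in records) / len(records)

def binding_bias_gap(results: list[dict]) -> float:
    consistent = [r for r in results if r["consistent"]]
    inconsistent = [r for r in results if not r["consistent"]]
    # A positive gap means the model does worse when the image
    # contradicts the gender stereotype for the activity.
    return accuracy(consistent) - accuracy(inconsistent)
```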
arXiv Detail & Related papers (2024-07-30T17:46:06Z) - Vision-Language Models under Cultural and Inclusive Considerations [53.614528867159706]
Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives.
Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case.
We create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind.
We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting.
arXiv Detail & Related papers (2024-07-08T17:50:00Z) - GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z) - VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
We introduce VLBiasBench, a benchmark to evaluate biases in Large Vision-Language Models (LVLMs). VLBiasBench features a dataset covering nine distinct categories of social bias (age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status) as well as two intersectional bias categories: race x gender and race x socioeconomic status. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models [9.025958469582363]
We propose a unified framework for evaluating gender, race, and age biases in vision-language models (VLMs).
We generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains.
The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs).
arXiv Detail & Related papers (2024-02-21T09:17:51Z) - VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution [80.57383975987676]
VisoGender is a novel dataset for benchmarking gender bias in vision-language models.
We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas.
We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes.
arXiv Detail & Related papers (2023-06-21T17:59:51Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
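A common way to turn a measured gender/skin-tone distribution into a single number is its distance from parity. The sketch below uses total variation distance from a uniform distribution; this is an illustrative metric, not necessarily DALL-Eval's exact formulation.

```python
# Sketch: quantify demographic skew in generated images as the total
# variation distance between observed attribute counts and a uniform target.
# Assumes every group of interest appears at least once in `labels`.
from collections import Counter

def skew_from_parity(labels: list[str]) -> float:
    """Total variation distance from uniform over the observed groups.

    0.0 means perfectly balanced; the maximum is 1 - 1/k for k groups.
    """
    counts = Counter(labels)
    n = len(labels)
    uniform = 1.0 / len(counts)
    return 0.5 * sum(abs(c / n - uniform) for c in counts.values())

# e.g. skew_from_parity(["male"] * 80 + ["female"] * 20) -> 0.3
```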
arXiv Detail & Related papers (2022-02-08T18:36:52Z)