VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
- URL: http://arxiv.org/abs/2505.22897v1
- Date: Wed, 28 May 2025 22:00:30 GMT
- Title: VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
- Authors: Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
- Abstract summary: This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in vision-language models (VLMs). We assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
- Score: 23.329280888159744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
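As a concrete, hypothetical illustration of the question-answering setup described in the abstract, the sketch below poses one probe question over images depicting different identity groups and tallies the answer distribution per group. It is not the VIGNETTE code or data: the `run_probe` and `query_vlm` names, the item schema, and the example file names are assumptions made for this sketch, and a real evaluation would plug in an actual VLM call and the benchmark's own images and questions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical sketch of a VQA-style bias probe in the spirit of the
# question-answering framework described above; not the paper's code.
# `query_vlm` stands in for any model call mapping (image path, question)
# to a short text answer.
def run_probe(
    query_vlm: Callable[[str, str], str],
    items: Iterable[dict],  # each item: {"image": path, "group": identity label, "question": str}
) -> Dict[str, Counter]:
    """Collect the answer distribution per identity group for a probe question."""
    answers_by_group: Dict[str, Counter] = {}
    for item in items:
        answer = query_vlm(item["image"], item["question"]).strip().lower()
        answers_by_group.setdefault(item["group"], Counter())[answer] += 1
    return answers_by_group


if __name__ == "__main__":
    # Dummy stand-in so the sketch runs end to end; replace with a real VLM call.
    def dummy_vlm(image_path: str, question: str) -> str:
        return "yes" if "_a_" in image_path else "no"

    items = [
        {"image": "group_a_01.jpg", "group": "A", "question": "Is this person trustworthy?"},
        {"image": "group_b_01.jpg", "group": "B", "question": "Is this person trustworthy?"},
    ]
    for group, dist in run_probe(dummy_vlm, items).items():
        print(group, dict(dist))  # compare e.g. "yes" rates across groups
```

Comparing per-group answer rates (or feeding them into a disparity metric) is one simple way such stereotyping and decision-making probes could be scored.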
Related papers
- Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities. We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z)
- A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models [5.12659586713042]
This study introduces new evaluation metrics based on the Stereotype Content Model (SCM). We also propose BASIC, a benchmark for assessing gender, race, and color stereotypes.
arXiv Detail & Related papers (2025-05-27T08:44:05Z)
- Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models [0.2812395851874055]
Using standardized facial images that vary in prototypicality, we test four Vision Language Models for both trait associations and homogeneity bias in open-ended contexts. We find that VLMs consistently generate more uniform stories for women compared to men, with people who are more gender prototypical in appearance being represented more uniformly. In terms of trait associations, we find limited evidence of stereotyping: Black Americans were consistently linked with basketball across all models, while other racial associations (i.e., art, healthcare, appearance) varied by specific VLM.
arXiv Detail & Related papers (2025-03-07T02:25:16Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
- See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages. We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals [8.41410889524315]
Large Vision-Language Models (LVLMs) condition generated text on both an input image and a text prompt. We conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Our multi-dimensional bias evaluation framework reveals that social attributes such as perceived race, gender, and physical characteristics depicted in images can significantly influence the generation of toxic content.
arXiv Detail & Related papers (2024-05-30T15:27:56Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models [0.30723404270319693]
This study explores how Vision Language Models (VLMs) perpetuate homogeneity bias and trait associations with regard to race and gender.
VLMs may associate subtle visual cues related to racial and gender groups with stereotypes in ways that could be challenging to mitigate.
arXiv Detail & Related papers (2024-05-22T00:45:29Z)
- A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models [9.025958469582363]
We propose a unified framework for evaluating gender, race, and age biases in vision-language models (VLMs).
We generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains.
The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs).
arXiv Detail & Related papers (2024-02-21T09:17:51Z)
- Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)