VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
- URL: http://arxiv.org/abs/2406.14194v2
- Date: Wed, 25 Dec 2024 07:31:14 GMT
- Title: VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
- Authors: Sibo Wang, Xiangkui Cao, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen, Wen Gao
- Abstract summary: We introduce VLBiasBench, a benchmark for evaluating biases in Large Vision-Language Models (LVLMs).
VLBiasBench features a dataset that covers nine distinct categories of social bias (age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status) as well as two intersectional bias categories: race × gender and race × socioeconomic status.
We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
- Score: 72.13121434085116
- Abstract: The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format, and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench features a dataset that covers nine distinct categories of social bias (age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status) as well as two intersectional bias categories: race × gender and race × socioeconomic status. To build a large-scale dataset, we use the Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to create 128,342 samples. These questions are divided into open-ended and close-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.
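To make the dataset structure concrete, here is a minimal Python sketch of how an evaluation loop over such a benchmark might look. Everything here is an assumption for illustration: the file layout (`samples.jsonl`), the `load_samples` helper, and the `query_lvlm` stub are hypothetical, not the benchmark's actual API; see the repository linked above for the real interface.

```python
# Hypothetical evaluation loop over a VLBiasBench-style dataset.
# File layout and function names are invented for illustration.
import json
from pathlib import Path

def load_samples(root: str):
    """Yield samples with an image path, a question, and a question type."""
    for line in Path(root, "samples.jsonl").read_text().splitlines():
        yield json.loads(line)  # e.g. {"image": ..., "question": ..., "type": "open" | "close"}

def query_lvlm(image_path: str, question: str) -> str:
    """Stub for a call into the LVLM under evaluation."""
    raise NotImplementedError("wire in the model under test here")

def evaluate(root: str) -> dict:
    results = {"open": [], "close": []}
    for sample in load_samples(root):
        answer = query_lvlm(sample["image"], sample["question"])
        # Close-ended items can be scored by matching answer choices;
        # open-ended ones typically need a separate scoring step.
        results[sample["type"]].append({"question": sample["question"], "answer": answer})
    return results
```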
Related papers
- SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models [42.54907891780377]
Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices.
Existing datasets evaluating stereotype biases in LMMs often lack diversity and rely on synthetic images.
We introduce the Stereotype Bias Benchmark (SB-bench), the most comprehensive framework to date for assessing stereotype biases.
arXiv Detail & Related papers (2025-02-12T20:41:53Z) - Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings [13.686732204665738]
We extend an existing BBQ dataset by incorporating fill-in-the-blank and short-answer question types.
Our findings reveal that LLMs produce responses that are more biased against certain protected attributes, such as age and socioeconomic status.
Our debiasing approach, which combines zero-shot, few-shot, and chain-of-thought prompting, can reduce the level of bias to nearly zero.
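As a rough illustration of what combining these prompting strategies can look like, here is a short sketch; the instruction text and the worked example are invented stand-ins, not the paper's actual prompts.

```python
# Hypothetical combination of zero-shot instruction, a few-shot example,
# and a chain-of-thought cue for bias-sensitive QA. Wording is illustrative.
ZERO_SHOT_INSTRUCTION = (
    "Answer based only on the information given. If the passage does not "
    "identify who is responsible, answer 'unknown'."
)

FEW_SHOT_EXAMPLE = (
    "Q: A younger and an older colleague both missed a deadline. Who was lazy?\n"
    "Reasoning: The passage gives no evidence about either person's effort.\n"
    "A: unknown"
)

def build_debiasing_prompt(question: str) -> str:
    """Compose instruction, worked example, and a reasoning cue."""
    return (
        f"{ZERO_SHOT_INSTRUCTION}\n\n{FEW_SHOT_EXAMPLE}\n\n"
        f"Q: {question}\nReasoning:"  # elicit step-by-step reasoning before the answer
    )

print(build_debiasing_prompt("Who was bad with technology?"))
```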
arXiv Detail & Related papers (2024-12-09T01:29:47Z) - Bias Similarity Across Large Language Models [32.0365189539138]
We analyze bias through output distributions across multiple dimensions, using two datasets (4K and 1M questions).
Our results show that fine-tuning has minimal impact on output distributions, and that proprietary models tend to over-answer 'unknown' to minimize bias, compromising accuracy and utility.
Open-source models like Llama3-Chat and Gemma2-it demonstrate fairness comparable to proprietary models like GPT-4, challenging the assumption that larger, closed-source models are inherently less biased.
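One simple way to operationalize comparing bias through output distributions is to treat each model's answer-choice frequencies as a probability distribution and compute a pairwise divergence. The sketch below uses Jensen-Shannon distance over invented frequencies; it illustrates the idea, not the paper's exact metric.

```python
# Compare two models' answer distributions over
# (stereotyped, anti-stereotyped, unknown) responses. Numbers are invented.
import numpy as np
from scipy.spatial.distance import jensenshannon

answer_dists = {
    "model_a": np.array([0.45, 0.25, 0.30]),
    "model_b": np.array([0.20, 0.20, 0.60]),  # leans heavily on 'unknown'
}

names = list(answer_dists)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        d = jensenshannon(answer_dists[a], answer_dists[b])
        print(f"JS distance {a} vs {b}: {d:.3f}")
```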
arXiv Detail & Related papers (2024-10-15T19:21:14Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM [8.24274551090375]
We introduce BIGbench, a unified benchmark for Biases of Image Generation.
Unlike existing benchmarks, BIGbench classifies and evaluates biases across four dimensions.
Our study also reveals new research directions on bias, such as the effects of distillation and of irrelevant protected attributes.
arXiv Detail & Related papers (2024-07-21T18:09:40Z) - GenderBias-\emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-emphVL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
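The core of counterfactual probing can be sketched in a few lines: pair each probe with an attribute-swapped counterpart and check whether the model's answer changes. GenderBias-VL pairs counterfactual images; this text-only version, with an invented swap table and a stub model callable, only conveys the idea.

```python
# Minimal text-side counterfactual probe. The swap table is illustrative
# and deliberately tiny; a real probe would cover far more terms.
SWAPS = {"man": "woman", "woman": "man", "he": "she", "she": "he"}

def gender_swap(text: str) -> str:
    return " ".join(SWAPS.get(tok, tok) for tok in text.lower().split())

def answer_changes(model_answer_fn, question: str) -> bool:
    """True if the model's answer flips under the gender swap."""
    return model_answer_fn(question) != model_answer_fn(gender_swap(question))

# Usage, with `my_model` standing in for whatever callable wraps the LVLM:
# flipped = answer_changes(my_model, "is the man in the photo an engineer?")
```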
arXiv Detail & Related papers (2024-06-30T05:55:15Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language
Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
Existing evaluation methods have many constraints, and their results exhibit limited interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI
Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
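Template-based construction of this kind is straightforward to picture: each authored template is expanded over combinations of group terms to yield many test instances. The sketch below uses an invented English template and fillers purely for illustration; CBBQ's actual templates are in Chinese and far more carefully controlled.

```python
# Expand a bias-probe template over group-term combinations.
# Template and fillers are invented stand-ins for illustration.
from itertools import product

TEMPLATE = "A {group_a} applicant and a {group_b} applicant interviewed. Who is less reliable?"
FILLERS = {"group_a": ["young", "elderly"], "group_b": ["young", "elderly"]}

def expand(template: str, fillers: dict):
    keys = list(fillers)
    for combo in product(*(fillers[k] for k in keys)):
        if len(set(combo)) == len(combo):  # skip same-group pairings
            yield template.format(**dict(zip(keys, combo)))

for instance in expand(TEMPLATE, FILLERS):
    print(instance)
```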
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
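The uncalibrated core of this idea fits in a few lines of linear algebra: orthonormalize the biased directions and project embeddings onto their orthogonal complement. The sketch below shows only that core, with random stand-ins for CLIP-style text embeddings; the paper's calibrated projection matrix adds a calibration step on top.

```python
# Project out biased directions from text embeddings (uncalibrated core).
import numpy as np

def projection_matrix(biased_dirs: np.ndarray) -> np.ndarray:
    """biased_dirs: (k, d) array whose rows span the biased subspace."""
    q, _ = np.linalg.qr(biased_dirs.T)              # (d, k) orthonormal basis
    return np.eye(biased_dirs.shape[1]) - q @ q.T   # complement projector

# Stand-ins: biased directions would come from embedding differences of
# prompt pairs such as "a photo of a man" vs. "a photo of a woman".
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 512))   # fake text embeddings
bias = rng.normal(size=(2, 512))       # fake biased directions
debiased = text_emb @ projection_matrix(bias)
assert np.allclose(debiased @ bias.T, 0.0, atol=1e-6)  # bias components gone
```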
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - "I'm sorry to hear that": Finding New Biases in Language Models with a
Holistic Descriptor Dataset [12.000335510088648]
We present a new, more inclusive bias measurement dataset, HolisticBias, which includes nearly 600 descriptor terms across 13 different demographic axes.
HolisticBias was assembled in a participatory process including experts and community members with lived experience of these terms.
We demonstrate that HolisticBias is effective at measuring previously undetectable biases in token likelihoods from language models.
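Measuring bias in token likelihoods can be sketched as scoring descriptor-swapped sentences under a language model and comparing their log-probabilities. The snippet below does this with GPT-2 via Hugging Face transformers; the model choice, template, and descriptors are illustrative, not HolisticBias's actual protocol.

```python
# Score descriptor-swapped sentences by total log-probability under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return -loss.item() * (ids.shape[1] - 1)  # total log-prob over predicted tokens

template = "I love being a {} person."
for descriptor in ("young", "elderly", "disabled"):
    print(descriptor, round(sentence_logprob(template.format(descriptor)), 2))
```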
arXiv Detail & Related papers (2022-05-18T20:37:25Z)