Related papers: Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts

Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts

URL: http://arxiv.org/abs/2406.17974v1
Date: Tue, 25 Jun 2024 23:11:39 GMT
Title: Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
Authors: Xuyang Wu, Yuan Wang, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang,
Abstract summary: We empirically investigate emphvisual fairness in several mainstream vision-language models (LVLMs) We audit their performance disparities across sensitive demographic attributes based on public fairness benchmark datasets (e.g., FACET) Despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.
Score: 27.66626125248612
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate \emph{visual fairness} in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.

Related papers

Exploring Fairness across Fine-Grained Attributes in Large Vision-Language Models [26.186038156155522]
We construct an open-set knowledge base of bias attributes leveraging Large Language Models (LLMs) and evaluate the fairness of LVLMs across finer-grained attributes.<n>Our experimental results reveal that LVLMs exhibit biased outputs across a diverse set of attributes and further demonstrate that cultural, environmental, and behavioral factors have a more pronounced impact on LVLM decision-making than traditional demographic attributes.
arXiv Detail & Related papers (2025-08-05T04:52:32Z)
Mitigating Group-Level Fairness Disparities in Federated Visual Language Models [115.16940773660104]
This paper introduces FVL-FP, a novel framework that combines FL with fair prompt tuning techniques.<n>We focus on mitigating demographic biases while preserving model performance.<n>Our approach reduces demographic disparity by an average of 45% compared to standard FL approaches.
arXiv Detail & Related papers (2025-05-03T16:09:52Z)
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs [8.37667737406383]
We propose a fairness benchmark for large language model (LLM)-based chatbots in multi-turn dialogue scenarios, textbfFairMT-Bench. To ensure coverage of diverse bias types and attributes, we employ our template to construct a multi-turn dialogue dataset, textttFairMT-10K. Experiments and analyses on textttFairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models.
arXiv Detail & Related papers (2024-10-25T06:06:31Z)
Fairness in Large Language Models in Three Hours [2.443957114877221]
This tutorial provides a systematic overview of recent advances in the literature concerning large language models. The concept of fairness in LLMs is then explored, summarizing the strategies for evaluating bias and the algorithms designed to promote fairness.
arXiv Detail & Related papers (2024-08-02T03:44:14Z)
GenderBias-\emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-emphVL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs. Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
VLBiasBench is a benchmark aimed at evaluating biases in Large Vision-Language Models (LVLMs) We construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status) We conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z)
Unveiling the Tapestry of Consistency in Large Vision-Language Models [25.106467574467448]
We provide a benchmark ConBench to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and get the following findings. We hope this paper will accelerate the research community in better evaluating their models and encourage future advancements in the consistency domain.
arXiv Detail & Related papers (2024-05-23T04:08:23Z)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs. Existing benchmarks are often limited in scope, focusing mainly on object hallucinations. We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
Leveraging vision-language models for fair facial attribute classification [19.93324644519412]
General-purpose vision-language model (VLM) is a rich knowledge source for common sensitive attributes. We analyze the correspondence between VLM predicted and human defined sensitive attribute distribution. Experiments on multiple benchmark facial attribute classification datasets show fairness gains of the model over existing unsupervised baselines.
arXiv Detail & Related papers (2024-03-15T18:37:15Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
Exploring Value Biases: How LLMs Deviate Towards the Ideal [57.99044181599786]
Large-Language-Models (LLMs) are deployed in a wide range of applications, and their response has an increasing social impact. We show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
arXiv Detail & Related papers (2024-02-16T18:28:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.