IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models
- URL: http://arxiv.org/abs/2602.12659v1
- Date: Fri, 13 Feb 2026 06:41:03 GMT
- Title: IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models
- Authors: Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Jiechao Gao
- Abstract summary: Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data. We present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India.
- Score: 33.41922953936466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indians as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection (INLP) debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
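As a rough, hedged sketch of the post-hoc debiasing step described in the abstract, the snippet below applies the Iterative Nullspace Projection (INLP) idea to image embeddings: repeatedly fit a linear probe for a protected attribute (here, a state label) and project the embeddings onto the probe's nullspace. The embedding dimension, iteration count, and random stand-in data are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of Iterative Nullspace Projection (INLP) on image
# embeddings. Illustrative only: the embedding size, attribute encoding,
# and iteration count are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection(w):
    """Projection matrix onto the nullspace of the rows of w."""
    w = np.atleast_2d(w)
    _, s, vt = np.linalg.svd(w, full_matrices=False)
    basis = vt[s > 1e-10]                    # directions the probe relies on
    return np.eye(w.shape[1]) - basis.T @ basis

def inlp_debias(embeddings, labels, n_iters=5):
    """Iteratively remove linearly decodable attribute information."""
    X = embeddings.copy()
    P = np.eye(embeddings.shape[1])
    for _ in range(n_iters):
        probe = LogisticRegression(max_iter=1000).fit(X, labels)
        P = nullspace_projection(probe.coef_) @ P   # compose projections
        X = embeddings @ P.T                        # re-project originals
    return P, X

# Random stand-ins for 512-d CLIP image embeddings and 36 state/UT labels.
rng = np.random.default_rng(0)
emb = rng.normal(size=(720, 512))
states = rng.integers(0, 36, size=720)
P, debiased = inlp_debias(emb, states)
```

Retrieval quality on standard benchmarks can then be re-measured on the projected embeddings, which is how a drop such as the under-1.5-percent figure reported in the abstract would be verified.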
Related papers
- Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models [81.45743826739054]
A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We create person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content.
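A co-occurrence audit of this kind can be sketched as a simple rate comparison over the annotations. The column names and keyword list below are hypothetical placeholders rather than the released annotation schema.

```python
# Hedged sketch of a caption/attribute co-occurrence audit. The columns
# "perceived_race" and "caption", and the keyword list, are hypothetical
# placeholders, not the actual LAION-400M annotation schema.
import pandas as pd

def association_rates(df, keywords):
    flagged = df["caption"].str.contains("|".join(keywords), case=False, na=False)
    overall = flagged.mean()                                   # base rate over all rows
    by_group = flagged.groupby(df["perceived_race"]).mean()    # rate per group
    return (by_group / overall).sort_values(ascending=False)   # >1 means overrepresented

toy = pd.DataFrame({
    "perceived_race": ["group_a", "group_b", "group_a", "group_c"],
    "caption": ["a person walking", "mugshot of a suspect",
                "crime scene photo", "family at a park"],
})
print(association_rates(toy, ["crime", "mugshot", "suspect"]))
```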
arXiv Detail & Related papers (2025-10-04T07:51:59Z) - How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion [25.340454708475754]
We quantify the presence and "stickiness" of representational bias in large language models for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation. Our findings suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution of bias in their training data.
arXiv Detail & Related papers (2025-07-22T17:28:37Z) - FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes [23.71105683137539]
Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. We introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate the fairness of LLMs across 85 socio-cultural identity groups.
arXiv Detail & Related papers (2025-06-29T06:31:06Z) - FaceSaliencyAug: Mitigating Geographic, Gender and Stereotypical Biases via Saliency-Based Data Augmentation [46.74201905814679]
We present an approach named FaceSaliencyAug aimed at addressing the gender bias in computer vision models.
We quantify dataset diversity using the Image Similarity Score (ISS) across six datasets: Flickr Faces HQ (FFHQ), WIKI, IMDB, Labelled Faces in the Wild (LFW), UTK Faces, and the Diverse dataset (see the sketch after this entry).
Our experiments reveal a reduction in gender bias for both CNNs and ViTs, indicating the efficacy of our method in promoting fairness and inclusivity in computer vision models.
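One plausible way to turn pairwise image similarity into a dataset-level diversity score is sketched below; it is an illustration under assumed embeddings, not necessarily the exact Image Similarity Score definition used in that paper.

```python
# Hedged sketch: mean off-diagonal cosine similarity over face embeddings
# as a crude diversity score (lower similarity = more diverse). Not
# necessarily the paper's exact ISS definition.
import numpy as np

def mean_pairwise_similarity(embeddings):
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))   # off-diagonal mean

rng = np.random.default_rng(0)
faces = rng.normal(size=(100, 512))   # stand-in for face embeddings
print(mean_pairwise_similarity(faces))
```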
arXiv Detail & Related papers (2024-10-17T22:36:52Z) - GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
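A stripped-down, CLIP-based analogue of such a counterfactual probe is sketched below: score the same occupation prompt against two images that differ only in perceived gender and inspect the gap. The model checkpoint, prompts, and image paths are placeholders; this is not the GenderBias-VL pipeline itself.

```python
# Hedged sketch of a counterfactual-style occupation probe with CLIP.
# Image paths and prompts are placeholders; this is not the actual
# GenderBias-VL evaluation pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("engineer_male.jpg"), Image.open("engineer_female.jpg")]
prompts = ["a photo of an engineer", "a photo of a nurse"]

inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape: (2 images, 2 prompts)
probs = logits.softmax(dim=-1)
# A large gap between the two rows for the same occupation prompt
# suggests a gender-occupation association in the embedding space.
print(probs)
```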
arXiv Detail & Related papers (2024-06-30T05:55:15Z) - VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
We introduce VLBiasBench, a benchmark to evaluate biases in Large Vision-Language Models (LVLMs). VLBiasBench features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status, as well as two intersectional bias categories: race × gender and race × socioeconomic status. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z) - IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context [32.48196952339581]
We introduce IndiBias, a benchmark dataset for evaluating social biases in the Indian context.
The included bias dimensions encompass gender, religion, caste, age, region, physical appearance, and occupation.
Our dataset contains 800 sentence pairs and 300 tuples for bias measurement across different demographics.
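A common way to score such sentence pairs, shown below as a hedged sketch rather than IndiBias's exact protocol, is to check whether a language model assigns higher likelihood to the stereotypical sentence than to its demographic-swapped counterpart.

```python
# Hedged sketch of a sentence-pair bias check (CrowS-Pairs-style), not
# necessarily IndiBias's exact scoring protocol. Model choice and the
# example sentences are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)               # loss = mean NLL per predicted token
    return -out.loss.item() * (ids.shape[1] - 1)

stereo = "A placeholder sentence expressing a stereotype about a group."
anti = "The same placeholder sentence with the group term swapped."
# Over many pairs, the fraction where the stereotypical sentence scores
# higher is one common bias metric.
print(sentence_logprob(stereo) > sentence_logprob(anti))
```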
arXiv Detail & Related papers (2024-03-29T12:32:06Z) - Decoding Demographic un-fairness from Indian Names [4.402336973466853]
Demographic classification is essential in fairness assessment in recommender systems or in measuring unintended bias in online networks and voting systems.
We collect three publicly available datasets to train state-of-the-art classifiers in the domain of gender and caste classification.
We perform cross-testing (training and testing on different datasets) to understand the efficacy of the above models.
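Cross-testing of this kind reduces to a train-on-one, test-on-the-others loop; the sketch below uses synthetic features and a logistic-regression classifier purely as stand-ins for the name-based models and the three datasets.

```python
# Hedged sketch of cross-testing: train on one dataset, evaluate on the
# others. Features, labels, and the classifier are synthetic stand-ins
# for the name-based gender/caste classifiers and datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
datasets = {name: (rng.normal(size=(500, 32)), rng.integers(0, 2, 500))
            for name in ("dataset_a", "dataset_b", "dataset_c")}

for train_name, (X_tr, y_tr) in datasets.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in datasets.items():
        if test_name == train_name:
            continue
        acc = accuracy_score(y_te, clf.predict(X_te))
        print(f"train={train_name} test={test_name} acc={acc:.3f}")
```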
arXiv Detail & Related papers (2022-09-07T11:54:49Z) - Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and impostor sample pairs leads to suboptimal results (see the per-subgroup threshold sketch after this entry).
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
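The single-threshold issue noted above can be made concrete with per-subgroup operating points; the sketch below derives a verification threshold per subgroup at a fixed false match rate from synthetic impostor scores, and is an illustration rather than the paper's method.

```python
# Hedged sketch: one verification threshold per demographic subgroup at a
# fixed false match rate, instead of a single global threshold. The score
# distributions are synthetic stand-ins for impostor-pair similarities.
import numpy as np

def threshold_at_fmr(impostor_scores, fmr=1e-3):
    # Smallest threshold whose false match rate does not exceed `fmr`.
    return np.quantile(impostor_scores, 1.0 - fmr)

rng = np.random.default_rng(0)
impostor = {
    "subgroup_a": rng.normal(0.30, 0.10, 10_000),
    "subgroup_b": rng.normal(0.45, 0.10, 10_000),
}
for name, scores in impostor.items():
    print(name, round(float(threshold_at_fmr(scores)), 3))
```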
arXiv Detail & Related papers (2021-03-16T15:05:49Z) - Re-imagining Algorithmic Fairness in India and Beyond [9.667710168953239]
We de-center algorithmic fairness and analyse AI power in India.
We find that data is not always reliable due to socio-economic factors.
We provide a roadmap to re-contextualise data and models, empower oppressed communities, and enable Fair-ML ecosystems.
arXiv Detail & Related papers (2021-01-25T10:20:57Z)