Automated Evaluation of Gender Bias Across 13 Large Multimodal Models
- URL: http://arxiv.org/abs/2509.07050v1
- Date: Mon, 08 Sep 2025 15:54:25 GMT
- Title: Automated Evaluation of Gender Bias Across 13 Large Multimodal Models
- Authors: Juan Manuel Contreras
- Abstract summary: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. We introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs not only reproduce but systematically amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests that high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.
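The pipeline the abstract describes lends itself to a compact sketch: procedurally build a gender-neutral prompt for each profession, generate an image per prompt per model, have an LLM judge label the depicted gender, and tally male representation by stereotype category. The sketch below is illustrative only; the profession lists, the prompt template, and the `generate_image` / `judge_gender` stubs are assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the evaluation loop described in the abstract.
# The profession lists, prompt template, and the two stubs below are
# illustrative assumptions, not the Aymara benchmark's actual code.
import random
from collections import Counter

PROFESSIONS = {
    "male-stereotyped":   ["plumber", "electrician", "pilot"],
    "female-stereotyped": ["nurse", "librarian", "dietitian"],
    "non-stereotyped":    ["photographer", "bartender", "journalist"],
}

def make_prompt(profession: str) -> str:
    # Gender-neutral phrasing: the prompt names only the occupation.
    return f"A photo of a {profession} at work."

def generate_image(model: str, prompt: str) -> bytes:
    # Placeholder for a text-to-image API call to a commercial LMM.
    return prompt.encode()

def judge_gender(image: bytes) -> str:
    # Placeholder for the validated LLM-as-a-judge scoring step;
    # here we just draw a label at random for demonstration.
    return random.choice(["man", "woman"])

def evaluate(model: str) -> dict[str, float]:
    # Share of images depicting men, per stereotype category.
    shares = {}
    for category, professions in PROFESSIONS.items():
        counts = Counter(
            judge_gender(generate_image(model, make_prompt(p)))
            for p in professions
        )
        total = sum(counts.values())
        shares[category] = counts["man"] / total if total else 0.0
    return shares

print(evaluate("example-lmm"))
```

Aggregating these per-category shares across models and comparing them against real-world labor statistics is what lets the paper distinguish mere reproduction of stereotypes from amplification.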
Related papers
- Who Gets the Callback? Generative AI and Gender Bias
We find that large language models (LLMs) tend to favor men, especially for higher-wage roles. A comprehensive analysis of linguistic features in job ads reveals strong alignment of model recommendations with traditional gender stereotypes. Our findings highlight how AI-driven hiring may perpetuate biases in the labor market and have implications for fairness and diversity within firms.
arXiv Detail & Related papers (2025-04-30T07:55:52Z) - Effect of Gender Fair Job Description on Generative AI Images
This study analyzed gender representation in STEM occupation images generated by OpenAI DALL-E 3 & Black Forest FLUX.1 using 150 prompts in three linguistic forms. Results revealed significant male bias across all forms, with the German pair form showing reduced bias but still overrepresenting men for the STEM group, and mixed results for the group of social occupations.
arXiv Detail & Related papers (2025-02-25T10:21:29Z) - Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
We study gender bias in 22 popular image-to-text vision-language assistants (VLAs). Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. To eliminate the gender bias in these models, we find that fine-tuning-based debiasing methods achieve the best trade-off between debiasing and retaining performance.
arXiv Detail & Related papers (2024-10-25T05:59:44Z) - GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases. GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z) - GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z) - The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects
We propose the Paired Stereotype Test (PST) framework, which queries T2I models to depict two individuals assigned male-stereotyped and female-stereotyped social identities. Using PST, we evaluate two aspects of gender biases -- the well-known bias in gendered occupation and a novel aspect: bias in organizational power.
arXiv Detail & Related papers (2024-02-16T21:32:27Z) - AI Gender Bias, Disparities, and Fairness: Does Training Data Matter?
This study delves into the pervasive issue of gender bias in artificial intelligence (AI). It analyzes more than 1000 human-graded student responses from male and female participants across six assessment items. Results indicate that scoring accuracy for mixed-trained models shows no significant difference from that of either male- or female-trained models.
arXiv Detail & Related papers (2023-12-17T22:37:06Z) - VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
VisoGender is a novel dataset for benchmarking gender bias in vision-language models.
We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas.
We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes.
arXiv Detail & Related papers (2023-06-21T17:59:51Z) - Are Commercial Face Detection Models as Biased as Academic Models? [64.71318433419636]
We compare academic and commercial face detection systems, specifically examining robustness to noise.
We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness.
We conclude that commercial models are consistently as biased as, or more biased than, academic models.
arXiv Detail & Related papers (2022-01-25T02:21:42Z) - Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search
We study a unique gender bias in image search.
Search results are often gender-imbalanced for gender-neutral natural language queries.
We introduce two novel debiasing approaches.
arXiv Detail & Related papers (2021-09-12T04:47:33Z)