AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment
- URL: http://arxiv.org/abs/2509.11620v1
- Date: Mon, 15 Sep 2025 06:25:39 GMT
- Title: AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment
- Authors: Kun Li, Lai-Man Po, Hongzheng Yang, Xuyuan Xu, Kangcheng Liu, Yuzhi Zhao
- Abstract summary: AesBiasBench is a benchmark designed to evaluate bias and alignment in Multimodal Large Language Models (MLLMs) for personalized image aesthetic assessment. Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
- Score: 29.2617518199559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
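The abstract names structured metrics (IFD, NRD, AAS) but does not define them, so the sketch below uses generic stand-ins to make the evaluation idea concrete: stereotype bias is proxied by the spread of mean aesthetic scores when the same images are rated under different demographic identity prompts, and alignment by correlation with human ratings. All data, group labels, and function names here are hypothetical illustrations, not the paper's actual metrics.

```python
# Illustrative sketch only: IFD/NRD/AAS are not defined in the abstract,
# so this uses a simple spread-across-groups proxy for stereotype bias
# and a correlation-with-humans proxy for alignment.
from statistics import mean, correlation  # correlation requires Python 3.10+

# Hypothetical data: model aesthetic scores (1-5 scale) for the same four
# images, elicited with different demographic identities in the prompt.
scores_by_group = {
    "female_18_24":   [3.8, 4.1, 2.9, 4.5],
    "male_18_24":     [3.2, 3.9, 2.5, 4.0],
    "female_55_plus": [4.0, 4.3, 3.1, 4.6],
}
human_ratings = [3.9, 4.2, 3.0, 4.4]  # per-image human preference scores

def stereotype_bias(groups: dict[str, list[float]]) -> float:
    """Bias proxy: range of per-group mean scores. Zero means the model
    scored identically regardless of the stated identity."""
    group_means = [mean(scores) for scores in groups.values()]
    return max(group_means) - min(group_means)

def alignment(model_scores: list[float], human: list[float]) -> float:
    """Alignment proxy: Pearson correlation with human ratings."""
    return correlation(model_scores, human)

bias = stereotype_bias(scores_by_group)
align = mean(alignment(s, human_ratings) for s in scores_by_group.values())
print(f"bias proxy: {bias:.2f}, mean alignment: {align:.2f}")
```

Under this toy setup, a model that shifts its scores when the prompt states a different user identity yields a larger bias proxy, which is the qualitative pattern the paper reports for smaller MLLMs.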
Related papers
- Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation [12.56588481992456]
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior. We introduce a novel and general augmentation framework that involves three plug-and-play steps. We find that Large Language Models are susceptible to minimal perturbations of their inputs, showing a higher likelihood of behaving stereotypically.
arXiv Detail & Related papers (2025-10-27T23:05:12Z) - Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment [8.451522319478512]
We introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets. We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias.
arXiv Detail & Related papers (2025-09-24T00:33:58Z) - No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models [0.9620910657090186]
Large Language Models (LLMs) have improved performance on a range of natural language understanding and generation tasks. We provide a unified evaluation of benchmarks using a set of representative small and medium-sized LLMs. We propose five prompting approaches to carry out the bias detection task across different aspects of bias. The results indicate that each of the selected LLMs suffers from one form of bias or another, with the Phi-3.5B model being the least biased.
arXiv Detail & Related papers (2025-03-15T03:58:14Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks. To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets. We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z) - Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z) - Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models [10.73340009530019]
This study addresses two such biases within Large Language Models (LLMs): representative bias and affinity bias.
We introduce two novel metrics to measure these biases: the Representative Bias Score (RBS) and the Affinity Bias Score (ABS).
Our analysis uncovers marked representative biases in prominent LLMs, with a preference for identities associated with being white, straight, and men.
Our investigation of affinity bias reveals distinctive evaluative patterns within each model, akin to 'bias fingerprints'.
arXiv Detail & Related papers (2024-05-23T13:35:34Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Evaluating the Fairness of Discriminative Foundation Models in Computer Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
arXiv Detail & Related papers (2023-10-18T10:32:39Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)