Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition
- URL: http://arxiv.org/abs/2506.19079v1
- Date: Mon, 23 Jun 2025 19:56:30 GMT
- Title: Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition
- Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Adem Abdelmoula, Adria Mallol-Ragolta, Bjoern W. Schuller
- Abstract summary: Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth.
- Score: 10.842056584680071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.
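To make the teeth-split benchmarking concrete, here is a minimal sketch, not taken from the paper, of how zero-shot emotion predictions could be scored separately for images with and without visible teeth. The record fields (`label`, `prediction`, `teeth_visible`) are hypothetical stand-ins for the ground-truth category, the VLM output, and the teeth annotation.

```python
from collections import defaultdict

def accuracy_by_attribute(records, attribute="teeth_visible"):
    """Compute per-group accuracy for a binary image attribute.

    Each record is a dict with (hypothetical) keys:
      'label'      - ground-truth emotion category,
      'prediction' - zero-shot emotion predicted by the VLM,
      attribute    - binary annotation, e.g. whether teeth are visible.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = bool(r[attribute])
        total[group] += 1
        correct[group] += int(r["prediction"] == r["label"])
    return {group: correct[group] / total[group] for group in total}

# Toy example with made-up annotations, just to show the comparison.
records = [
    {"label": "happiness", "prediction": "happiness", "teeth_visible": True},
    {"label": "anger",     "prediction": "happiness", "teeth_visible": True},
    {"label": "sadness",   "prediction": "sadness",   "teeth_visible": False},
    {"label": "fear",      "prediction": "surprise",  "teeth_visible": False},
]
print(accuracy_by_attribute(records))
# A pronounced accuracy gap between the True and False groups is the kind of
# performance shift the abstract describes: predictions tracking visible
# teeth rather than affect alone.
```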
Related papers
- Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models [2.2023261946811563]
Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale.
We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets.
arXiv Detail & Related papers (2026-01-27T17:56:21Z)
- Understanding the Implicit Biases of Design Choices for Time Series Foundation Models [90.894232610821]
Time series foundation models (TSFMs) are a class of potentially powerful, general-purpose tools for time series forecasting and related temporal tasks.
Their behavior is strongly shaped by subtle inductive biases in their design.
We show how these biases can be intuitive or very counterintuitive, depending on properties of the model and data.
arXiv Detail & Related papers (2025-10-22T04:42:35Z) - The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs [60.15472325639723]
Personality traits have long been studied as predictors of human behavior.<n>Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems.
arXiv Detail & Related papers (2025-09-03T21:27:10Z) - Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE)<n>RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning.<n>We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z) - Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training [14.450673163785094]
Context-Aware Emotion Recognition (CAER) provides valuable semantic cues for recognizing the emotions of target persons.
Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts.
We present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder.
arXiv Detail & Related papers (2024-07-06T05:29:02Z) - CAGE: Circumplex Affect Guided Expression Inference [9.108319009019912]
We present a comparative in-depth analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect.
We propose a model for the prediction of facial expressions tailored for lightweight applications (see the valence-arousal quadrant sketch after this list).
arXiv Detail & Related papers (2024-04-23T12:30:17Z) - The Power of Properties: Uncovering the Influential Factors in Emotion Classification [7.562215603730798]
State-of-the-art classification performance is only reached by end-to-end trained neural networks.
We introduce a workflow to evaluate explicit properties and their impact.
arXiv Detail & Related papers (2024-04-11T16:01:00Z) - Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition [1.4374467687356276]
This paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and multi-view Fusion Attention mechanism for expression classification.
We suggest easy-to-implement and no-training frameworks aimed at highlighting key facial features to determine if such features can serve as guides for the model.
The efficacy of this method is validated by improvements in model performance on the Aff-wild2 dataset.
arXiv Detail & Related papers (2024-03-19T16:21:47Z) - LLMvsSmall Model? Large Language Model Based Text Augmentation Enhanced
Personality Detection Model [58.887561071010985]
Personality detection aims to detect one's personality traits underlying social media posts.
Most existing methods learn post features directly by fine-tuning the pre-trained language models.
We propose a large language model (LLM) based text augmentation enhanced personality detection model.
arXiv Detail & Related papers (2024-03-12T12:10:18Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - I am Only Happy When There is Light: The Impact of Environmental Changes
on Affective Facial Expressions Recognition [65.69256728493015]
We study the impact of different image conditions on the recognition of arousal from human facial expressions.
Our results show how the interpretation of human affective states can differ greatly in either the positive or negative direction.
arXiv Detail & Related papers (2022-10-28T16:28:26Z) - CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial
Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory AdaptatiOn (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO improves facial expression recognition performance across six different datasets, each with a distinct affective representation.
arXiv Detail & Related papers (2022-08-10T15:46:05Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention-map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
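Both the headline paper (which checks the internal consistency of GPT-4o's valence-arousal predictions) and CAGE (which builds on the circumplex model of affect) work in a two-dimensional valence-arousal space. The sketch below is illustrative only: it assumes valence and arousal are scaled to [-1, 1] and uses example quadrant labels not taken from either paper, showing how continuous predictions map onto circumplex quadrants and how a categorical label can be checked against them for consistency.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to a circumplex-of-affect quadrant.

    Both coordinates are assumed to lie in [-1, 1]; the example emotion
    names per quadrant are illustrative only.
    """
    if valence >= 0 and arousal >= 0:
        return "high-arousal positive (e.g. excitement)"
    if valence < 0 and arousal >= 0:
        return "high-arousal negative (e.g. fear, anger)"
    if valence < 0:
        return "low-arousal negative (e.g. sadness)"
    return "low-arousal positive (e.g. contentment)"

# Consistency check in the spirit of the headline paper: a model that labels
# an image "happiness" while placing it at negative valence is internally
# inconsistent.
print(circumplex_quadrant(0.7, 0.4))    # high-arousal positive (e.g. excitement)
print(circumplex_quadrant(-0.6, -0.3))  # low-arousal negative (e.g. sadness)
```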