Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition
- URL: http://arxiv.org/abs/2410.01848v1
- Date: Tue, 1 Oct 2024 10:42:55 GMT
- Title: Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition
- Authors: Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
- Abstract summary: State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, enabling the training of deep interpretable models.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
- Score: 55.97779732051921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, enabling the training of deep interpretable models. During training, this AU codebook is used, along with the input image's expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER by constraining the spatial layer features of the classifier to be correlated with the AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on image-level expression labels for supervision, without additional manual annotations. It is generic and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that the proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common family of interpretable classifiers that rely on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
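The abstract describes the training objective only at a high level. Below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the two ingredients it mentions: building an AU heatmap from facial landmarks and the expression's AU codebook entry, and a composite loss that adds a feature-heatmap alignment term to cross-entropy. The function names, the Gaussian-blob heatmap, the channel-pooled attention, and the cosine alignment term are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def build_au_heatmap(landmarks, au_landmark_ids, size, sigma=8.0):
    # Hypothetical heatmap builder: place a Gaussian blob at each facial
    # landmark associated with the expression's action units (AU codebook).
    h, w = size
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    heatmap = torch.zeros(h, w)
    for i in au_landmark_ids:            # landmark indices tied to the active AUs
        x, y = landmarks[i]              # (x, y) in pixel coordinates
        heatmap += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap / heatmap.max().clamp(min=1e-6)

def composite_loss(logits, labels, spatial_feats, au_heatmaps, lam=1.0):
    # Cross-entropy plus a term that pushes the classifier's spatial attention
    # toward the AU heatmap (cosine alignment is one of several possible choices).
    ce = F.cross_entropy(logits, labels)
    attn = spatial_feats.mean(dim=1, keepdim=True)   # (B, 1, h, w) channel-pooled
    attn = F.interpolate(attn, size=au_heatmaps.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)
    align = 1.0 - F.cosine_similarity(attn.flatten(1),
                                      au_heatmaps.flatten(1), dim=1).mean()
    return ce + lam * align
```

A training step would then look like `loss = composite_loss(model(x), y, feats, maps)`, where `feats` is an intermediate spatial feature map exposed by the backbone (e.g. via a forward hook); how that map is obtained, and at which layers the alignment is applied, is left open in this sketch.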
Related papers
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues [55.97779732051921]
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training.
We show that our strategy can improve layer-wise interpretability without degrading classification performance.
arXiv Detail & Related papers (2024-02-01T02:13:49Z)
- LICO: Explainable Models with Language-Image Consistency [39.869639626266554]
This paper develops a Language-Image COnsistency model for explainable image classification, termed LICO.
We first establish a coarse global manifold structure alignment by minimizing the distance between the distributions of image and language features.
We then achieve fine-grained saliency maps by applying optimal transport (OT) theory to assign local feature maps with class-specific prompts.
arXiv Detail & Related papers (2023-10-15T12:44:33Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Fine-Grained Visual Classification using Self Assessment Classifier [12.596520707449027]
Extracting discriminative features plays a crucial role in the fine-grained visual classification task.
In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the image representation and the top-k predicted classes.
We show that our method achieves new state-of-the-art results on CUB200-2011, Stanford Dog, and FGVC Aircraft datasets.
arXiv Detail & Related papers (2022-05-21T07:41:27Z)
- Uncertain Label Correction via Auxiliary Action Unit Graphs for Facial Expression Recognition [46.99756911719854]
We propose ULC-AG, which corrects uncertain facial expression labels using auxiliary action unit (AU) graphs.
Experiments show that our ULC-AG achieves 89.31% and 61.57% accuracy on RAF-DB and AffectNet datasets, respectively.
arXiv Detail & Related papers (2022-04-23T11:09:43Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- AU-Expression Knowledge Constrained Representation Learning for Facial Expression Recognition [79.8779790682205]
We propose an AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework to learn AU representations without AU annotations and adaptively use these representations to facilitate facial expression recognition.
We conduct experiments on the challenging uncontrolled datasets to demonstrate the superiority of the proposed framework over current state-of-the-art methods.
arXiv Detail & Related papers (2020-12-29T03:42:04Z)