POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition
- URL: http://arxiv.org/abs/2204.04083v2
- Date: Sun, 13 Aug 2023 20:49:39 GMT
- Title: POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition
- Authors: Ce Zheng, Matias Mendieta, and Chen Chen
- Abstract summary: Facial expression recognition (FER) is an important task in computer vision, having practical applications in areas such as human-computer interaction, education, healthcare, and online monitoring.
There are three key issues especially prevalent: inter-class similarity, intra-class discrepancy, and scale sensitivity.
We propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that aims to holistically solve all three issues.
- Score: 11.525573321175925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition (FER) is an important task in computer vision,
having practical applications in areas such as human-computer interaction,
education, healthcare, and online monitoring. In this challenging FER task,
there are three key issues especially prevalent: inter-class similarity,
intra-class discrepancy, and scale sensitivity. While existing works typically
address some of these issues, none have fully addressed all three challenges in
a unified framework. In this paper, we propose a two-stream Pyramid
crOss-fuSion TransformER network (POSTER) that aims to holistically solve all
three issues. Specifically, we design a transformer-based cross-fusion method
that enables effective collaboration of facial landmark features and image
features to maximize proper attention to salient facial regions. Furthermore,
POSTER employs a pyramid structure to promote scale invariance. Extensive
experimental results demonstrate that our POSTER achieves new state-of-the-art
results on RAF-DB (92.05%), FERPlus (91.62%), as well as AffectNet 7 class
(67.31%) and 8 class (63.34%). The code is available at
https://github.com/zczcwh/POSTER.
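To make the cross-fusion idea more concrete, below is a minimal PyTorch-style sketch of a two-stream block in which image tokens and facial-landmark tokens attend to each other. Every module name, token count, and dimension is an illustrative assumption, not the released POSTER code (see the repository above for the actual implementation).

```python
# Hedged sketch only: a generic two-stream cross-attention block in the spirit of the
# abstract's "cross-fusion" of facial landmark features and image features.
# Names, dimensions, and token counts are assumptions, not the authors' code.
import torch
import torch.nn as nn


class CrossFusionBlock(nn.Module):
    """Each stream queries the other stream, then adds a residual and normalizes."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.img_from_lmk = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lmk_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_lmk = nn.LayerNorm(dim)

    def forward(self, img_tokens, lmk_tokens):
        # Image tokens act as queries over landmark keys/values and vice versa,
        # so landmark cues can steer attention toward salient facial regions.
        img_att, _ = self.img_from_lmk(img_tokens, lmk_tokens, lmk_tokens)
        lmk_att, _ = self.lmk_from_img(lmk_tokens, img_tokens, img_tokens)
        return (self.norm_img(img_tokens + img_att),
                self.norm_lmk(lmk_tokens + lmk_att))


if __name__ == "__main__":
    # Toy usage: 49 image patch tokens and 68 landmark tokens per face, batch of 2.
    img = torch.randn(2, 49, 512)
    lmk = torch.randn(2, 68, 512)
    fused_img, fused_lmk = CrossFusionBlock()(img, lmk)
    print(fused_img.shape, fused_lmk.shape)  # (2, 49, 512) and (2, 68, 512)
```

Under the same assumptions, the pyramid structure mentioned in the abstract would amount to running such blocks on features at several resolutions and aggregating the outputs; that multi-scale wiring is omitted here for brevity.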
Related papers
- A Lightweight Attention-based Deep Network via Multi-Scale Feature Fusion for Multi-View Facial Expression Recognition [2.9581436761331017]
We introduce a lightweight attentional network incorporating multi-scale feature fusion (LANMSFF) to tackle these issues.
We present two novel components, namely mass attention (MassAtt) and point wise feature selection (PWFS) blocks.
Our proposed approach achieved results comparable to state-of-the-art methods in terms of parameter counts and robustness to pose variation.
arXiv Detail & Related papers (2024-03-21T11:40:51Z)
- POSTER V2: A simpler and stronger facial expression recognition network [8.836565857279052]
Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction.
POSTER V1 achieves the state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features.
In this paper, we propose POSTER V2, which improves POSTER V1 in three directions: cross-fusion, two-stream, and multi-scale feature extraction.
arXiv Detail & Related papers (2023-01-28T10:23:44Z)
- Distract Your Attention: Multi-head Cross Attention Network for Facial Expression Recognition [4.500212131331687]
We present a novel facial expression recognition network, called Distract your Attention Network (DAN)
Our method is based on two key observations. Multiple classes share inherently similar underlying facial appearance, and their differences could be subtle.
We propose our DAN with three key components: Feature Clustering Network (FCN), Multi-head cross Attention Network (MAN), and Attention Fusion Network (AFN)
arXiv Detail & Related papers (2021-09-15T13:15:54Z)
- P2T: Pyramid Pooling Transformer for Scene Understanding [62.41912463252468]
Plugging in our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T); see the pooling-based attention sketch after this list.
arXiv Detail & Related papers (2021-06-22T18:28:52Z)
- MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [77.44854719772702]
Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision.
In this work, we first propose a novel pure transformer-based mask vision transformer (MViT) for FER in the wild.
Our MViT outperforms state-of-the-art methods on RAF-DB with 88.62%, FERPlus with 89.22%, and AffectNet-7 with 64.57%, and achieves a comparable result on AffectNet-8 with 61.40%.
arXiv Detail & Related papers (2021-06-08T16:58:10Z)
- Robust Facial Expression Recognition with Convolutional Visual Transformers [23.05378099875569]
We propose Convolutional Visual Transformers to tackle Facial Expression Recognition in the wild in two main steps.
First, we propose an attentional selective fusion (ASF) for leveraging the feature maps generated by two-branch CNNs.
Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention.
arXiv Detail & Related papers (2021-03-31T07:07:56Z)
- Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping high-resolution (HR) and low-resolution (LR) face pairs into a joint feature space.
In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- TransFG: A Transformer Architecture for Fine-grained Recognition [27.76159820385425]
Recently, the vision transformer (ViT) has shown strong performance on traditional classification tasks.
We propose a novel transformer-based framework TransFG where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes.
arXiv Detail & Related papers (2021-03-14T17:03:53Z)
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT)
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
- DotFAN: A Domain-transferred Face Augmentation Network for Pose and Illumination Invariant Face Recognition [94.96686189033869]
We propose a 3D model-assisted domain-transferred face augmentation network (DotFAN)
DotFAN can generate a series of variants of an input face based on the knowledge distilled from existing rich face datasets collected from other domains.
Experiments show that DotFAN is beneficial for augmenting small face datasets to improve their within-class diversity.
arXiv Detail & Related papers (2020-02-23T08:16:34Z)
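As referenced in the P2T entry above, the following is a minimal sketch of pooling-based multi-head self-attention: keys and values are average-pooled at several ratios so that attention over a high-resolution feature map stays affordable. The pooling ratios, names, and single-block simplification are illustrative assumptions, not the P2T implementation.

```python
# Hedged sketch of pooling-based MHSA: queries stay at full resolution while keys and
# values come from a short, multi-ratio pooled token sequence. Assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, pool_ratios=(2, 4, 8)):
        super().__init__()
        self.pool_ratios = pool_ratios
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) tokens flattened from a feature map of spatial size (h, w).
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        pooled = []
        for r in self.pool_ratios:
            # Average-pool the map at each ratio and flatten back into tokens,
            # yielding a short multi-scale key/value sequence.
            p = F.adaptive_avg_pool2d(feat, (max(h // r, 1), max(w // r, 1)))
            pooled.append(p.flatten(2).transpose(1, 2))
        kv = torch.cat(pooled, dim=1)
        out, _ = self.attn(x, kv, kv)  # full-resolution queries, pooled keys/values
        return out


if __name__ == "__main__":
    tokens = torch.randn(2, 56 * 56, 256)  # e.g. a 56x56 feature map with 256 channels
    print(PooledSelfAttention()(tokens, 56, 56).shape)  # torch.Size([2, 3136, 256])
```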
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.