MViT: Mask Vision Transformer for Facial Expression Recognition in the
wild
- URL: http://arxiv.org/abs/2106.04520v1
- Date: Tue, 8 Jun 2021 16:58:10 GMT
- Title: MViT: Mask Vision Transformer for Facial Expression Recognition in the
wild
- Authors: Hanting Li, Mingzhe Sui, Feng Zhao, Zhengjun Zha, and Feng Wu
- Abstract summary: Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision.
In this work, we first propose a novel pure transformer-based mask vision transformer (MViT) for FER in the wild.
Our MViT outperforms state-of-the-art methods on RAF-DB (88.62%), FERPlus (89.22%), and AffectNet-7 (64.57%), and achieves a comparable result on AffectNet-8 (61.40%).
- Score: 77.44854719772702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial Expression Recognition (FER) in the wild is an extremely challenging
task in computer vision due to varied backgrounds, low-quality facial images,
and the subjectivity of annotators. These uncertainties make it difficult for
neural networks to learn robust features on limited-scale datasets. Moreover,
the networks can easily be disturbed by the above factors and make incorrect
decisions. Recently, the vision transformer (ViT) and data-efficient image
transformers (DeiT) have shown significant performance in traditional
classification tasks. The self-attention mechanism gives transformers a global
receptive field from the first layer, which dramatically enhances their feature
extraction capability. In this work, we first propose a novel pure
transformer-based mask vision transformer (MViT) for FER in the wild, which
consists of two modules: a transformer-based mask generation network (MGN) that
generates a mask to filter out complex backgrounds and occlusions in face
images, and a dynamic relabeling module that rectifies incorrect labels in
in-the-wild FER datasets. Extensive experimental results demonstrate that our
MViT outperforms state-of-the-art methods on RAF-DB (88.62%), FERPlus (89.22%),
and AffectNet-7 (64.57%), and achieves a comparable result on AffectNet-8
(61.40%).
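
To make the two modules described above concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the backbone depth, feature dimensions, mean pooling, and the relabeling margin are all assumptions made for illustration. A small transformer branch stands in for the MGN and produces a soft per-patch mask that down-weights background and occluded tokens, while a dynamic relabeling rule replaces a sample's label when the model is markedly more confident in another class.

import torch
import torch.nn as nn

class MaskedFERSketch(nn.Module):
    """Illustrative only: a transformer mask branch (stand-in for the MGN)
    paired with a transformer classifier; masked patch tokens contribute
    less to the pooled feature used for expression classification."""
    def __init__(self, num_classes=7, dim=256, feat_dim=768):
        super().__init__()
        self.embed = nn.Linear(feat_dim, dim)     # hypothetical patch-feature projection
        mgn_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        cls_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.mgn = nn.TransformerEncoder(mgn_layer, num_layers=2)       # mask generation branch
        self.mask_head = nn.Linear(dim, 1)                              # one mask logit per patch
        self.backbone = nn.TransformerEncoder(cls_layer, num_layers=4)  # classification branch
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, patch_feats):               # patch_feats: (B, num_patches, feat_dim)
        x = self.embed(patch_feats)
        mask = torch.sigmoid(self.mask_head(self.mgn(x)))   # (B, num_patches, 1), soft mask in [0, 1]
        x = self.backbone(x * mask)               # background/occluded patches are down-weighted
        return self.cls_head(x.mean(dim=1))       # (B, num_classes) expression logits

def dynamic_relabel(logits, labels, margin=0.2):
    """Replace a sample's label with the model's top prediction when that
    prediction is more probable than the given label by at least `margin`
    (the margin value is an assumption, not taken from the paper)."""
    probs = logits.softmax(dim=-1)
    top_prob, top_cls = probs.max(dim=-1)
    labeled_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return torch.where(top_prob - labeled_prob > margin, top_cls, labels)

# Toy usage: 196 patch features per image, RAF-DB-style 7 expression classes.
model = MaskedFERSketch()
logits = model(torch.randn(4, 196, 768))
print(dynamic_relabel(logits, torch.tensor([0, 1, 2, 3])))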
Related papers
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks and evaluate them quantitatively on three vision-and-language tasks and six benchmark datasets.
Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks on vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Robust Facial Expression Recognition with Convolutional Visual Transformers [23.05378099875569]
We propose Convolutional Visual Transformers to tackle Facial Expression Recognition in the wild in two main steps.
First, we propose an attentional selective fusion (ASF) for leveraging the feature maps generated by two-branch CNNs.
Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention.
arXiv Detail & Related papers (2021-03-31T07:07:56Z)
- Mask Attention Networks: Rethinking and Strengthen Transformer [70.95528238937861]
Transformer is an attention-based neural network which consists of two sublayers, the Self-Attention Network (SAN) and the Feed-Forward Network (FFN); a minimal sketch of this two-sublayer block appears after this list.
arXiv Detail & Related papers (2021-03-25T04:07:44Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
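
As referenced in the Mask Attention Networks entry above, the standard Transformer layer stacks exactly the two sublayers mentioned there: a self-attention sublayer (SAN) followed by a feed-forward sublayer (FFN), each with a residual connection and LayerNorm. The sketch below is a generic illustration of that block, not code from any of the listed papers; the post-norm arrangement, GELU activation, and all sizes are assumptions.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=256, heads=8, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        attn_out, _ = self.attn(x, x, x)       # every token attends to every other token
        x = self.norm1(x + attn_out)           # residual + LayerNorm around the SAN sublayer
        x = self.norm2(x + self.ffn(x))        # residual + LayerNorm around the FFN sublayer
        return x

tokens = torch.randn(2, 196, 256)              # e.g. 196 patch tokens per image
print(TransformerBlock()(tokens).shape)        # torch.Size([2, 196, 256])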
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.