MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic
Facial Expression Recognition
- URL: http://arxiv.org/abs/2307.02227v2
- Date: Tue, 8 Aug 2023 02:19:48 GMT
- Title: MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic
Facial Expression Recognition
- Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
- Abstract summary: MAE-DFER is a novel self-supervised method for dynamic facial expression recognition.
It uses large-scale self-supervised pre-training on abundant unlabeled data.
It consistently outperforms state-of-the-art supervised methods.
- Score: 47.29528724322795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic facial expression recognition (DFER) is essential to the development
of intelligent and empathetic machines. Prior efforts in this field mainly fall
into supervised learning paradigm, which is severely restricted by the limited
labeled data in existing datasets. Inspired by recent unprecedented success of
masked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel
self-supervised method which leverages large-scale self-supervised pre-training
on abundant unlabeled data to largely advance the development of DFER. Since
the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial
computation during fine-tuning, MAE-DFER develops an efficient local-global
interaction Transformer (LGI-Former) as the encoder. Moreover, in addition to
the standalone appearance content reconstruction in VideoMAE, MAE-DFER also
introduces explicit temporal facial motion modeling to encourage LGI-Former to
excavate both static appearance and dynamic motion information. Extensive
experiments on six datasets show that MAE-DFER consistently outperforms
state-of-the-art supervised methods by significant margins (e.g., +6.30% UAR
on DFEW and +8.34% UAR on MAFW), verifying that it can learn powerful dynamic
facial representations via large-scale self-supervised pre-training. Besides,
it has comparable or even better performance than VideoMAE, while largely
reducing the computational cost (about 38% of the FLOPs). We believe MAE-DFER has
paved a new way for the advancement of DFER and can inspire more relevant
research in this field and even other related tasks. Codes and models are
publicly available at https://github.com/sunlicai/MAE-DFER.
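
Below is a minimal sketch, in plain Python/NumPy, of the pre-training objective described in the abstract: a tube mask is applied to the patchified video, and the model is asked to reconstruct both static appearance (the masked pixels, as in VideoMAE) and an explicit temporal motion target. Using frame differences as the motion target, the 90% masking ratio, and the loss weighting are illustrative assumptions rather than the released implementation (see the repository above for that):

import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, rng=None):
    # Tube masking: the same spatial patches are hidden in every frame.
    if rng is None:
        rng = np.random.default_rng(0)
    masked = rng.choice(num_patches, size=int(num_patches * mask_ratio), replace=False)
    mask = np.zeros((num_frames, num_patches), dtype=bool)
    mask[:, masked] = True
    return mask

def reconstruction_targets(video_patches):
    # video_patches: (T, N, D) array of patchified frames.
    # Appearance target = raw patches; motion target = frame differences (an assumption).
    appearance = video_patches
    motion = np.diff(video_patches, axis=0)  # (T-1, N, D)
    return appearance, motion

def joint_loss(pred_app, pred_motion, video_patches, mask, motion_weight=1.0):
    # Mean-squared error on masked positions for both targets.
    appearance, motion = reconstruction_targets(video_patches)
    app_loss = np.mean((pred_app - appearance)[mask] ** 2)
    motion_loss = np.mean((pred_motion - motion)[mask[1:]] ** 2)
    return app_loss + motion_weight * motion_loss

# Toy usage: 8 frames, 196 patches per frame, 768-dimensional patches.
video = np.random.rand(8, 196, 768).astype(np.float32)
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.9)
loss = joint_loss(np.zeros_like(video), np.zeros((7, 196, 768), dtype=np.float32), video, mask)
print(f"joint reconstruction loss: {loss:.4f}")

The motion branch is what separates this objective from plain appearance reconstruction: a prediction that matches every masked pixel but misses how those pixels change between frames is still penalized.
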
Related papers
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- SVFAP: Self-supervised Video Facial Affect Perceiver [42.16505961654868]
Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP).
To address the dilemma faced by supervised methods, SVFAP leverages masked video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos.
To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition.
arXiv Detail & Related papers (2023-12-31T07:44:05Z)
- From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z)
- SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) have gained attention in the self-supervised learning paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy based on sampling high spatio-temporal tokens for MAE.
arXiv Detail & Related papers (2023-05-19T06:12:50Z)
- GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) are challenging to explore in large-scale 3D point clouds.
We propose a Generative Decoder for MAE (GD-MAE) to automatically merge the surrounding context.
We demonstrate the efficacy of the proposed method on several large-scale benchmarks, including KITTI and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE (a brief sketch of this mean-teacher idea follows the list below).
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Representation Learning with Video Deep InfoMax [26.692717942430185]
We extend DeepInfoMax to the video domain by leveraging similar structure in temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields significant gains on Kinetics-pretrained action recognition tasks.
arXiv Detail & Related papers (2020-07-27T02:28:47Z)
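
As a complement to the RC-MAE entry above, the following is a rough sketch of the mean-teacher idea in a masked-autoencoder setting, assuming a standard exponential-moving-average (EMA) weight update and an added consistency term between the student's and the teacher's reconstructions; the momentum value and loss weighting are illustrative, not RC-MAE's actual settings:

import numpy as np

def ema_update(teacher_params, student_params, momentum=0.999):
    # In-place EMA update: teacher <- m * teacher + (1 - m) * student.
    for name, p in student_params.items():
        teacher_params[name] = momentum * teacher_params[name] + (1.0 - momentum) * p

def reconstruction_consistency_loss(student_recon, teacher_recon, target, mask, consistency_weight=1.0):
    # Masked reconstruction loss plus a term pulling the student's output
    # toward the (more stable) EMA teacher's output.
    recon = np.mean((student_recon - target)[mask] ** 2)
    consistency = np.mean((student_recon - teacher_recon)[mask] ** 2)
    return recon + consistency_weight * consistency

# Toy usage: initialize the teacher as a copy of the student, then track it via EMA.
student = {"w": np.random.rand(4, 4)}
teacher = {k: v.copy() for k, v in student.items()}
ema_update(teacher, student)
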
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.