UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos
- URL: http://arxiv.org/abs/2409.06154v1
- Date: Tue, 10 Sep 2024 01:57:57 GMT
- Title: UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos
- Authors: Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, Richang Hong
- Abstract summary: UniLearn is a unified learning paradigm that integrates static facial expression recognition (SFER) data to enhance the dynamic facial expression recognition (DFER) task.
UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic facial expression recognition (DFER) is essential for understanding human emotions and behavior. However, conventional DFER methods, which primarily use dynamic facial data, often underutilize static expression images and their labels, limiting their performance and robustness. To overcome this, we introduce UniLearn, a novel unified learning paradigm that integrates static facial expression recognition (SFER) data to enhance the DFER task. UniLearn employs a dual-modal self-supervised pre-training method, leveraging both facial expression images and videos to enhance a ViT model's spatiotemporal representation capability. Then, the pre-trained model is fine-tuned on both static and dynamic expression datasets using a joint fine-tuning strategy. To prevent negative transfer during joint fine-tuning, we introduce an innovative Mixture of Adapter Experts (MoAE) module that enables task-specific knowledge acquisition and effectively integrates information from both static and dynamic expression data. Extensive experiments demonstrate UniLearn's effectiveness in leveraging complementary information from static and dynamic facial data, leading to more accurate and robust DFER. UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. The source code and model weights will be publicly available at https://github.com/MSA-LMC/UniLearn.
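The abstract names a Mixture of Adapter Experts (MoAE) module but gives no architectural details. The sketch below shows one common reading of such a module: bottleneck adapters softly combined by a softmax router and added residually to the token features, so that static-image and video tasks can specialize different experts during joint fine-tuning. The expert count, bottleneck width, and routing scheme here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AdapterExpert:
    """Bottleneck adapter: down-project, ReLU, up-project (residual added by caller)."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(0, 0.02, (dim, bottleneck))
        self.w_up = rng.normal(0, 0.02, (bottleneck, dim))

    def __call__(self, x):
        return np.maximum(x @ self.w_down, 0.0) @ self.w_up

class MoAE:
    """Mixture of Adapter Experts: a per-token router softly combines the
    experts' outputs, letting different tasks route to different experts."""
    def __init__(self, dim=64, bottleneck=16, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [AdapterExpert(dim, bottleneck, rng) for _ in range(n_experts)]
        self.router = rng.normal(0, 0.02, (dim, n_experts))

    def __call__(self, x):
        # x: (tokens, dim); gate each token over the experts
        gates = softmax(x @ self.router, axis=-1)            # (tokens, n_experts)
        outs = np.stack([e(x) for e in self.experts], 1)     # (tokens, n_experts, dim)
        mixed = (gates[..., None] * outs).sum(axis=1)        # (tokens, dim)
        return x + mixed                                     # residual connection

x = np.random.default_rng(1).normal(size=(8, 64))  # 8 ViT tokens of width 64
y = MoAE()(x)
```

In a real ViT backbone such a module would sit inside each transformer block; here plain arrays stand in for the frozen features.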
Related papers
- Boosting Unconstrained Face Recognition with Targeted Style Adversary [10.428185253933004]
We present a simple yet effective method to expand the training data by interpolating between instance-level feature statistics across labeled and unlabeled sets.
Our method, dubbed Targeted Style Adversary (TSA), is motivated by two observations: (i) the input domain is reflected in feature statistics, and (ii) face recognition model performance is influenced by style information.
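The TSA snippet describes expanding training data by interpolating instance-level feature statistics between labeled and unlabeled sets. A minimal sketch of that idea, in the style of feature-space style mixing, is below; the mixing of per-instance mean and standard deviation is a plausible reading, and the exact statistics and mixing schedule used by TSA are not given in the snippet.

```python
import numpy as np

def mix_feature_stats(feat_labeled, feat_unlabeled, lam=0.5, eps=1e-6):
    """Re-normalize a labeled feature map with statistics interpolated toward
    an unlabeled instance's statistics (style transfer in feature space).
    feat_*: (channels, positions) arrays; lam=1.0 keeps the original style."""
    mu_l = feat_labeled.mean(-1, keepdims=True)
    sd_l = feat_labeled.std(-1, keepdims=True) + eps
    mu_u = feat_unlabeled.mean(-1, keepdims=True)
    sd_u = feat_unlabeled.std(-1, keepdims=True) + eps
    mu_mix = lam * mu_l + (1 - lam) * mu_u
    sd_mix = lam * sd_l + (1 - lam) * sd_u
    # normalize with the labeled stats, then re-scale with the mixed stats
    return sd_mix * (feat_labeled - mu_l) / sd_l + mu_mix

f = np.random.default_rng(0).normal(2.0, 3.0, size=(16, 49))  # labeled feature map
g = np.random.default_rng(1).normal(0.0, 1.0, size=(16, 49))  # unlabeled feature map
mixed = mix_feature_stats(f, g, lam=0.7)
```

The content of the labeled feature map is preserved while its "style" (first- and second-order statistics) drifts toward the unlabeled instance.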
arXiv Detail & Related papers (2024-08-14T16:13:03Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z) - Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models [49.3179290313959]
The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks.
ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images.
The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset.
arXiv Detail & Related papers (2024-04-18T15:28:34Z) - From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z) - Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition [12.677143408225167]
We propose a semi-supervised learning framework that utilizes unlabeled face data to train expression recognition models effectively.
Our method uses a dynamic threshold module that can adaptively adjust the confidence threshold to fully utilize the face recognition data.
In the ABAW5 EXPR task, our method achieved excellent results on the official validation set.
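The snippet above describes a dynamic threshold module that adaptively adjusts the confidence cutoff for using unlabeled face data. The sketch below shows the general pattern: keep pseudo-labels whose confidence clears a threshold, and nudge the threshold based on how many samples pass. The specific update rule, target pass rate, and bounds are illustrative assumptions; the paper's exact module is not described in the snippet.

```python
import numpy as np

def select_pseudo_labels(probs, threshold):
    """Keep unlabeled samples whose max class probability clears the threshold.
    probs: (samples, classes) softmax outputs."""
    conf = probs.max(axis=1)
    mask = conf >= threshold
    return np.argmax(probs, axis=1)[mask], mask

def update_threshold(threshold, pass_rate, target=0.5, step=0.01,
                     lo=0.5, hi=0.99):
    """One illustrative adaptation rule: raise the threshold when too many
    samples pass (pseudo-labels may be noisy), lower it when too few do."""
    if pass_rate > target:
        threshold += step
    else:
        threshold -= step
    return float(np.clip(threshold, lo, hi))

probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.10, 0.90]])
labels, mask = select_pseudo_labels(probs, threshold=0.8)
```

Only the first and third samples clear the 0.8 cutoff, so only they would enter the next training round.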
arXiv Detail & Related papers (2023-03-15T13:43:06Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Noisy Student Training using Body Language Dataset Improves Facial
Expression Recognition [10.529781894367877]
In this paper, we use a self-training method that utilizes a combination of a labelled dataset and an unlabelled dataset.
Experimental analysis shows that training a noisy student network iteratively helps in achieving significantly better results.
Our results show that the proposed method achieves state-of-the-art performance on benchmark datasets.
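The noisy-student snippet describes iterative self-training: a teacher pseudo-labels the unlabeled set, a student is trained on both sets under noise, and the student becomes the next teacher. The toy sketch below runs that loop with a nearest-centroid classifier standing in for the network; the stand-in model, Gaussian input noise, and round count are assumptions for illustration only.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """'Train' a minimal stand-in model: one feature centroid per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, X):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def noisy_student(X_lab, y_lab, X_unlab, n_classes=2, rounds=3,
                  noise=0.1, seed=0):
    """Iterate: the teacher pseudo-labels the unlabeled pool, then a student
    is trained on labeled + pseudo-labeled data with input noise added."""
    rng = np.random.default_rng(seed)
    teacher = fit_centroids(X_lab, y_lab, n_classes)
    for _ in range(rounds):
        pseudo = predict(teacher, X_unlab)
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        X_noisy = X_all + rng.normal(0, noise, X_all.shape)  # student-side noise
        teacher = fit_centroids(X_noisy, y_all, n_classes)   # student becomes teacher
    return teacher

# Two well-separated toy clusters; a few labeled points, many unlabeled.
rng = np.random.default_rng(1)
X_lab = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
model = noisy_student(X_lab, y_lab, X_unlab)
preds = predict(model, X_unlab)
```

In the paper's setting the unlabeled pool is a body-language dataset and the model is a deep network, but the teacher-to-student handoff follows the same loop.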
arXiv Detail & Related papers (2020-08-06T13:45:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.