From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos
- URL: http://arxiv.org/abs/2312.05447v2
- Date: Sat, 7 Sep 2024 01:43:29 GMT
- Title: From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos
- Authors: Yin Chen, Jia Li, Shiguang Shan, Meng Wang, Richang Hong
- Abstract summary: Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
- Score: 88.08209394979178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model for DFER (i.e., S2D) by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, while TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than 10%) to the original image model. Moreover, we present a novel Emotion-Anchors (i.e., reference samples for each emotion category) based Self-Distillation Loss to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve state-of-the-art results.
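The listing does not include implementation details, so the following is a minimal, hypothetical PyTorch sketch of the adapter idea: a zero-initialized bottleneck module with a depth-wise temporal convolution, inserted residually alongside a frozen ViT so that the pre-trained image model's behavior is preserved at the start of training. The class name `TemporalAdapter` and the bottleneck width are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Bottleneck adapter that mixes information across the time axis.

    A hypothetical reading of a Temporal-Modeling Adapter: only this module
    is trained, keeping the added parameter count small relative to the ViT.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depth-wise 1D convolution over the frame dimension models local
        # temporal dynamics cheaply.
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                  padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # mapping and the frozen image model is undisturbed initially.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim)
        b, t, n, d = x.shape
        h = self.down(x)                                  # (b, t, n, bottleneck)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)   # (b*n, bottleneck, t)
        h = self.temporal(h)
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)    # (b, t, n, bottleneck)
        return x + self.up(h)                             # residual connection

# Usage: insert after each frozen ViT block; only adapters receive gradients.
adapter = TemporalAdapter(dim=768)
frames = torch.randn(2, 16, 197, 768)  # (batch, frames, ViT tokens, width)
assert adapter(frames).shape == frames.shape
```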
Related papers
- UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos [83.48170683672427]
UniLearn is a unified learning paradigm that integrates static facial expression recognition data to enhance the DFER task.
UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
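For reference, weighted average recall (WAR) weights each class's recall by its share of test samples, which makes it equal to overall accuracy; its common companion, unweighted average recall (UAR), averages per-class recalls uniformly. A small self-contained sketch:

```python
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Compute weighted and unweighted average recall over class labels."""
    classes, counts = np.unique(y_true, return_counts=True)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    war = float(np.sum(recalls * counts / counts.sum()))  # equals accuracy
    uar = float(recalls.mean())                           # macro-averaged recall
    return war, uar

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
print(war_uar(y_true, y_pred))  # (0.666..., 0.555...)
```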
arXiv Detail & Related papers (2024-09-10T01:57:57Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars [36.96390906514729]
The MegaPortraits model has demonstrated state-of-the-art results in this domain.
We introduce our EMOPortraits model, which enhances the model's capability to faithfully support intense, asymmetric facial expressions.
We propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions.
arXiv Detail & Related papers (2024-04-29T21:23:29Z) - EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z) - Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild [1.8604727699812171]
Video sequences often contain frames with different expression intensities, especially for facial expressions in real-world scenarios.
We propose the global convolution-attention block (GCA) to rescale the channels of the feature maps.
In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish samples with relatively low expression intensities.
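The listing does not describe the GCA block's internals; the sketch below shows one plausible reading of "rescale the channels of the feature maps", following the familiar squeeze-and-excitation pattern (global pooling, a small gating network, channel-wise scaling). Treat the layer sizes and the `ChannelRescale` name as assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ChannelRescale(nn.Module):
    """Squeeze-and-excitation-style channel rescaling (illustrative only)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context per channel
            nn.Conv2d(channels, channels // reduction, 1), # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), # excite
            nn.Sigmoid(),                                  # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # rescale each feature-map channel

feats = torch.randn(2, 256, 14, 14)
assert ChannelRescale(256)(feats).shape == feats.shape
```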
arXiv Detail & Related papers (2022-08-19T12:48:07Z) - Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
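Masked Autoencoder (MAE) pre-training, which the blurb mentions, masks a large fraction of image patches and trains the model to reconstruct them from the visible remainder. A minimal sketch of the masking step, using the standard 75% ratio rather than PF-ViT's (unstated) configuration:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) -> visible subset and kept indices."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                    # random score per patch
    keep = noise.argsort(dim=1)[:, :n_keep]     # keep the lowest-scoring patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    # Only `visible` is fed to the encoder; a light decoder later reconstructs
    # pixels of the masked patches, which is the pre-training objective.
    return visible, keep

x = torch.randn(4, 196, 768)                    # 14x14 patches, ViT-B width
vis, keep = random_masking(x)
print(vis.shape)                                # torch.Size([4, 49, 768])
```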
arXiv Detail & Related papers (2022-07-22T13:39:06Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Unsupervised Facial Action Unit Intensity Estimation via Differentiable Optimization [45.07851622835555]
We propose an unsupervised framework GE-Net for facial AU intensity estimation from a single image.
Our framework performs differentiable optimization, which iteratively updates the facial parameters to match the input image.
Experimental results demonstrate that our method can achieve state-of-the-art results compared with existing methods.
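The listing gives the optimization loop only in outline; below is a hedged sketch of the general analysis-by-synthesis pattern it describes, where facial parameters are updated by gradient descent until a differentiable renderer's output matches the input image. `render` is a stand-in for any differentiable face renderer and does not reproduce GE-Net's actual networks.

```python
import torch

def fit_parameters(image: torch.Tensor, render, n_params: int = 32,
                   steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Iteratively fit facial parameters to an input image (illustrative)."""
    params = torch.zeros(n_params, requires_grad=True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(render(params), image)
        loss.backward()   # gradients flow through the differentiable renderer
        opt.step()        # update parameters to better match the image
    return params.detach()
```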
arXiv Detail & Related papers (2020-04-13T12:56:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.