From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial
Expression Recognition in Videos
- URL: http://arxiv.org/abs/2312.05447v1
- Date: Sat, 9 Dec 2023 03:16:09 GMT
- Title: From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial
Expression Recognition in Videos
- Authors: Yin Chen, Jia Li, Shiguang Shan, Meng Wang and Richang Hong
- Abstract summary: Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
- Score: 94.49851812388061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic facial expression recognition (DFER) in the wild is still hindered by
data limitations, e.g., insufficient quantity and diversity of pose, occlusion
and illumination, as well as the inherent ambiguity of facial expressions. In
contrast, static facial expression recognition (SFER) currently shows much
higher performance and can benefit from more abundant high-quality training
data. Moreover, the appearance features and dynamic dependencies of DFER remain
largely unexplored. To tackle these challenges, we introduce a novel
Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and
dynamic information implicitly encoded in extracted facial landmark-aware
features, thereby significantly improving DFER performance. First, we build
and train an image model for SFER, which incorporates only a standard Vision
Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we
obtain our video model (i.e., S2D) for DFER by inserting Temporal-Modeling
Adapters (TMAs) into the image model. The MCPs enhance facial expression
features with landmark-aware features inferred by an off-the-shelf facial
landmark detector, while the TMAs capture and model the relationships of
dynamic changes in facial expressions, effectively extending the pre-trained
image model to videos. Notably, the MCPs and TMAs add only a small fraction of
trainable parameters (less than +10%) to the original image model. Moreover,
we present a novel Emotion-Anchor-based Self-Distillation Loss, where anchors
are reference samples for each emotion category, to reduce the detrimental
influence of ambiguous emotion labels, further enhancing our S2D. Experiments
conducted on popular SFER and DFER datasets show that our method achieves
state-of-the-art performance.
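The abstract describes the adapter-based static-to-dynamic extension and the anchor-based self-distillation loss only at a high level. Below is a minimal sketch of those two ideas, assuming PyTorch and an image backbone that returns per-frame token features; the module names, the bottleneck/depthwise temporal design, and the KL-divergence formulation are illustrative assumptions rather than the authors' implementation, and the Multi-View Complementary Prompters are omitted.

```python
# A minimal sketch, assuming PyTorch. It is NOT the S2D implementation: the
# adapter design, the loss formulation, and all names below are assumptions
# made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalModelingAdapter(nn.Module):
    """Bottleneck adapter that mixes information across frames (time axis)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise 1-D convolution over the frame dimension.
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                  padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping so the
        nn.init.zeros_(self.up.bias)    # frozen image model is unchanged at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, t, n, _ = x.shape
        h = self.down(x)                                 # (b, t, n, c)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)  # (b*n, c, t)
        h = F.gelu(self.temporal(h))                     # mix across frames
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)   # back to (b, t, n, c)
        return x + self.up(h)                            # residual connection


class StaticToDynamicSketch(nn.Module):
    """Frozen per-frame feature extractor plus trainable adapter and head."""

    def __init__(self, image_backbone: nn.Module, dim: int = 768, num_classes: int = 7):
        super().__init__()
        self.backbone = image_backbone  # assumed to map images to (N, tokens, dim)
        for p in self.backbone.parameters():
            p.requires_grad = False     # keep the SFER knowledge frozen
        self.tma = TemporalModelingAdapter(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W); run the image model frame by frame
        b, t = clip.shape[:2]
        tokens = self.backbone(clip.flatten(0, 1))        # (b*t, tokens, dim)
        tokens = tokens.reshape(b, t, *tokens.shape[1:])  # (b, t, tokens, dim)
        tokens = self.tma(tokens)                         # add temporal context
        return self.head(tokens.mean(dim=(1, 2)))         # pool frames and tokens


def anchor_self_distillation_loss(logits: torch.Tensor,
                                  anchor_logits: torch.Tensor,
                                  tau: float = 2.0) -> torch.Tensor:
    """Illustrative anchor-based self-distillation: pull a clip's prediction
    toward the softened prediction on reference (anchor) samples of its
    labelled emotion, reducing the impact of ambiguous labels."""
    teacher = F.softmax(anchor_logits.detach() / tau, dim=-1)
    student = F.log_softmax(logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * tau ** 2
```

In such a setup only the adapter and the classification head are trainable, which mirrors the abstract's claim that the added modules contribute less than +10% of the parameters; the actual S2D architecture, MCP design, and loss weighting are specified in the paper.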
Related papers
- FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation [12.894864326299544]
We present a novel tuning-free IPT2V framework by enhancing face knowledge of the pre-trained video model built on diffusion transformers (DiT).
arXiv Detail & Related papers (2025-02-19T06:50:27Z)
- SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers [30.06494915665044]
We present SkyReels-A1, a framework built upon video diffusion Transformer to facilitate portrait image animation.
SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence.
It is highly applicable to domains such as virtual avatars, remote communication, and digital media generation.
arXiv Detail & Related papers (2025-02-15T16:08:40Z)
- OSDFace: One-Step Diffusion Model for Face Restoration [72.5045389847792]
Diffusion models have demonstrated impressive performance in face restoration.
We propose OSDFace, a novel one-step diffusion model for face restoration.
Results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics.
arXiv Detail & Related papers (2024-11-26T07:07:48Z)
- UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos [83.48170683672427]
UniLearn is a unified learning paradigm that integrates static facial expression recognition data to enhance the DFER task.
UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars [36.96390906514729]
The MegaPortraits model has demonstrated state-of-the-art results in this domain.
We introduce our EMOPortraits model, which enhances the model's capability to faithfully support intense, asymmetric facial expressions.
We propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions.
arXiv Detail & Related papers (2024-04-29T21:23:29Z)
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z)
- Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild [1.8604727699812171]
Video sequences often contain frames with different expression intensities, especially for facial expressions in real-world scenarios.
We propose the global convolution-attention block (GCA) to rescale the channels of the feature maps.
In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish the samples with relatively low expression intensities.
arXiv Detail & Related papers (2022-08-19T12:48:07Z)
- Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z)
- Unsupervised Facial Action Unit Intensity Estimation via Differentiable Optimization [45.07851622835555]
We propose an unsupervised framework GE-Net for facial AU intensity estimation from a single image.
Our framework performs differentiable optimization, which iteratively updates the facial parameters to match the input image.
Experimental results demonstrate that our method can achieve state-of-the-art results compared with existing methods.
arXiv Detail & Related papers (2020-04-13T12:56:28Z)