From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial
Expression Recognition in Videos
- URL: http://arxiv.org/abs/2312.05447v1
- Date: Sat, 9 Dec 2023 03:16:09 GMT
- Authors: Yin Chen, Jia Li, Shiguang Shan, Meng Wang and Richang Hong
- Abstract summary: Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
- Score: 94.49851812388061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic facial expression recognition (DFER) in the wild is still hindered by
data limitations, e.g., insufficient quantity and diversity of pose, occlusion
and illumination, as well as the inherent ambiguity of facial expressions. In
contrast, static facial expression recognition (SFER) currently shows much
higher performance and can benefit from more abundant high-quality training
data. Moreover, the appearance features and dynamic dependencies of DFER remain
largely unexplored. To tackle these challenges, we introduce a novel
Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and
dynamic information implicitly encoded in extracted facial landmark-aware
features, thereby significantly improving DFER performance. Firstly, we build
and train an image model for SFER, which incorporates a standard Vision
Transformer (ViT) and Multi-View Complementary Prompters (MCPs) only. Then, we
obtain our video model (i.e., S2D), for DFER, by inserting Temporal-Modeling
Adapters (TMAs) into the image model. MCPs enhance facial expression features
with landmark-aware features inferred by an off-the-shelf facial landmark
detector. The TMAs capture and model the relationships of dynamic changes
in facial expressions, effectively extending the pre-trained image model to
videos. Notably, MCPs and TMAs add only a small fraction of trainable
parameters (less than +10%) to the original image model. Moreover, we present a novel
Emotion-Anchors (i.e., reference samples for each emotion category) based
Self-Distillation Loss to reduce the detrimental influence of ambiguous emotion
labels, further enhancing our S2D. Experiments conducted on popular SFER and
DFER datasets show that we achieve the state of the art.
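The abstract's parameter-efficiency claim (adapters adding less than +10% trainable parameters to a frozen image backbone) can be checked with a back-of-the-envelope calculation. The sketch below is a hypothetical illustration, not the authors' code: it assumes ViT-Base sizes (~86M parameters, 12 blocks, hidden size 768) and an assumed bottleneck adapter dimension, since the paper's abstract does not give the TMA internals.

```python
# Hypothetical sketch (not the authors' code): estimate the trainable-parameter
# overhead of inserting bottleneck adapters (in the spirit of TMAs) into a
# frozen ViT-Base backbone. All sizes below are illustrative assumptions.

def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters of one bottleneck adapter: down-projection + up-projection,
    each with weights and biases."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up

def trainable_fraction(backbone_params: int, num_blocks: int,
                       hidden_dim: int, bottleneck_dim: int,
                       adapters_per_block: int = 1) -> float:
    """Fraction of extra trainable parameters relative to the frozen backbone."""
    added = num_blocks * adapters_per_block * adapter_params(hidden_dim, bottleneck_dim)
    return added / backbone_params

# Assumed ViT-Base backbone: ~86M parameters, 12 transformer blocks, width 768.
frac = trainable_fraction(backbone_params=86_000_000, num_blocks=12,
                          hidden_dim=768, bottleneck_dim=128,
                          adapters_per_block=2)
print(f"added trainable params: {frac:.1%} of the backbone")  # ~5.5%
```

With a bottleneck of 128 and two adapters per block, the overhead works out to roughly 5.5% of an 86M-parameter backbone, which is consistent with the sub-10% budget the abstract reports.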
Related papers
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
FineCLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars [36.96390906514729]
MegaPortraits model has demonstrated state-of-the-art results in this domain.
We introduce our EMOPortraits model, where we: Enhance the model's capability to faithfully support intense, asymmetric face expressions.
We propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions.
arXiv Detail & Related papers (2024-04-29T21:23:29Z)
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z)
- GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expressions.
We achieve higher-quality and more accurate facial expression transfer results than state-of-the-art methods, and demonstrate applicability to various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild [1.8604727699812171]
Video sequences often contain frames with different expression intensities, especially for facial expressions in real-world scenarios.
We propose the global convolution-attention block (GCA) to rescale the channels of the feature maps.
In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish the samples with relatively low expression intensities.
arXiv Detail & Related papers (2022-08-19T12:48:07Z)
- FDNeRF: Few-shot Dynamic Neural Radiance Fields for Face Reconstruction and Expression Editing [27.014582934266492]
We propose a Few-shot Dynamic Neural Radiance Field (FDNeRF), the first NeRF-based method capable of reconstruction and expression editing of 3D faces.
Unlike existing dynamic NeRFs that require dense images as input and can only be modeled for a single identity, our method enables face reconstruction across different persons with few-shot inputs.
arXiv Detail & Related papers (2022-08-11T11:05:59Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Unsupervised Facial Action Unit Intensity Estimation via Differentiable Optimization [45.07851622835555]
We propose an unsupervised framework GE-Net for facial AU intensity estimation from a single image.
Our framework performs differentiable optimization, which iteratively updates the facial parameters to match the input image.
Experimental results demonstrate that our method can achieve state-of-the-art results compared with existing methods.
arXiv Detail & Related papers (2020-04-13T12:56:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.