Related papers: Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

URL: http://arxiv.org/abs/2409.06154v2
Date: Fri, 10 Jan 2025 09:44:43 GMT
Title: Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data
Authors: Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, Richang Hong,
Abstract summary: We propose a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER.<n>S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Transformer (ViT) encoder-decoder architecture.<n>Experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance.
Score: 83.48170683672427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfied performance largely due to fewer training samples compared to SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.

Related papers

Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
MIDAS: Mixing Ambiguous Data with Soft Labels for Dynamic Facial Expression Recognition [11.89503569570198]
We propose MIDAS, a data augmentation method for dynamic facial expression recognition (DFER) In MIDAS, the training data are augmented by convexly combining pairs of video frames and their corresponding emotion class labels. The results demonstrate that the model trained on the data augmented by MIDAS outperforms the existing state-of-the-art method trained on the original dataset.
arXiv Detail & Related papers (2025-02-28T21:39:19Z)
Robust Dynamic Facial Expression Recognition [6.626374248579249]
This paper proposes a robust method of distinguishing between hard and noisy samples. To identify the principal expression in a video, a key expression re-sampling framework and a dual-stream hierarchical network is proposed. The proposed method has been shown to outperform current State-Of-The-Art approaches in DFER.
arXiv Detail & Related papers (2025-02-22T07:48:12Z)
Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild [3.3905929183808796]
In-the-wild dynamic facial expression recognition (DFER) encounters a significant challenge in recognizing emotion-related expressions. We propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD) IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner.
arXiv Detail & Related papers (2024-12-17T18:45:53Z)
Boosting Unconstrained Face Recognition with Targeted Style Adversary [10.428185253933004]
We present a simple yet effective method to expand the training data by interpolating between instance-level feature statistics across labeled and unlabeled sets. Our method, dubbed Targeted Style Adversary (TSA), is motivated by two observations: (i) the input domain is reflected in feature statistics, and (ii) face recognition model performance is influenced by style information.
arXiv Detail & Related papers (2024-08-14T16:13:03Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation. HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model. We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z)
Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models [49.3179290313959]
The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset.
arXiv Detail & Related papers (2024-04-18T15:28:34Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain [18.165807360855435]
Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks. We introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision.
arXiv Detail & Related papers (2024-03-11T04:49:41Z)
From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations. We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z)
EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z)
Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition [12.677143408225167]
We propose a semi-supervised learning framework that utilizes unlabeled face data to train expression recognition models effectively. Our method uses a dynamic threshold module that can adaptively adjust the confidence threshold to fully utilize the face recognition data. In the ABAW5 EXPR task, our method achieved excellent results on the official validation set.
arXiv Detail & Related papers (2023-03-15T13:43:06Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER) Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity. We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition [10.529781894367877]
In this paper, we use a self-training method that utilizes a combination of a labelled dataset and an unlabelled dataset. Experimental analysis shows that training a noisy student network iteratively helps in achieving significantly better results. Our results show that the proposed method achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-08-06T13:45:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.