Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild
- URL: http://arxiv.org/abs/2412.13168v2
- Date: Wed, 18 Dec 2024 09:47:15 GMT
- Title: Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild
- Authors: Xingjian Wang, Li Chai,
- Abstract summary: In-the-wild dynamic facial expression recognition (DFER) encounters a significant challenge in recognizing emotion-related expressions.
We propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD)
IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner.
- Score: 3.3905929183808796
- License:
- Abstract: In-the-wild dynamic facial expression recognition (DFER) encounters a significant challenge in recognizing emotion-related expressions, which are often temporally and spatially diluted by emotion-irrelevant expressions and global context. Most prior DFER methods directly utilize coupled spatiotemporal representations that may incorporate weakly relevant features with emotion-irrelevant context bias. Several DFER methods highlight dynamic information for DFER, but following explicit guidance that may be vulnerable to irrelevant motion. In this paper, we propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD). Through expanding wavelet lifting scheme to fully learnable framework, IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner, i.e., without exploit operations and external guidance. The disentanglement process contains two stages. The first is Inter-frame Static-dynamic Splitting Module (ISSM) for rough disentanglement estimation, which explores inter-frame correlation to generate content-aware splitting indexes on-the-fly. We utilize these indexes to split frame features into two groups, one with greater global similarity, and the other with more unique dynamic features. The second stage is Lifting-based Aggregation-Disentanglement Module (LADM) for further refinement. LADM first aggregates two groups of features from ISSM to obtain fine-grained global context features by an updater, and then disentangles emotion-related facial dynamic features from the global context by a predictor. Extensive experiments on in-the-wild datasets have demonstrated that IFDD outperforms prior supervised DFER methods with higher recognition accuracy and comparable efficiency. Code is available at https://github.com/CyberPegasus/IFDD.
Related papers
- Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a paradigm-shifting framework that bypasses MLLM finetuning entirely.
DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations.
arXiv Detail & Related papers (2025-01-26T03:29:18Z) - Spatio-Temporal Fuzzy-oriented Multi-Modal Meta-Learning for Fine-grained Emotion Recognition [26.882865792905754]
Fine-grained emotion recognition (FER) plays a vital role in various fields, such as disease diagnosis, personalized recommendations, and multimedia mining.
Existing FER methods face three key challenges in real-world applications: (i) they rely on large amounts of continuously annotated data to ensure accuracy since emotions are complex and ambiguous in reality, which is costly and time-consuming; (ii) they cannot capture the temporal heterogeneity caused by changing emotion patterns, because they usually assume that the temporal correlation within sampling periods is the same; (iii) they do not consider the spatial heterogeneity of different FER scenarios, that is, the distribution of emotion
arXiv Detail & Related papers (2024-12-18T06:40:53Z) - Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data [83.48170683672427]
We propose a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER.
S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Transformer (ViT) encoder-decoder architecture.
Experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance.
arXiv Detail & Related papers (2024-09-10T01:57:57Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose Stextsuperscript2RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z) - From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z) - DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in
Dynamic Environments [6.5121327691369615]
We present DytanVO, the first supervised learning-based VO method that deals with dynamic environments.
Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments.
arXiv Detail & Related papers (2022-09-17T23:56:03Z) - Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.