Multi Modal Facial Expression Recognition with Transformer-Based Fusion
Networks and Dynamic Sampling
- URL: http://arxiv.org/abs/2303.08419v2
- Date: Sun, 19 Mar 2023 04:47:43 GMT
- Title: Multi Modal Facial Expression Recognition with Transformer-Based Fusion
Networks and Dynamic Sampling
- Authors: Jun-Hwa Kim, Namho Kim, Chee Sun Won
- Abstract summary: We introduce a Modal Fusion Module (MFM) to fuse audio-visual information, where image and audio features are extracted by Swin Transformers.
Our model has been evaluated in the Affective Behavior Analysis in-the-wild (ABAW) challenge of CVPR 2023.
- Score: 1.983814021949464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition is an essential task for various applications,
including emotion detection, mental health analysis, and human-machine
interactions. In this paper, we propose a multi-modal facial expression
recognition method that exploits audio information along with facial images to
provide a crucial clue to differentiate some ambiguous facial expressions.
Specifically, we introduce a Modal Fusion Module (MFM) to fuse audio-visual
information, where image and audio features are extracted by Swin
Transformers. Additionally, we tackle the imbalance problem in the dataset by
employing dynamic data resampling. Our model has been evaluated in the
Affective Behavior Analysis in-the-wild (ABAW) challenge of CVPR 2023.
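The abstract names the two key components, a transformer-based Modal Fusion Module and dynamic data resampling, but gives no implementation details. The sketch below is a hypothetical PyTorch illustration under stated assumptions: cross-attention stands in for the unspecified MFM, plain linear projections stand in for the Swin Transformer feature extractors, and an inverse-class-frequency WeightedRandomSampler (rebuilt each epoch) stands in for the paper's dynamic resampling. All names and design choices here are assumptions for illustration, not the authors' code.

```python
# A minimal, hypothetical sketch -- NOT the authors' implementation.
# Assumptions: the Modal Fusion Module (MFM) is illustrated with cross-attention
# between visual and audio token sequences; the Swin Transformer backbones are
# stubbed with linear projections; "dynamic data resampling" is illustrated with
# an inverse-class-frequency WeightedRandomSampler that can be rebuilt each epoch.
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler


class ModalFusionModule(nn.Module):
    """Fuses visual and audio tokens with cross-attention (assumed design)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, aud_tokens):
        # Visual tokens attend to audio tokens; residual + MLP as in a standard block.
        fused, _ = self.cross_attn(query=vis_tokens, key=aud_tokens, value=aud_tokens)
        x = self.norm1(vis_tokens + fused)
        return self.norm2(x + self.mlp(x))


class AudioVisualFER(nn.Module):
    """Toy audio-visual expression classifier; backbones are stand-ins for Swin Transformers."""

    def __init__(self, vis_feat=768, aud_feat=768, dim=256, num_classes=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_feat, dim)  # stand-in for a visual Swin Transformer
        self.aud_proj = nn.Linear(aud_feat, dim)  # stand-in for an audio (spectrogram) Swin Transformer
        self.mfm = ModalFusionModule(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, vis_feats, aud_feats):
        v = self.vis_proj(vis_feats)          # (B, Nv, dim)
        a = self.aud_proj(aud_feats)          # (B, Na, dim)
        fused = self.mfm(v, a).mean(dim=1)    # pool the fused visual tokens
        return self.head(fused)


def make_resampler(labels, num_classes):
    """Inverse-frequency sampler; rebuilding it per epoch gives a simple 'dynamic' scheme."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    weights = (1.0 / counts)[labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)


if __name__ == "__main__":
    model = AudioVisualFER()
    logits = model(torch.randn(2, 49, 768), torch.randn(2, 30, 768))
    print(logits.shape)  # torch.Size([2, 8])
    sampler = make_resampler(torch.tensor([0, 0, 0, 1, 2, 2]), num_classes=8)
    print(len(list(sampler)))  # 6 indices drawn with inverse-frequency weights
```

A dataloader built with such a sampler draws minority-class clips more often, which is one common way to realize dynamic resampling; the scheme actually used in the paper may differ.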
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z) - Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z) - Interpretable Multimodal Emotion Recognition using Facial Features and
Physiological Signals [16.549488750320336]
It introduces a multimodal framework for emotion understanding by fusing information from visual facial features and rPPG signals extracted from the input videos.
An interpretability technique based on permutation importance analysis has also been implemented.
arXiv Detail & Related papers (2023-06-05T12:57:07Z) - Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z) - Facial Expression Recognition with Swin Transformer [1.983814021949464]
We introduce a Swin Transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset.
Specifically, we employ a three-stream network for the audio-visual videos to fuse the multi-modal information for facial expression recognition.
arXiv Detail & Related papers (2022-03-25T06:42:31Z) - Transformer-based Multimodal Information Fusion for Facial Expression
Analysis [10.548915939047305]
We introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW), which defines four competition tasks.
The available multimodal information consists of spoken words, speech prosody, and visual expression in videos.
Our work proposes four unified transformer-based network frameworks to fuse the above multimodal information.
arXiv Detail & Related papers (2022-03-23T12:38:50Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - SynFace: Face Recognition with Synthetic Data [83.15838126703719]
We devise the SynFace with identity mixup (IM) and domain mixup (DM) to mitigate the performance gap.
We also perform a systematic empirical analysis on synthetic face images to provide insights on how to effectively utilize synthetic data for face recognition.
arXiv Detail & Related papers (2021-08-18T03:41:54Z) - MAFER: a Multi-resolution Approach to Facial Expression Recognition [9.878384185493623]
We propose a two-step learning procedure, named MAFER, to train Deep Learning models tasked with recognizing facial expressions.
A relevant feature of MAFER is that it is task-agnostic, i.e., it can be used complementarily to other objective-related techniques.
arXiv Detail & Related papers (2021-05-06T07:26:58Z) - A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition on the observation that face images are often acquired at different resolutions.
To this end, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
arXiv Detail & Related papers (2021-03-09T21:21:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.