Facial Expression Recognition with Swin Transformer
- URL: http://arxiv.org/abs/2203.13472v1
- Date: Fri, 25 Mar 2022 06:42:31 GMT
- Title: Facial Expression Recognition with Swin Transformer
- Authors: Jun-Hwa Kim, Namho Kim, Chee Sun Won
- Abstract summary: We introduce a Swin transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset.
Specifically, we employ a three-stream network to fuse the multi-modal information of the audio-visual videos for facial expression recognition.
- Score: 1.983814021949464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of recognizing human facial expressions plays a vital role in various human-related systems, including health care and the medical field. With the recent success of deep learning and the availability of large amounts of annotated data, facial expression recognition research has matured enough to be applied to real-world scenarios with audio-visual datasets. In this paper, we introduce a Swin transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset. Specifically, we employ a three-stream network (i.e., a Visual stream, a Temporal stream, and an Audio stream) to fuse the multi-modal information of the audio-visual videos for facial expression recognition. Experimental results on the Aff-Wild2 dataset show the effectiveness of our proposed multi-modal approach.
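The abstract does not spell out the streams' internals, so the following is only a minimal sketch of the three-stream design. Everything here is an assumption for illustration: the backbone stand-ins (flatten-plus-linear encoders for the visual and audio streams, a GRU for the temporal stream), the late fusion by concatenation, and the module name ThreeStreamFER are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a three-stream audio-visual fusion model (hypothetical;
# the stream backbones and late-fusion-by-concatenation design are
# assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class ThreeStreamFER(nn.Module):
    def __init__(self, num_classes: int = 8, dim: int = 512):
        super().__init__()
        # Visual stream: per-frame spatial features (a Swin backbone would go here).
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        # Temporal stream: models frame-to-frame dynamics over the feature sequence.
        self.temporal = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        # Audio stream: embeds a log-mel spectrogram (or similar) of the audio track.
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        # Late fusion: concatenate the three stream embeddings and classify.
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, frames, spectrogram):
        # frames: (B, T, C, H, W); spectrogram: (B, F, T_audio)
        b, t = frames.shape[:2]
        per_frame = self.visual(frames.flatten(0, 1)).view(b, t, -1)  # (B, T, dim)
        vis = per_frame.mean(dim=1)              # pooled spatial features
        _, h = self.temporal(per_frame)          # final GRU state as temporal feature
        aud = self.audio(spectrogram)            # audio embedding
        fused = torch.cat([vis, h.squeeze(0), aud], dim=-1)
        return self.head(fused)
```

The choice illustrated is late fusion: each stream produces a fixed-size embedding, and a single linear head classifies their concatenation.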
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a Mixture-of-Experts architecture for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
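As a rough illustration of the two ideas named in this summary, here is a minimal sketch of a lightweight visual-to-speech projection and a soft mixture-of-experts layer. The dimensions, the soft (rather than sparse top-k) gating, and the class names are assumptions, not EVA's actual implementation.

```python
# Hypothetical sketch: a lightweight projection from visual tokens into the
# speech embedding space, plus a soft mixture-of-experts layer that blends
# expert outputs with a learned per-token gate.
import torch
import torch.nn as nn

class VisualToSpeechProjection(nn.Module):
    def __init__(self, vis_dim: int = 768, speech_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, speech_dim)  # "lightweight" = one linear map

    def forward(self, visual_tokens):               # (B, N, vis_dim)
        return self.proj(visual_tokens)             # (B, N, speech_dim)

class SoftMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                           # (B, N, dim)
        weights = self.gate(x).softmax(dim=-1)      # (B, N, E) per-token gating
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)
```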
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Multi Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling [1.983814021949464]
We introduce a Modal Fusion Module (MFM) to fuse audio-visual information, where image and audio features are extracted with a Swin Transformer.
Our model has been evaluated in the Affective Behavior in-the-wild (ABAW) challenge of CVPR 2023.
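The summary does not describe the MFM's internals; below is a minimal sketch of one common transformer-style fusion pattern, image tokens attending to audio tokens via cross-attention. The module name and dimensions are hypothetical, not the paper's design.

```python
# Hypothetical cross-attention fusion: image tokens query audio tokens,
# with a residual connection and layer norm, transformer-style.
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, audio_tokens):
        # img_tokens: (B, N_img, dim) queries; audio_tokens: (B, N_aud, dim) keys/values.
        fused, _ = self.attn(img_tokens, audio_tokens, audio_tokens)
        return self.norm(img_tokens + fused)  # residual + norm
```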
arXiv Detail & Related papers (2023-03-15T07:40:28Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
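As a rough sketch of the random-masking step this summary describes, the snippet below keeps a random subset of tokens and marks the rest for reconstruction. The mask ratio and function name are assumptions; the paper's exact masking strategy and reconstruction losses are not reproduced here.

```python
# Hypothetical random masking for a masked autoencoder over token sequences
# (image patches or text tokens alike).
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of tokens; return kept tokens and a boolean mask.

    tokens: (B, N, dim). The model would reconstruct the masked positions.
    """
    b, n, _ = tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    scores = torch.rand(b, n, device=tokens.device)   # random per-token scores
    keep = scores.argsort(dim=1)[:, :num_keep]        # indices of kept tokens
    kept = torch.gather(
        tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    )
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep, False)                     # True = masked (to reconstruct)
    return kept, mask
```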
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO improves facial expression recognition performance across six datasets with distinct affective representations.
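As a loose illustration of adapting only the last layer of a frozen facial encoder with a contrastive objective, here is a minimal NT-Xent-style training step. CIAO's actual inhibitory mechanism is not detailed in this summary, so the loss, the adapter, and the function name below are all assumptions.

```python
# Hypothetical sketch: the encoder is frozen and only a small adapter
# (the "last layer") is trained with a contrastive loss over two
# augmented views of the same faces.
import torch
import torch.nn.functional as F

def contrastive_adapt_step(encoder, adapter, x1, x2, temperature=0.1):
    """One InfoNCE-style step; x1, x2 are two augmented views of the same batch."""
    with torch.no_grad():                      # encoder stays frozen
        h1, h2 = encoder(x1), encoder(x2)
    z1 = F.normalize(adapter(h1), dim=-1)      # only the adapter trains
    z2 = F.normalize(adapter(h2), dim=-1)
    logits = z1 @ z2.t() / temperature         # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)    # matched pairs on the diagonal
```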
arXiv Detail & Related papers (2022-08-10T15:46:05Z)
- Transformer-based Multimodal Information Fusion for Facial Expression Analysis [10.548915939047305]
We introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW), which defines four competition tasks.
The available multimodal information consists of spoken words, speech prosody, and visual expressions in videos.
Our work proposes four unified transformer-based network frameworks to fuse the above multimodal information.
arXiv Detail & Related papers (2022-03-23T12:38:50Z)
- Towards a General Deep Feature Extractor for Facial Expression Recognition [5.012963825796511]
We propose a new deep learning-based approach that learns a visual feature extractor general enough to be applied to any other facial emotion recognition task or dataset.
DeepFEVER outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.
arXiv Detail & Related papers (2022-01-19T18:42:23Z)
- A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition in the observation that face images are often acquired at different resolutions.
To this end, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
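The Squeeze-and-Excitation block mentioned above has a standard formulation, sketched below in PyTorch; the reduction ratio and the block's placement in the network are assumptions rather than the paper's exact configuration.

```python
# Standard Squeeze-and-Excitation block: global-average-pool the spatial
# dimensions ("squeeze"), pass through a small bottleneck MLP with a sigmoid
# ("excitation"), and reweight the channels.
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel gates
        )

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight channels
```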
arXiv Detail & Related papers (2021-03-09T21:21:02Z)
- CapsField: Light Field-based Face and Expression Recognition in the Wild using Capsule Routing [81.21490913108835]
This paper proposes a new deep face and expression recognition solution, called CapsField, based on a convolutional neural network.
The proposed solution achieves superior performance for both face and expression recognition tasks when compared to the state-of-the-art.
arXiv Detail & Related papers (2021-01-10T09:06:02Z)
- Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework, a Compositional Generative Adversarial Network (Comp-GAN), that learns to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.