Facial Expression Recognition with Swin Transformer
- URL: http://arxiv.org/abs/2203.13472v1
- Date: Fri, 25 Mar 2022 06:42:31 GMT
- Title: Facial Expression Recognition with Swin Transformer
- Authors: Jun-Hwa Kim, Namho Kim, Chee Sun Won
- Abstract summary: We introduce a Swin transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset.
Specifically, we employ a three-stream network to fuse the multi-modal information of the audio-visual videos for facial expression recognition.
- Score: 1.983814021949464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of recognizing human facial expressions plays a vital role in various human-related systems, including health care and the medical field. With the recent success of deep learning and the availability of large amounts of annotated data, facial expression recognition research has matured enough to be applied to real-world scenarios with audio-visual datasets. In this paper, we introduce a Swin transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset. Specifically, we employ a three-stream network (i.e., a Visual stream, a Temporal stream, and an Audio stream) to fuse the multi-modal information of the audio-visual videos for facial expression recognition. Experimental results on the Aff-Wild2 dataset show the effectiveness of our proposed multi-modal approach.
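The abstract does not spell out the streams' internals, so the following is only a minimal sketch of the three-stream design. Everything here is an assumption for illustration: the backbone stand-ins (flatten-plus-linear encoders for the visual and audio streams, a GRU for the temporal stream), the late fusion by concatenation, and the module name ThreeStreamFER are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a three-stream audio-visual fusion model (hypothetical;
# the stream backbones and late-fusion-by-concatenation design are
# assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class ThreeStreamFER(nn.Module):
    def __init__(self, num_classes: int = 8, dim: int = 512):
        super().__init__()
        # Visual stream: per-frame spatial features (a Swin backbone would go here).
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        # Temporal stream: models frame-to-frame dynamics over the feature sequence.
        self.temporal = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        # Audio stream: embeds a log-mel spectrogram (or similar) of the audio track.
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        # Late fusion: concatenate the three stream embeddings and classify.
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, frames, spectrogram):
        # frames: (B, T, C, H, W); spectrogram: (B, F, T_audio)
        b, t = frames.shape[:2]
        per_frame = self.visual(frames.flatten(0, 1)).view(b, t, -1)  # (B, T, dim)
        vis = per_frame.mean(dim=1)              # pooled spatial features
        _, h = self.temporal(per_frame)          # final GRU state as temporal feature
        aud = self.audio(spectrogram)            # audio embedding
        fused = torch.cat([vis, h.squeeze(0), aud], dim=-1)
        return self.head(fused)
```

The choice illustrated is late fusion: each stream produces a fixed-size embedding, and a single linear head classifies their concatenation.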
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a Mixture-of-Experts architecture for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
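As a rough illustration of the two ideas named in this summary, here is a minimal sketch of a lightweight visual-to-speech projection and a soft mixture-of-experts layer. The dimensions, the soft (rather than sparse top-k) gating, and the class names are assumptions, not EVA's actual implementation.

```python
# Hypothetical sketch: a lightweight projection from visual tokens into the
# speech embedding space, plus a soft mixture-of-experts layer that blends
# expert outputs with a learned per-token gate.
import torch
import torch.nn as nn

class VisualToSpeechProjection(nn.Module):
    def __init__(self, vis_dim: int = 768, speech_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, speech_dim)  # "lightweight" = one linear map

    def forward(self, visual_tokens):               # (B, N, vis_dim)
        return self.proj(visual_tokens)             # (B, N, speech_dim)

class SoftMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                           # (B, N, dim)
        weights = self.gate(x).softmax(dim=-1)      # (B, N, E) per-token gating
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)
```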
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Multi Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling [1.983814021949464]
We introduce a Modal Fusion Module (MFM) to fuse audio-visual information, where image and audio features are extracted with a Swin Transformer.
Our model has been evaluated in the Affective Behavior in-the-wild (ABAW) challenge of CVPR 2023.
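The summary does not describe the MFM's internals; below is a minimal sketch of one common transformer-style fusion pattern, image tokens attending to audio tokens via cross-attention. The module name and dimensions are hypothetical, not the paper's design.

```python
# Hypothetical cross-attention fusion: image tokens query audio tokens,
# with a residual connection and layer norm, transformer-style.
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, audio_tokens):
        # img_tokens: (B, N_img, dim) queries; audio_tokens: (B, N_aud, dim) keys/values.
        fused, _ = self.attn(img_tokens, audio_tokens, audio_tokens)
        return self.norm(img_tokens + fused)  # residual + norm
```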
arXiv Detail & Related papers (2023-03-15T07:40:28Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
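As a rough sketch of the random-masking step this summary describes, the snippet below keeps a random subset of tokens and marks the rest for reconstruction. The mask ratio and function name are assumptions; the paper's exact masking strategy and reconstruction losses are not reproduced here.

```python
# Hypothetical random masking for a masked autoencoder over token sequences
# (image patches or text tokens alike).
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of tokens; return kept tokens and a boolean mask.

    tokens: (B, N, dim). The model would reconstruct the masked positions.
    """
    b, n, _ = tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    scores = torch.rand(b, n, device=tokens.device)   # random per-token scores
    keep = scores.argsort(dim=1)[:, :num_keep]        # indices of kept tokens
    kept = torch.gather(
        tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    )
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep, False)                     # True = masked (to reconstruct)
    return kept, mask
```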
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO improves facial expression recognition performance across six datasets with distinct affective representations.
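As a loose illustration of adapting only the last layer of a frozen facial encoder with a contrastive objective, here is a minimal NT-Xent-style training step. CIAO's actual inhibitory mechanism is not detailed in this summary, so the loss, the adapter, and the function name below are all assumptions.

```python
# Hypothetical sketch: the encoder is frozen and only a small adapter
# (the "last layer") is trained with a contrastive loss over two
# augmented views of the same faces.
import torch
import torch.nn.functional as F

def contrastive_adapt_step(encoder, adapter, x1, x2, temperature=0.1):
    """One InfoNCE-style step; x1, x2 are two augmented views of the same batch."""
    with torch.no_grad():                      # encoder stays frozen
        h1, h2 = encoder(x1), encoder(x2)
    z1 = F.normalize(adapter(h1), dim=-1)      # only the adapter trains
    z2 = F.normalize(adapter(h2), dim=-1)
    logits = z1 @ z2.t() / temperature         # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)    # matched pairs on the diagonal
```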
arXiv Detail & Related papers (2022-08-10T15:46:05Z)
- Transformer-based Multimodal Information Fusion for Facial Expression Analysis [10.548915939047305]
We introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW), which defines four competition tasks.
The available multimodal information consists of spoken words, speech prosody, and visual expressions in videos.
Our work proposes four unified transformer-based network frameworks to fuse the above multimodal information.
arXiv Detail & Related papers (2022-03-23T12:38:50Z)
- Towards a General Deep Feature Extractor for Facial Expression Recognition [5.012963825796511]
We propose a new deep learning-based approach that learns a visual feature extractor general enough to be applied to any other facial emotion recognition task or dataset.
DeepFEVER outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.
arXiv Detail & Related papers (2022-01-19T18:42:23Z)
- A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition in the observation that face images are often acquired at different resolutions.
To this end, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
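The Squeeze-and-Excitation block mentioned above has a standard formulation, sketched below in PyTorch; the reduction ratio and the block's placement in the network are assumptions rather than the paper's exact configuration.

```python
# Standard Squeeze-and-Excitation block: global-average-pool the spatial
# dimensions ("squeeze"), pass through a small bottleneck MLP with a sigmoid
# ("excitation"), and reweight the channels.
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel gates
        )

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight channels
```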
arXiv Detail & Related papers (2021-03-09T21:21:02Z)
- CapsField: Light Field-based Face and Expression Recognition in the Wild using Capsule Routing [81.21490913108835]
This paper proposes a new deep face and expression recognition solution, called CapsField, based on a convolutional neural network.
The proposed solution achieves superior performance for both face and expression recognition tasks when compared to the state-of-the-art.
arXiv Detail & Related papers (2021-01-10T09:06:02Z)
- Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework, a Compositional Generative Adversarial Network (Comp-GAN), that learns to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.