Multi-modal Facial Affective Analysis based on Masked Autoencoder
- URL: http://arxiv.org/abs/2303.10849v2
- Date: Tue, 11 Apr 2023 06:41:45 GMT
- Title: Multi-modal Facial Affective Analysis based on Masked Autoencoder
- Authors: Wei Zhang, Bowen Ma, Feng Qiu, Yu Ding
- Abstract summary: We introduce our submission to the CVPR 2023 ABAW5 competition (Affective Behavior Analysis in-the-wild).
Our approach involves several key components. First, we utilize the visual information from a Masked Autoencoder (MAE) model that has been pre-trained on a large-scale face image dataset in a self-supervised manner.
Our approach achieves impressive results in the ABAW5 competition, with average F1 scores of 55.49% and 41.21% in the AU and EXPR tracks, respectively.
- Score: 7.17338843593134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human affective behavior analysis focuses on analyzing human expressions or
other behaviors to enhance the understanding of human psychology. The CVPR 2023
Competition on Affective Behavior Analysis in-the-wild (ABAW) is dedicated to
providing the high-quality, large-scale Aff-wild2 dataset for the recognition of
commonly used emotion representations, such as Action Units (AU), basic
expression categories (EXPR), and Valence-Arousal (VA). The competition is
committed to making significant strides in improving the accuracy and
practicality of affective analysis research in real-world scenarios. In this
paper, we introduce our submission to the CVPR 2023 ABAW5 competition. Our approach
involves several key components. First, we utilize the visual information from
a Masked Autoencoder (MAE) model that has been pre-trained on a large-scale face
image dataset in a self-supervised manner. Next, we fine-tune the MAE encoder on
the image frames from Aff-wild2 for the AU, EXPR, and VA tasks, which can be
regarded as static, uni-modal training. Additionally, we leverage the
multi-modal and temporal information from the videos and implement a
transformer-based framework to fuse the multi-modal features. Our approach
achieves impressive results in the ABAW5 competition, with average F1 scores
of 55.49% and 41.21% in the AU and EXPR tracks, respectively, and an average
CCC of 0.6372 in the VA track. Our approach ranks first in the EXPR and AU
tracks, and second in the VA track. Extensive quantitative experiments and
ablation studies demonstrate the effectiveness of our proposed method.
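The abstract describes the fusion stage only at a high level. The PyTorch sketch below illustrates one plausible reading of a transformer-based temporal fusion over per-frame visual (MAE) and acoustic features, with separate AU/EXPR/VA heads; the modality set, feature dimensions, layer counts, and output sizes are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of transformer-based multi-modal temporal fusion
# (not the authors' code; all sizes are assumptions).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d_visual=768, d_audio=128, d_model=256,
                 n_layers=4, n_heads=8, n_aus=12, n_exprs=8):
        super().__init__()
        # Project each modality's frame-level features into a shared space.
        self.vis_proj = nn.Linear(d_visual, d_model)   # e.g. MAE/ViT frame features
        self.aud_proj = nn.Linear(d_audio, d_model)    # e.g. acoustic frame features
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Separate heads for the three ABAW tasks.
        self.au_head = nn.Linear(d_model, n_aus)      # multi-label AU logits
        self.expr_head = nn.Linear(d_model, n_exprs)  # expression logits
        self.va_head = nn.Linear(d_model, 2)          # valence / arousal

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, d_visual), aud_feats: (B, T, d_audio)
        x = self.vis_proj(vis_feats) + self.aud_proj(aud_feats)
        x = self.temporal(x)                          # temporal fusion over frames
        return {
            "au": self.au_head(x),                    # (B, T, n_aus)
            "expr": self.expr_head(x),                # (B, T, n_exprs)
            "va": torch.tanh(self.va_head(x)),        # (B, T, 2), values in [-1, 1]
        }

if __name__ == "__main__":
    model = MultiModalFusion()
    out = model(torch.randn(2, 64, 768), torch.randn(2, 64, 128))
    print({k: v.shape for k, v in out.items()})
```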
Related papers
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework, ViTGaze.
In contrast to previous methods, ViTGaze creates a brand new gaze following framework based mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception [97.55089867970874]
We introduce masked image modeling (MIM) as a pre-training approach for this task.
Motivated by this insight, we incorporate an intuitive human structure prior - human parts - into pre-training.
This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks.
arXiv Detail & Related papers (2023-10-31T17:56:11Z)
- Multi-modal Facial Action Unit Detection with Large Pre-trained Models for the 5th Competition on Affective Behavior Analysis in-the-wild [7.905280782507726]
This paper presents our submission to the Affective Behavior Analysis in-the-wild (ABAW) 2023 Competition for AU detection.
We propose a multi-modal method for facial action unit detection with visual, acoustic, and lexical features extracted from the large pre-trained models.
Our approach achieves an F1 score of 52.3% on the official validation set of the 5th ABAW Challenge.
arXiv Detail & Related papers (2023-03-19T07:18:14Z)
- Affective Behaviour Analysis Using Pretrained Model with Facial Priori [22.885249372875727]
We propose to utilize prior facial information via a Masked Auto-Encoder (MAE) pretrained on unlabeled face images.
We also combine an MAE-pretrained Vision Transformer (ViT) and an AffectNet-pretrained CNN to perform multi-task emotion recognition.
arXiv Detail & Related papers (2022-07-24T07:28:08Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice for self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition [18.478011167414223]
We introduce a multi-modal and multi-task learning method by using both visual and audio information.
We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set.
arXiv Detail & Related papers (2021-07-09T03:28:17Z)
- A Multi-term and Multi-task Analyzing Framework for Affective Analysis in-the-wild [0.2216657815393579]
We introduce the affective recognition method that was submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2020 Contest.
Since affective behaviors have many observable features, each with its own time frame, we introduced multiple optimized time windows.
We generated affective recognition models for each time window and ensembled these models together (a minimal sketch of this windowed ensembling follows the list).
arXiv Detail & Related papers (2020-09-29T09:24:29Z)
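The multi-term framework above only states that window-specific models are ensembled. Below is a minimal, hypothetical sketch of frame-level ensembling across time windows; the weighted-averaging scheme and array shapes are assumptions, not the paper's actual method.

```python
# Hypothetical sketch of ensembling per-frame predictions from models
# trained on different time windows (assumptions, not the paper's code).
import numpy as np

def ensemble_time_windows(per_window_probs, weights=None):
    """Average frame-level class probabilities from window-specific models.

    per_window_probs: list of arrays, each of shape (T, n_classes),
        one array per time-window model.
    weights: optional per-model weights (e.g. from validation scores).
    Returns per-frame class labels of shape (T,).
    """
    stacked = np.stack(per_window_probs, axis=0)        # (W, T, n_classes)
    if weights is None:
        weights = np.ones(len(per_window_probs))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    fused = np.tensordot(weights, stacked, axes=1)      # weighted mean -> (T, n_classes)
    return fused.argmax(axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 3 window-specific models, 100 frames, 7 expression classes (illustrative).
    probs = [rng.dirichlet(np.ones(7), size=100) for _ in range(3)]
    labels = ensemble_time_windows(probs, weights=[0.5, 0.3, 0.2])
    print(labels.shape)  # (100,)
```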