Facial Affect Recognition based on Transformer Encoder and Audiovisual
Fusion for the ABAW5 Challenge
- URL: http://arxiv.org/abs/2303.09158v2
- Date: Mon, 20 Mar 2023 12:17:53 GMT
- Title: Facial Affect Recognition based on Transformer Encoder and Audiovisual
Fusion for the ABAW5 Challenge
- Authors: Ziyang Zhang, Liuwei An, Zishun Cui, Ao xu, Tengteng Dong, Yueqi
Jiang, Jingyi Shi, Xin Liu, Xiao Sun, Meng Wang
- Abstract summary: We present our solutions for four sub-challenges of Valence-Arousal (VA) Estimation, Expression (Expr) Classification, Action Unit (AU) Detection and Emotional Reaction Intensity (ERI) Estimation.
The 5th ABAW competition focuses on facial affect recognition utilizing different modalities and datasets.
- Score: 10.88275919652131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our solutions for the 5th Workshop and Competition
on Affective Behavior Analysis in-the-wild (ABAW), which includes four
sub-challenges of Valence-Arousal (VA) Estimation, Expression (Expr)
Classification, Action Unit (AU) Detection and Emotional Reaction Intensity
(ERI) Estimation. The 5th ABAW competition focuses on facial affect recognition
utilizing different modalities and datasets. In our work, we extract powerful
audio and visual features using a large number of state-of-the-art (SOTA)
models. These features are fused by a Transformer Encoder and TEMMA. In
addition, to avoid the possible impact of large dimensional differences between
the various features, we design an Affine Module that aligns the different
features to the same dimension. Extensive experiments demonstrate the
superiority of the proposed method. For the VA
Estimation sub-challenge, our method obtains the mean Concordance Correlation
Coefficient (CCC) of 0.6066. For the Expression Classification sub-challenge,
the average F1 Score is 0.4055. For the AU Detection sub-challenge, the average
F1 Score is 0.5296. For the Emotional Reaction Intensity Estimation
sub-challenge, the average Pearson's correlation coefficient on the validation
set is 0.3968. The results on all four sub-challenges outperform the baseline
by a large margin.
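To make the pipeline above more concrete, the following is a minimal PyTorch sketch, not the authors' released code: it assumes each modality's features are aligned to a shared dimension by an affine projection, concatenated per frame, passed through a standard Transformer encoder, and scored with the Concordance Correlation Coefficient (CCC) used for VA evaluation. All module names, dimensions, and hyperparameters here are illustrative assumptions.
```python
# Minimal sketch (illustrative, not the authors' code). Assumed pipeline:
# per-modality affine alignment -> feature concatenation -> Transformer encoder
# over time -> valence/arousal regression, evaluated with CCC.
import torch
import torch.nn as nn


class AffineModule(nn.Module):
    """Projects a feature stream of arbitrary dimension to a shared model dimension."""

    def __init__(self, in_dim: int, model_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, model_dim), nn.LayerNorm(model_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, in_dim)
        return self.proj(x)


class AudioVisualFusion(nn.Module):
    """Fuses aligned audio and visual features with a Transformer encoder."""

    def __init__(self, audio_dim: int, visual_dim: int, model_dim: int = 256,
                 n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.align_audio = AffineModule(audio_dim, model_dim)
        self.align_visual = AffineModule(visual_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * model_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(2 * model_dim, 2)  # per-frame (valence, arousal)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.align_audio(audio), self.align_visual(visual)], dim=-1)
        return torch.tanh(self.head(self.encoder(fused)))  # values in [-1, 1]


def ccc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient, the metric reported for VA Estimation."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pv + tv + (pm - tm) ** 2)


if __name__ == "__main__":
    # Feature dimensions below are placeholders for whatever audio/visual
    # backbones are chosen for feature extraction.
    model = AudioVisualFusion(audio_dim=1024, visual_dim=768)
    audio = torch.randn(2, 100, 1024)   # (batch, frames, audio feature dim)
    visual = torch.randn(2, 100, 768)   # (batch, frames, visual feature dim)
    va = model(audio, visual)           # (2, 100, 2)
    labels = torch.empty_like(va).uniform_(-1, 1)
    print(ccc(va[..., 0], labels[..., 0]), ccc(va[..., 1], labels[..., 1]))
```
TEMMA, the second fusion model mentioned in the abstract, is omitted here; the single Transformer encoder stands in for the fusion step in this sketch.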
Related papers
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Multi-threshold Deep Metric Learning for Facial Expression Recognition [60.26967776920412]
We present the multi-threshold deep metric learning technique, which avoids difficult threshold validation.
We find that each threshold of the triplet loss intrinsically determines a distinctive distribution of inter-class variations.
This makes the embedding layer, which is composed of a set of slices, a more informative and discriminative feature representation.
arXiv Detail & Related papers (2024-06-24T08:27:31Z) - The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition [53.718777420180395]
This paper describes the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.
The 6th ABAW Competition addresses contemporary challenges in understanding human emotions and behaviors.
arXiv Detail & Related papers (2024-02-29T16:49:38Z) - EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition [10.411186945517148]
We propose a novel vision-language model that uses sample-level text descriptions as natural language supervision.
Our findings show that this approach yields significant improvements when compared to baseline methods.
We evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation.
arXiv Detail & Related papers (2023-10-25T13:43:36Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Multi-modal Facial Affective Analysis based on Masked Autoencoder [7.17338843593134]
We introduce our submission to the CVPR 2023 ABAW5 competition: Affective Behavior Analysis in-the-wild.
Our approach involves several key components. First, we utilize the visual information from a Masked Autoencoder (MAE) model that has been pre-trained on a large-scale face image dataset in a self-supervised manner.
Our approach achieves impressive results in the ABAW5 competition, with an average F1 score of 55.49% and 41.21% in the AU and EXPR tracks, respectively.
arXiv Detail & Related papers (2023-03-20T03:58:03Z)
- A Dual Branch Network for Emotional Reaction Intensity Estimation [12.677143408225167]
We propose a solution to the ERI challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) competition: a dual-branch multi-output regression model.
Spatial attention is used to better extract visual features, and Mel-Frequency Cepstral Coefficients (MFCCs) are used to extract acoustic features.
Our method achieves excellent results on the official validation set.
arXiv Detail & Related papers (2023-03-16T10:31:40Z)
- Multimodal Feature Extraction and Fusion for Emotional Reaction
Intensity Estimation and Expression Classification in Videos with
Transformers [47.16005553291036]
We present our solutions to the two sub-challenges of Affective Behavior Analysis in the wild (ABAW) 2023.
For the Expression Classification Challenge, we propose a streamlined approach that handles the challenges of classification effectively.
By studying, analyzing, and combining these features, we significantly enhance the model's accuracy for sentiment prediction in a multimodal context.
arXiv Detail & Related papers (2023-03-16T09:03:17Z)
- EmotiEffNet Facial Features in Uni-task Emotion Recognition in Video at
ABAW-5 competition [7.056222499095849]
We present the results of our team for the fifth Affective Behavior Analysis in-the-wild (ABAW) competition.
The usage of the pre-trained convolutional networks from the EmotiEffNet family for frame-level feature extraction is studied.
arXiv Detail & Related papers (2023-03-16T08:57:33Z)
- Leveraging TCN and Transformer for effective visual-audio fusion in
continuous emotion recognition [0.5370906227996627]
We present our approach to the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge.
We propose a novel multi-modal fusion model that leverages Temporal Convolutional Networks (TCN) and Transformer to enhance the performance of continuous emotion recognition.
arXiv Detail & Related papers (2023-03-15T04:15:57Z)
- ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit
Detection & Emotional Reaction Intensity Estimation Challenges [62.413819189049946]
The 5th Affective Behavior Analysis in-the-wild (ABAW) Competition is part of the respective ABAW Workshop, which will be held in conjunction with the IEEE Computer Vision and Pattern Recognition Conference (CVPR), 2023.
For this year's Competition, we feature two corpora: i) an extended version of the Aff-Wild2 database and ii) the Hume-Reaction dataset.
The latter dataset is an audiovisual one in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities.
arXiv Detail & Related papers (2023-03-02T18:58:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.