7ABAW-Compound Expression Recognition via Curriculum Learning
- URL: http://arxiv.org/abs/2503.07969v1
- Date: Tue, 11 Mar 2025 01:53:34 GMT
- Title: 7ABAW-Compound Expression Recognition via Curriculum Learning
- Authors: Chen Liu, Feng Qiu, Wei Zhang, Lincheng Li, Dadong Wang, Xin Yu
- Abstract summary: We present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. Our method achieves the best performance in this competition track with an F-score of 0.6063.
- Score: 25.64304473149263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the advent of deep learning, expression recognition has made significant advancements. However, due to the limited availability of annotated compound expression datasets and the subtle variations of compound expressions, Compound Emotion Recognition (CE) still holds considerable potential for exploration. To advance this task, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition introduces the Compound Expression Challenge based on C-EXPR-DB, a limited dataset without labels. In this paper, we present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. This design ensures that our model first masters the fundamental features of basic expressions before being exposed to the complexities of compound emotions. Specifically, our designs can be summarized as follows: 1) Single-Expression Pre-training: The model is first trained on datasets containing single expressions to learn the foundational facial features associated with basic emotions. 2) Dynamic Compound Expression Generation: Given the scarcity of annotated compound expression datasets, we employ CutMix and Mixup techniques on the original single-expression images to create hybrid images exhibiting characteristics of multiple basic emotions. 3) Incremental Multi-Expression Integration: After performing well on single-expression tasks, the model is progressively exposed to multi-expression data, allowing the model to adapt to the complexity and variability of compound expressions. The official results indicate that our method achieves the best performance in this competition track with an F-score of 0.6063. Our code is released at https://github.com/YenanLiu/ABAW7th.
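Steps 2 and 3 of the abstract are concrete enough to sketch. Below is a minimal illustration of Dynamic Compound Expression Generation with Mixup and CutMix, plus a toy ramp-in schedule for Incremental Multi-Expression Integration; every hyperparameter, shape, and schedule value is an illustrative assumption, not the authors' released implementation (see the linked repository for that).

```python
# Sketch of step 2 (Dynamic Compound Expression Generation). Images are
# (C, H, W) float tensors; labels are one-hot float vectors over the basic
# emotions. The alpha values follow common Mixup/CutMix practice, not the paper.
import numpy as np
import torch

def mixup(xa, ya, xb, yb, alpha=0.4):
    """Pixel-wise blend of two single-expression samples; the soft label
    carries both constituent basic emotions."""
    lam = float(np.random.beta(alpha, alpha))
    return lam * xa + (1.0 - lam) * xb, lam * ya + (1.0 - lam) * yb

def cutmix(xa, ya, xb, yb, alpha=1.0):
    """Paste a random rectangle from xb into xa; mix labels by kept area."""
    lam = float(np.random.beta(alpha, alpha))
    _, h, w = xa.shape
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x = xa.clone()
    x[:, top:bottom, left:right] = xb[:, top:bottom, left:right]
    kept = 1.0 - (bottom - top) * (right - left) / (h * w)  # area still from xa
    return x, kept * ya + (1.0 - kept) * yb

def compound_fraction(epoch, start_epoch=10, ramp_epochs=20):
    """Step 3 (Incremental Multi-Expression Integration): fraction of each
    batch drawn from synthesized compound samples, ramped in only after the
    single-expression stage; this particular schedule is an assumption."""
    return float(np.clip((epoch - start_epoch) / ramp_epochs, 0.0, 1.0))

# Example: fuse a "happy" and a "surprised" face into a pseudo compound sample.
xa, xb = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
ya, yb = torch.eye(7)[3], torch.eye(7)[6]  # toy one-hot basic-emotion labels
x_mix, y_mix = mixup(xa, ya, xb, yb)
x_cut, y_cut = cutmix(xa, ya, xb, yb)
```

The soft labels let a standard soft-target cross-entropy (or per-class BCE) head learn that one hybrid image carries two basic emotions at once.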
Related papers
- Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation [58.189703277322224]
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion.
Emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models.
We propose learning content and emotion priors as guidance, augmented with contrastive learning, to obtain decoupled content and emotion representations.
arXiv Detail & Related papers (2025-04-08T04:34:38Z)
- Compound Expression Recognition via Large Vision-Language Models [9.401699207785015]
Compound Expression Recognition (CER) is crucial for understanding human emotions and improving human-computer interaction.
To address the challenges of this task, we propose a novel approach leveraging Large Vision-Language Models (LVLMs).
Our method employs a two-stage fine-tuning process: first, pre-trained LVLMs are fine-tuned on basic facial expressions to establish foundational patterns; second, the model is further optimized on a compound-expression dataset to refine visual-language feature interactions.
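The two-stage schedule above maps onto a very small amount of training code. The sketch below substitutes a toy classifier and random tensors for the unnamed LVLM and datasets, so everything here (backbone, learning rates, epoch counts) is a placeholder assumption rather than the paper's setup; only the stage ordering follows the summary.

```python
# Two-stage fine-tuning sketch: basic expressions first, then compound data.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the LVLM and the two datasets (all assumptions).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 7))
basic_ds = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 7, (64,)))
compound_ds = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 7, (64,)))

def finetune(model, dataset, lr, epochs, batch_size=16):
    """One supervised fine-tuning stage over the given dataset."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in DataLoader(dataset, batch_size=batch_size, shuffle=True):
            loss = loss_fn(model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Stage 1: basic expressions establish foundational patterns.
model = finetune(model, basic_ds, lr=1e-4, epochs=1)
# Stage 2: a lower learning rate on compound data refines the learned features.
model = finetune(model, compound_ds, lr=1e-5, epochs=1)
```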
arXiv Detail & Related papers (2025-03-14T09:46:05Z)
- Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data [83.48170683672427]
We propose a unified dual-modal learning framework that integrates static facial expression recognition (SFER) data as a complementary resource for dynamic facial expression recognition (DFER). S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture. Experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance.
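As a rough, assumption-laden illustration of the shared-encoder idea, the sketch below tokenizes video frames exactly like static images, so a single Transformer encoder serves both SFER and DFER inputs; the masking objective, decoder, and all dimensions of the actual S4D architecture are omitted or invented here.

```python
# One encoder for both modalities: an image is handled as a 1-frame "video".
import torch
from torch import nn

class SharedViTEncoder(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, frames):                     # (B, T, 3, H, W); T=1 for images
        b, t, c, h, w = frames.shape
        tok = self.patchify(frames.flatten(0, 1))  # (B*T, dim, H/16, W/16)
        tok = tok.flatten(2).transpose(1, 2)       # (B*T, patches, dim)
        tok = tok.reshape(b, -1, tok.shape[-1])    # concatenate tokens across time
        return self.encoder(tok)

enc = SharedViTEncoder()
img_feats = enc(torch.randn(2, 1, 3, 224, 224))    # static expression image
vid_feats = enc(torch.randn(2, 8, 3, 224, 224))    # dynamic expression clip
```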
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
- Compound Expression Recognition via Multi Model Ensemble for the ABAW7 Challenge [6.26485278174662]
Compound Expression Recognition (CER) is vital for effective interpersonal interactions.
In this paper, we propose an ensemble learning-based solution to address the complexity of compound expressions.
Our method demonstrates high accuracy on the RAF-DB dataset and is capable of recognizing expressions in certain portions of C-EXPR-DB through zero-shot learning.
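Since the summary names ensemble learning but not the member models, here is a generic, hedged sketch of the idea: average softmax probabilities from several independently trained classifiers (the toy backbones below are stand-ins, not the paper's models).

```python
# Probability-averaging ensemble over independently trained classifiers.
import torch
from torch import nn

def ensemble_predict(models, images):
    """Average per-model softmax probabilities, then take the argmax class."""
    with torch.no_grad():
        probs = torch.stack([m(images).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)

# Toy stand-ins for trained expression classifiers (7 expression logits each).
models = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 7)).eval()
          for _ in range(3)]
preds = ensemble_predict(models, torch.randn(4, 3, 112, 112))
```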
arXiv Detail & Related papers (2024-07-17T01:59:34Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
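A hedged sketch of the stage-2 fusion step: the VLLM call is stubbed out (neither the model nor the prompt is specified in the summary), and a small classifier combines the generated description with image features. Every component name and dimension below is an illustrative assumption.

```python
# Stage 2: fuse a generated emotion description with image features.
import torch
from torch import nn

def describe_emotion(image) -> str:
    # Stand-in for prompting a VLLM, e.g. "Describe this person's apparent emotion."
    return "the subject appears mildly surprised, eyebrows raised"

class ContextFusionClassifier(nn.Module):
    def __init__(self, vocab=10_000, dim=128, n_classes=7):
        super().__init__()
        self.text = nn.EmbeddingBag(vocab, dim)    # encodes the tokenized description
        self.image = nn.Sequential(nn.Conv2d(3, dim, 7, 4), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, img, token_ids):
        feats = torch.cat([self.image(img), self.text(token_ids)], dim=-1)
        return self.head(feats)

model = ContextFusionClassifier()
tokens = torch.randint(0, 10_000, (2, 12))         # toy-tokenized description
logits = model(torch.randn(2, 3, 224, 224), tokens)
```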
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Compound Expression Recognition via Multi Model Ensemble [8.529105068848828]
Compound Expression Recognition plays a crucial role in interpersonal interactions.
We propose a solution based on ensemble learning methods for Compound Expression Recognition.
Our method achieves high accuracy on RAF-DB and recognizes expressions zero-shot on certain portions of C-EXPR-DB.
arXiv Detail & Related papers (2024-03-19T09:30:56Z)
- Zero-shot Compound Expression Recognition with Visual Language Model at the 6th ABAW Challenge [11.49671335206114]
We propose a zero-shot approach for recognizing compound expressions by leveraging a pretrained visual language model integrated with traditional CNNs.
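A minimal zero-shot sketch in the spirit of this entry, using OpenAI's clip package as a stand-in visual language model (the paper's actual VLM and its CNN integration are not specified here). The prompt template and the file path are placeholders; the seven class names are the C-EXPR-DB compound categories used in the ABAW challenge.

```python
# Zero-shot compound expression recognition via image-text similarity.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

compound_classes = ["fearfully surprised", "happily surprised", "sadly surprised",
                    "disgustedly surprised", "angrily surprised", "sadly fearful",
                    "sadly angry"]
prompts = clip.tokenize([f"a photo of a {c} face" for c in compound_classes]).to(device)

image = preprocess(Image.open("face.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
print(compound_classes[probs.argmax().item()])
```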
arXiv Detail & Related papers (2024-03-18T03:59:24Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
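The data-side trick is easy to illustrate: concatenating k independent image-text pairs makes the frames act as a pseudo video and their captions, in frame order, act as the paired paragraph. The sampling scheme and k below are assumptions; COSA's actual pipeline and model are not shown.

```python
# Turn independent image-text pairs into a pseudo video-paragraph sample.
import random
import torch

def concatenate_samples(images, captions, k=4):
    """images: list of (3, H, W) tensors; captions: parallel list of strings."""
    idx = random.sample(range(len(images)), k)
    pseudo_video = torch.stack([images[i] for i in idx])   # (k, 3, H, W) frame stack
    pseudo_paragraph = " ".join(captions[i] for i in idx)  # captions in frame order
    return pseudo_video, pseudo_paragraph

imgs = [torch.randn(3, 224, 224) for _ in range(100)]
caps = [f"caption {i}" for i in range(100)]
video, paragraph = concatenate_samples(imgs, caps)
```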
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Learn-to-Decompose: Cascaded Decomposition Network for Cross-Domain Few-Shot Facial Expression Recognition [60.51225419301642]
We propose a novel cascaded decomposition network (CDNet) for compound facial expression recognition.
By training across similar tasks on basic expression datasets, CDNet acquires a learn-to-decompose ability that can be easily adapted to identify unseen compound expressions.
arXiv Detail & Related papers (2022-07-16T16:10:28Z)
- When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework [60.51225419301642]
We propose an Emotion Guided Similarity Network (EGS-Net) to address the diversity of human emotions in practical scenarios.
EGS-Net consists of an emotion branch and a similarity branch, based on a two-stage learning framework.
Experimental results on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method against several state-of-the-art methods.
arXiv Detail & Related papers (2022-01-18T07:24:12Z)
- Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Since uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework, Compositional Generative Adversarial Network (Comp-GAN), that learns to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)