Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach
- URL: http://arxiv.org/abs/2507.02205v2
- Date: Fri, 04 Jul 2025 14:42:15 GMT
- Title: Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach
- Authors: Elena Ryumina, Maxim Markitantov, Alexandr Axyonov, Dmitry Ryumin, Mikhail Dolgushin, Alexey Karpov
- Abstract summary: Compound Expression Recognition (CER) aims to detect complex emotional states formed by combinations of basic emotions. We present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline. The proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing.
- Score: 44.40745123728199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.
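To make the fusion stage concrete, below is a minimal PyTorch sketch of how a Multi-Head Probability Fusion (MHPF) step followed by Pair-Wise Probability Aggregation (PPA) could look. The class names, tensor shapes, the basic-to-compound pair list, and the mean-based pairwise aggregation rule are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of MHPF + PPA; shapes and aggregation rule are assumptions.
import torch
import torch.nn as nn

BASIC = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
# The seven compound classes of C-EXPR-DB, expressed as index pairs over BASIC.
COMPOUND_PAIRS = [
    (2, 5),  # Fearfully Surprised
    (3, 5),  # Happily Surprised
    (4, 5),  # Sadly Surprised
    (1, 5),  # Disgustedly Surprised
    (0, 5),  # Angrily Surprised
    (4, 2),  # Sadly Fearful
    (4, 0),  # Sadly Angry
]

class MHPF(nn.Module):
    """Dynamically weights per-modality probability vectors (assumed design)."""

    def __init__(self, n_modalities: int, n_classes: int):
        super().__init__()
        # One learnable head per modality scores that modality's reliability.
        self.heads = nn.ModuleList(
            [nn.Linear(n_classes, 1) for _ in range(n_modalities)]
        )

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        # probs: (batch, n_modalities, n_classes), each slice a softmax output.
        scores = torch.cat(
            [head(probs[:, m]) for m, head in enumerate(self.heads)], dim=1
        )  # (batch, n_modalities)
        weights = scores.softmax(dim=1).unsqueeze(-1)
        return (weights * probs).sum(dim=1)  # fused (batch, n_classes)

def ppa(fused: torch.Tensor) -> torch.Tensor:
    """Score each compound class by averaging the probabilities of its two
    constituent basic emotions (the paper's exact rule may differ)."""
    cols = [(fused[:, i] + fused[:, j]) / 2 for i, j in COMPOUND_PAIRS]
    return torch.stack(cols, dim=1)  # (batch, n_compound)

# Usage: six modalities, each emitting six basic-emotion probabilities.
probs = torch.rand(2, 6, len(BASIC)).softmax(dim=-1)
fused = MHPF(n_modalities=6, n_classes=len(BASIC))(probs)
print(ppa(fused).argmax(dim=1))  # predicted compound class per sample
```

The point of per-sample weighting is that an unreliable modality (an occluded face, silent audio) can be down-weighted example by example rather than with a single global weight.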
Related papers
- CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions.
arXiv Detail & Related papers (2025-06-19T06:31:08Z) - 7ABAW-Compound Expression Recognition via Curriculum Learning [25.64304473149263]
We present a curriculum learning-based framework that initially trains the model on single-expression tasks. Our method achieves the best performance in this competition track with an F-score of 0.6063.
arXiv Detail & Related papers (2025-03-11T01:53:34Z) - Compound Expression Recognition via Multi Model Ensemble for the ABAW7 Challenge [6.26485278174662]
Compound Expression Recognition (CER) is vital for effective interpersonal interactions.
In this paper, we propose an ensemble learning-based solution to address this complexity.
Our method demonstrates high accuracy on the RAF-DB dataset and is capable of recognizing expressions in certain portions of C-EXPR-DB through zero-shot learning.
arXiv Detail & Related papers (2024-07-17T01:59:34Z) - LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition [56.22672276092373]
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition. We propose a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels for semi-supervised FER.
arXiv Detail & Related papers (2024-04-23T13:43:33Z) - Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision [9.436107335675473]
This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition.
We propose a novel audio-visual method for compound expression recognition.
arXiv Detail & Related papers (2024-03-19T12:45:52Z) - Compound Expression Recognition via Multi Model Ensemble [8.529105068848828]
Compound Expression Recognition plays a crucial role in interpersonal interactions.
We propose a solution based on ensemble learning methods for Compound Expression Recognition.
Our method achieves high accuracy on RAF-DB and is able to recognize expressions on certain portions of C-EXPR-DB through zero-shot learning.
arXiv Detail & Related papers (2024-03-19T09:30:56Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Multi-modal Expression Recognition with Ensemble Method [9.880739481276835]
Multimodal feature combinations extracted by several different pre-trained models are applied to capture more effective emotional information.
For these combinations of visual and audio modal features, we utilize two temporal encoders to explore the temporal contextual information in the data.
Our system achieves an average F1 score of 0.45774 on the validation set.
arXiv Detail & Related papers (2023-03-17T15:03:58Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.