Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition
- URL: http://arxiv.org/abs/2503.10603v3
- Date: Tue, 25 Mar 2025 08:46:00 GMT
- Title: Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition
- Authors: Jun Yu, Lingsi Zhu, Yanjun Chi, Yunxiang Zhang, Yang Zheng, Yongqi Wang, Xilong Lu,
- Abstract summary: Emotional Mimicry Intensity (EMI) estimation plays a pivotal role in understanding human social behavior and advancing human-computer interaction. This paper proposes a dual-stage cross-modal alignment framework to address the limitations of existing methods. Experiments on the Hume-Vidmimic2 dataset demonstrate superior performance with an average Pearson correlation coefficient of 0.51 across six emotion dimensions.
- Score: 10.741278852581646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional Mimicry Intensity (EMI) estimation plays a pivotal role in understanding human social behavior and advancing human-computer interaction. The core challenges lie in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods--insufficient exploitation of cross-modal synergies, sensitivity to noise, and constrained fine-grained alignment capabilities--this paper proposes a dual-stage cross-modal alignment framework. Stage 1 develops vision-text and audio-text contrastive learning networks based on a CLIP architecture, achieving preliminary feature-space alignment through modality-decoupled pre-training. Stage 2 introduces a temporal-aware dynamic fusion module integrating Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to capture macro-evolution patterns of facial expressions and local dynamics of acoustic features, respectively. A novel quality-guided fusion strategy further enables differentiable weight allocation for modality compensation under occlusion and noise. Experiments on the Hume-Vidmimic2 dataset demonstrate superior performance with an average Pearson correlation coefficient of 0.51 across six emotion dimensions on the validation set. Remarkably, our method achieved 0.68 on the test set, securing the runner-up position in the EMI Challenge Track of the 8th ABAW (Affective Behavior Analysis in the Wild) Competition, offering a novel pathway for fine-grained emotion analysis in open environments.
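The abstract reports results as an average Pearson correlation coefficient across six emotion dimensions. As a minimal sketch of how such a score could be computed, assuming predictions and labels are stored as arrays of shape (samples, dimensions): the function name `mean_pearson` and the synthetic data below are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def mean_pearson(pred, target):
    """Return (average, per-dimension) Pearson correlation.

    Hypothetical helper: each column is treated as one emotion dimension.
    A column with zero variance yields NaN for that dimension.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.ndim != 2 or pred.shape != target.shape:
        raise ValueError("pred and target must be 2-D arrays of equal shape")

    # Center each column, then apply the textbook Pearson formula per column.
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    numerator = (pred_c * target_c).sum(axis=0)
    denominator = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    per_dim = numerator / denominator
    return per_dim.mean(), per_dim

# Synthetic stand-in data: 100 samples, 6 dimensions (not real challenge data).
rng = np.random.default_rng(0)
target = rng.random((100, 6))
pred = target + 0.5 * rng.standard_normal((100, 6))
avg, per_dim = mean_pearson(pred, target)
print("per-dimension:", per_dim.round(3), "average:", round(float(avg), 3))
```

Averaging the per-dimension coefficients, rather than correlating the flattened arrays, keeps each emotion dimension equally weighted regardless of its scale.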
Related papers
- Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis [14.029574339845476]
KAN-MCP is a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework.
We introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality.
This synergy of interpretability and robustness achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2.
arXiv Detail & Related papers (2025-04-16T15:00:06Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - MHSA: A Multi-scale Hypergraph Network for Mild Cognitive Impairment Detection via Synchronous and Attentive Fusion [4.526574526136158]
A Multi-scale Hypergraph Network for MCI Detection via Synchronous and Attentive Fusion is presented. Our approach employs the Phase-Locking Value (PLV) to calculate the phase synchronization relationship in the spectrum domain of regions of interest. We design a dynamic adjustment strategy for the PLV coefficients, and the dynamic hypergraph is modelled on a comprehensive temporal-spectrum fusion matrix.
arXiv Detail & Related papers (2024-12-11T02:59:57Z) - Synesthesia of Machines (SoM)-Enhanced ISAC Precoding for Vehicular Networks with Double Dynamics [15.847713094328286]
Integrated sensing and communication (ISAC) technology is vital for vehicular networks. Time-varying communication channels and rapid movement of targets present significant challenges for real-time precoding design. We propose a Synesthesia of Machines (SoM)-enhanced precoding paradigm that leverages modalities such as positioning and channel information to adapt to these dynamics.
arXiv Detail & Related papers (2024-08-24T10:35:10Z) - Interpretable Spatio-Temporal Embedding for Brain Structural-Effective Network with Ordinary Differential Equation [56.34634121544929]
In this study, we first construct the brain-effective network via the dynamic causal model.
We then introduce an interpretable graph learning framework termed Spatio-Temporal Embedding ODE (STE-ODE).
This framework incorporates specifically designed directed node embedding layers, aiming at capturing the dynamic interplay between structural and effective networks.
arXiv Detail & Related papers (2024-05-21T20:37:07Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
arXiv Detail & Related papers (2024-03-20T15:08:43Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to a known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.