Interactive Multimodal Fusion with Temporal Modeling
- URL: http://arxiv.org/abs/2503.10523v1
- Date: Thu, 13 Mar 2025 16:31:56 GMT
- Title: Interactive Multimodal Fusion with Temporal Modeling
- Authors: Jun Yu, Yongqi Wang, Lei Wang, Yang Zheng, Shengfan Xu
- Abstract summary: Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
- Score: 11.506800500772734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our method for the estimation of valence-arousal (VA) in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. These features undergo temporal modeling using Temporal Convolutional Networks (TCNs). We then apply cross-modal attention mechanisms, where visual features interact with audio features through query-key-value attention structures. Finally, the features are concatenated and passed through a regression layer to predict valence and arousal. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
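The fusion step described above, in which visual features act as queries against audio keys and values before concatenation and regression, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the projection matrices are random stand-ins for learned weights, and the feature dimensions (512 for ResNet-like visual features, 128 for VGGish-like audio features) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio, d_k=64, seed=0):
    """Visual features attend over audio features via query-key-value attention.

    visual: (T, D_v) frame-level visual features (e.g. from a ResNet)
    audio:  (T, D_a) frame-level audio features (e.g. VGGish or LogMel)
    Returns a (T, d_k) audio-informed visual representation.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical projections; in the paper these would be learned.
    W_q = rng.standard_normal((visual.shape[1], d_k)) / np.sqrt(visual.shape[1])
    W_k = rng.standard_normal((audio.shape[1], d_k)) / np.sqrt(audio.shape[1])
    W_v = rng.standard_normal((audio.shape[1], d_k)) / np.sqrt(audio.shape[1])

    Q, K, V = visual @ W_q, audio @ W_k, audio @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) visual-to-audio weights
    return attn @ V

# Concatenate the attended features with the originals; a regression
# head would then map the fused vector to (valence, arousal).
T = 8
visual = np.random.default_rng(1).standard_normal((T, 512))  # ResNet-like
audio = np.random.default_rng(2).standard_normal((T, 128))   # VGGish-like
fused = np.concatenate([visual, cross_modal_attention(visual, audio)], axis=1)
print(fused.shape)  # (8, 576)
```

The paper applies TCNs for temporal modeling before this attention step; the sketch omits that stage and shows only the query-key-value interaction.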
Related papers
- Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment [0.39945675027960637]
Predicting personality traits automatically has become a challenging problem in computer vision.
This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips.
arXiv Detail & Related papers (2025-04-15T14:26:12Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts [8.809586885539002]
We propose a novel approach utilizing audio-visual multimodal data.
This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network.
Our method notably improves the accuracy of AU detection by understanding the temporal and contextual nuances of the data, showcasing significant advancements in the comprehension of intricate scenarios.
arXiv Detail & Related papers (2024-03-20T15:37:19Z) - Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation [9.93719767430551]
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition.
We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features.
We employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability.
arXiv Detail & Related papers (2024-03-19T04:25:54Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z) - Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
The most effective emotion recognition techniques efficiently leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities.
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-11-09T16:01:56Z) - With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild [86.40973759048957]
This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge.
In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal.
We evaluated the $M^3$T framework on the validation set provided by ABAW and it significantly outperforms the baseline method.
arXiv Detail & Related papers (2020-02-07T18:53:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.