Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction
- URL: http://arxiv.org/abs/2403.11879v4
- Date: Sun, 16 Jun 2024 12:21:39 GMT
- Title: Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction
- Authors: Tobias Hallmen, Fabian Deuser, Norbert Oswald, Elisabeth André
- Abstract summary: We introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild.
Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset.
We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector.
- Score: 6.1058750788332325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector, thereby embedding a broader contextual understanding into our analysis. A key aspect of our approach is the multi-task fusion strategy that not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by concurrently processing multiple emotional dimensions, thereby embedding a richer contextual understanding into our framework. For the temporal analysis of audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked advancements over the existing baseline, offering a more comprehensive understanding of emotional mimicry in naturalistic settings and achieving second place in the EMI challenge.
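The pipeline described above lends itself to a compact sketch. The fragment below is a minimal illustration, not the authors' released code: it assumes pre-extracted frame-level Wav2Vec 2.0 embeddings, fuses each frame with the utterance-level mean vector, runs an LSTM over time, and attaches two task heads (EMI intensities and VAD). All layer sizes and the number of emotion targets are assumptions.

```python
import torch
import torch.nn as nn

class EMIFusionModel(nn.Module):
    """Hedged sketch: global-mean fusion + LSTM + multi-task heads (EMI, VAD)."""

    def __init__(self, feat_dim=1024, hidden_dim=256, num_emotions=6):
        super().__init__()
        # Each frame embedding is concatenated with the global mean vector,
        # so the LSTM input is twice the feature dimension.
        self.lstm = nn.LSTM(feat_dim * 2, hidden_dim, batch_first=True)
        self.emi_head = nn.Linear(hidden_dim, num_emotions)  # mimicry intensities
        self.vad_head = nn.Linear(hidden_dim, 3)              # valence, arousal, dominance

    def forward(self, feats):
        # feats: (batch, time, feat_dim) frame-level Wav2Vec 2.0 embeddings
        global_mean = feats.mean(dim=1, keepdim=True)               # (batch, 1, feat_dim)
        fused = torch.cat([feats, global_mean.expand_as(feats)], dim=-1)
        out, _ = self.lstm(fused)                                   # temporal modelling
        pooled = out.mean(dim=1)                                    # pool over time
        return self.emi_head(pooled), self.vad_head(pooled)
```

In this sketch, sharing the LSTM trunk between the EMI and VAD heads is what makes the prediction multi-task; the paper additionally incorporates a pre-trained VAD model rather than learning the VAD dimensions from scratch.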
Related papers
- Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition [16.97833694961584]
Foal-Net is designed to enhance the effectiveness of modality fusion.
It includes two auxiliary tasks: audio-video emotion alignment and cross-modal emotion label matching.
Experiments show that Foal-Net outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-18T11:05:21Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts [8.809586885539002]
We propose a novel approach utilizing audio-visual multimodal data.
This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network.
Our method notably improves the accuracy of AU detection by understanding the temporal and contextual nuances of the data, showcasing significant advancements in the comprehension of intricate scenarios.
arXiv Detail & Related papers (2024-03-20T15:37:19Z) - MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
Learning [120.95150400119705]
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
arXiv Detail & Related papers (2023-11-29T08:27:00Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Revisiting Disentanglement and Fusion on Modality and Context in
Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z) - A Low-rank Matching Attention based Cross-modal Feature Fusion Method for Conversational Emotion Recognition [54.44337276044968]
We introduce a novel and lightweight cross-modal feature fusion method called the Low-Rank Matching Attention Method (LMAM).
LMAM effectively captures contextual emotional semantic information in conversations while mitigating the quadratic complexity of the self-attention mechanism.
Experimental results verify that LMAM outperforms other popular cross-modal fusion methods while remaining more lightweight; a generic low-rank attention sketch follows this list.
arXiv Detail & Related papers (2023-06-16T16:02:44Z) - DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations [2.8038382295783943]
The choice of words and vocal cues in conversations presents a rich, underexplored source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level features including verbal, vocal, and text as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
arXiv Detail & Related papers (2022-06-23T16:45:50Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Multistage linguistic conditioning of convolutional layers for speech emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
arXiv Detail & Related papers (2021-10-13T11:28:04Z)
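As referenced in the LMAM entry above, the fragment below is a generic, hedged sketch of low-rank cross-modal attention, not that paper's actual formulation: the bilinear matching between two modalities is computed through two rank-r projections, so the score computation scales with r rather than the full feature dimension. The dimensions, rank, and modality names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankCrossModalAttention(nn.Module):
    """Illustrative low-rank cross-modal attention (rank-r bilinear matching)."""

    def __init__(self, dim=256, rank=16):
        super().__init__()
        self.U = nn.Linear(dim, rank, bias=False)  # projects text features into a rank-r space
        self.V = nn.Linear(dim, rank, bias=False)  # projects audio features into the same space

    def forward(self, text_feats, audio_feats):
        # text_feats, audio_feats: (batch, seq_len, dim)
        # Matching scores are computed in the low-rank space, costing O(seq^2 * r)
        # per score matrix instead of O(seq^2 * dim) for a full bilinear form.
        scores = self.U(text_feats) @ self.V(audio_feats).transpose(1, 2)
        attn = scores.softmax(dim=-1)
        return attn @ audio_feats                  # text positions attend over audio positions
```

Compared with a full bilinear form, the projection parameters drop from dim * dim to 2 * dim * rank, which is the kind of lightweight trade-off the LMAM entry highlights.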
This list is automatically generated from the titles and abstracts of the papers on this site.