Multimodal Feature Extraction and Fusion for Emotional Reaction
Intensity Estimation and Expression Classification in Videos with
Transformers
- URL: http://arxiv.org/abs/2303.09164v2
- Date: Fri, 14 Apr 2023 12:56:58 GMT
- Title: Multimodal Feature Extraction and Fusion for Emotional Reaction
Intensity Estimation and Expression Classification in Videos with
Transformers
- Authors: Jia Li, Yin Chen, Xuesong Zhang, Jiantao Nie, Ziqiang Li, Yangchen Yu,
Yan Zhang, Richang Hong, Meng Wang
- Abstract summary: We present our solutions to the two sub-challenges of Affective Behavior Analysis in the wild (ABAW) 2023.
For the Expression Classification Challenge, we propose a streamlined approach that handles the challenges of classification effectively.
By studying, analyzing, and combining multimodal audio and video features, we significantly enhance the model's accuracy for sentiment prediction.
- Score: 47.16005553291036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our advanced solutions to the two sub-challenges of
Affective Behavior Analysis in the wild (ABAW) 2023: the Emotional Reaction
Intensity (ERI) Estimation Challenge and Expression (Expr) Classification
Challenge. ABAW 2023 aims to tackle the challenge of affective behavior
analysis in natural contexts, with the ultimate goal of creating intelligent
machines and robots that possess the ability to comprehend human emotions,
feelings, and behaviors. For the Expression Classification Challenge, we
propose a streamlined approach that handles the challenges of classification
effectively. However, our main contribution lies in our use of diverse models
and tools to extract multimodal features such as audio and video cues from the
Hume-Reaction dataset. By studying, analyzing, and combining these features, we
significantly enhance the model's accuracy for sentiment prediction in a
multimodal context. Furthermore, our method achieves outstanding results on the
Emotional Reaction Intensity (ERI) Estimation Challenge, surpassing the
baseline method by an impressive 84% increase, as measured by the Pearson correlation coefficient, on the validation dataset.
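As a concrete illustration of the pipeline sketched above, the snippet below shows one plausible way to combine per-frame audio and visual features with a small Transformer encoder and to score predictions with the mean Pearson correlation. This is a hypothetical sketch, not the authors' implementation: the feature dimensions, network sizes, and the assumption of seven reaction classes are illustrative choices.

```python
# Hypothetical sketch (not the authors' code): concatenation-based fusion of audio
# and visual feature sequences, followed by a Transformer encoder and a regression
# head that predicts per-class emotional reaction intensities.
import numpy as np
import torch
import torch.nn as nn


class FusionRegressor(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, d_model=256, num_reactions=7):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_reactions)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        x = torch.cat([audio_feats, visual_feats], dim=-1)  # frame-level concatenation
        x = self.encoder(self.proj(x))                       # temporal modelling
        return torch.sigmoid(self.head(x.mean(dim=1)))       # intensities in [0, 1]


def mean_pearson(preds: np.ndarray, targets: np.ndarray) -> float:
    """Pearson correlation averaged over reaction dimensions; inputs are (N, C)."""
    return float(np.mean([np.corrcoef(preds[:, c], targets[:, c])[0, 1]
                          for c in range(preds.shape[1])]))
```

Concatenation-based late fusion of this kind is a common baseline; the paper's actual feature extractors and fusion design are detailed in the full text.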
Related papers
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z)
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
- Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences [4.740624855896404]
We propose a contrastive learning framework utilizing selective strong augmentation for self-supervised gait-based emotion representation.
Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
arXiv Detail & Related papers (2024-05-08T09:13:10Z)
- Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations [15.705757672984662]
Multimodal Emotion Recognition in Conversations (MERC) is a significant development direction for machine intelligence.
Much of the data in MERC naturally exhibits an imbalanced distribution of emotion categories, and researchers often ignore the negative impact of imbalanced data on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
We have conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL achieves a clear performance improvement in emotion recognition.
arXiv Detail & Related papers (2023-12-11T12:35:17Z)
- A Dual Branch Network for Emotional Reaction Intensity Estimation [12.677143408225167]
We propose a solution to the ERI challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) competition: a dual-branch, multi-output regression model.
Spatial attention is used to better extract visual features, while Mel-Frequency Cepstral Coefficients (MFCCs) capture acoustic features (a minimal extraction sketch follows this list).
Our method achieves excellent results on the official validation set.
arXiv Detail & Related papers (2023-03-16T10:31:40Z)
- Leveraging TCN and Transformer for effective visual-audio fusion in continuous emotion recognition [0.5370906227996627]
We present our approach to the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge.
We propose a novel multi-modal fusion model that leverages Temporal Convolutional Networks (TCN) and Transformer to enhance the performance of continuous emotion recognition.
arXiv Detail & Related papers (2023-03-15T04:15:57Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Affective Expression Analysis in-the-wild using Multi-Task Temporal Statistical Deep Learning Model [6.024865915538501]
We present an affective expression analysis model that addresses the challenges of in-the-wild affective behavior analysis.
We experimented on the Aff-Wild2 dataset, a large-scale dataset for the ABAW Challenge.
arXiv Detail & Related papers (2020-02-21T04:06:03Z)
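To make the acoustic-feature step mentioned in the Dual Branch Network entry concrete, here is a minimal MFCC extraction sketch using librosa. The sampling rate, number of coefficients, and the helper function name are illustrative assumptions, not details from that paper.

```python
# Illustrative only: MFCC acoustic-feature extraction of the kind referenced in the
# "Dual Branch Network" entry; sampling rate and coefficient count are assumptions.
import librosa
import numpy as np


def extract_mfcc(audio_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames, n_mfcc) matrix of MFCC features for one audio clip."""
    waveform, sr = librosa.load(audio_path, sr=sr)                 # load and resample
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.T                                                  # time-major layout
```

The resulting time-major matrix can be fed to a sequence model such as the fusion sketch shown after the abstract above.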