Emotional Reaction Intensity Estimation Based on Multimodal Data
- URL: http://arxiv.org/abs/2303.09167v1
- Date: Thu, 16 Mar 2023 09:14:47 GMT
- Title: Emotional Reaction Intensity Estimation Based on Multimodal Data
- Authors: Shangfei Wang, Jiaqiang Wu, Feiyi Zheng, Xin Li, Xuewei Li, Suwen
Wang, Yi Wu, Yanan Chang, Xiangyu Miao
- Abstract summary: This paper introduces our method for the Emotional Reaction Intensity (ERI) Estimation Challenge.
Based on the multimodal data provided by the organizers, we extract acoustic and visual features with different pretrained models.
- Score: 24.353102762289545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces our method for the Emotional Reaction Intensity (ERI)
Estimation Challenge at the CVPR 2023 5th Workshop and Competition on Affective
Behavior Analysis in-the-wild (ABAW). Based on the multimodal data provided by
the organizers, we extract acoustic and visual features with different
pretrained models. The multimodal features are fused by Transformer Encoders
with a cross-modal attention mechanism. Our contributions are threefold:
(1) we extract stronger features with state-of-the-art pretrained models;
(2) we substantially improve the Pearson Correlation Coefficient over the
baseline; and (3) we apply several data-processing techniques to further
improve the performance of our model.
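Below is a minimal, hypothetical PyTorch sketch of the kind of fusion the abstract describes: precomputed acoustic and visual feature sequences are passed through Transformer encoders and combined with cross-modal attention before a regression head. The feature dimensions, layer counts, pooling strategy, and seven-way output head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): fuse pretrained acoustic and visual
# feature sequences with Transformer encoders plus cross-modal attention,
# then regress emotional-reaction intensities. Dimensions are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, aud_dim=1024, vis_dim=768, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.aud_proj = nn.Linear(aud_dim, d_model)   # project acoustic features
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project visual features
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.aud_enc = nn.TransformerEncoder(enc_layer, n_layers)
        self.vis_enc = nn.TransformerEncoder(enc_layer, n_layers)
        # cross-modal attention: each modality attends to the other
        self.aud2vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis2aud = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 7)         # 7 reaction intensity scores (assumed)

    def forward(self, aud, vis):
        # aud: (B, T_a, aud_dim), vis: (B, T_v, vis_dim)
        a = self.aud_enc(self.aud_proj(aud))
        v = self.vis_enc(self.vis_proj(vis))
        a2v, _ = self.aud2vis(a, v, v)                # audio queries attend to video
        v2a, _ = self.vis2aud(v, a, a)                # video queries attend to audio
        pooled = torch.cat([a2v.mean(dim=1), v2a.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(pooled))       # intensities in [0, 1]


# usage with random tensors standing in for pretrained-model features
model = CrossModalFusion()
scores = model(torch.randn(2, 50, 1024), torch.randn(2, 50, 768))
print(scores.shape)  # torch.Size([2, 7])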
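```

In practice, the two inputs would be frame-level embeddings produced by the pretrained acoustic and visual models the abstract mentions, aligned or padded to a common clip length before fusion.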
Related papers
- Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders [6.7181844004432385]
The Inter-Intra Modal Measure (IIMM) functions as a strong predictor of performance changes with fine-tuning.
Fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation.
With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning.
arXiv Detail & Related papers (2024-07-22T15:35:09Z) - Self-Supervised Modality-Agnostic Pre-Training of Swin Transformers [0.7496510641958004]
We augment the Swin Transformer to learn from different medical imaging modalities, enhancing downstream performance.
Our model, dubbed SwinFUSE, offers three key advantages: (i) it learns from both Computed Tomography (CT) and Magnetic Resonance Images (MRI) during pre-training, resulting in complementary feature representations; (ii) it includes a domain-invariance module (DIM) that effectively highlights salient input regions, enhancing adaptability; (iii) it exhibits remarkable generalizability, surpassing the confines of the tasks it was initially pre-trained on.
arXiv Detail & Related papers (2024-05-21T13:28:32Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - MPRE: Multi-perspective Patient Representation Extractor for Disease Prediction [3.914545513460964]
We propose the Multi-perspective Patient Representation Extractor (MPRE) for disease prediction.
Specifically, we propose Frequency Transformation Module (FTM) to extract the trend and variation information of dynamic features.
In the 2D Multi-Extraction Network (2D MEN), we form the 2D temporal tensor based on trend and variation.
We also propose the First-Order Difference Attention Mechanism (FODAM) to calculate the contributions of differences in adjacent variations to the disease diagnosis.
arXiv Detail & Related papers (2024-01-01T13:52:05Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - A Dual Branch Network for Emotional Reaction Intensity Estimation [12.677143408225167]
We propose a solution to the ERI challenge of the fifth Affective Behavior Analysis in-the-wild (ABAW) competition: a dual-branch, multi-output regression model.
Spatial attention is used to better extract visual features, and Mel-Frequency Cepstral Coefficients (MFCCs) are used to extract acoustic features (see the sketch after this list).
Our method achieves excellent results on the official validation set.
arXiv Detail & Related papers (2023-03-16T10:31:40Z) - MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
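As referenced in the Dual Branch Network entry above, MFCC-based acoustic features are a common front end for this kind of model. The following is a minimal sketch of extracting such features; librosa, the 16 kHz mono clip, and the 40 coefficients are illustrative assumptions rather than details taken from any of the listed papers.

```python
# Minimal sketch of MFCC acoustic-feature extraction (not code from the
# listed papers); library choice, sample rate, and coefficient count are
# illustrative assumptions.
import librosa

# load audio as mono at 16 kHz (hypothetical file path)
waveform, sr = librosa.load("clip.wav", sr=16000, mono=True)

# 40 MFCCs per frame; result has shape (n_mfcc, n_frames)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=40)

# transpose to (n_frames, n_mfcc) so each row is one time step,
# ready to feed a sequence model such as a Transformer encoder
features = mfcc.T
print(features.shape)
```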