Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
- URL: http://arxiv.org/abs/2409.05007v1
- Date: Sun, 8 Sep 2024 07:28:27 GMT
- Title: Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
- Authors: Pujin Shi, Fei Gao
- Abstract summary: We propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024.
We fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data.
We also propose an Audio-Guided Transformer (AGT) fusion mechanism, showing superior effectiveness in fusing both inter-channel and intra-channel information.
- Score: 2.7013910991626213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
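The abstract describes the AGT mechanism only at a high level. As a rough, hedged illustration of what an audio-guided fusion step could look like, the sketch below lets audio tokens (e.g. Hubert-large frames) act as cross-attention queries over concatenated video and text tokens; the module layout, dimensions, and classification head are assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Illustrative audio-guided cross-attention block (not the paper's exact AGT).

    Audio tokens act as queries; video and text tokens are concatenated and
    serve as keys/values, so the audio channel steers the fusion.
    """

    def __init__(self, dim=1024, num_heads=8, num_classes=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, video, text):
        # audio: (B, Ta, D); video: (B, Tv, D); text: (B, Tt, D), all projected to a shared dim
        context = torch.cat([video, text], dim=1)                # keys/values from the other channels
        attended, _ = self.cross_attn(audio, context, context)   # audio queries the context
        x = self.norm1(audio + attended)                         # residual + norm
        x = self.norm2(x + self.ffn(x))
        return self.classifier(x.mean(dim=1))                    # pool over time, predict emotion logits


# Toy usage with random features standing in for Hubert/CLIP/Baichuan outputs.
model = AudioGuidedFusion()
logits = model(torch.randn(2, 50, 1024), torch.randn(2, 16, 1024), torch.randn(2, 32, 1024))
print(logits.shape)  # torch.Size([2, 6])
```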
Related papers
- Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples [18.29910296652917]
We present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI).
This challenge tackles the issue of limited annotated data in emotion recognition.
Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard.
arXiv Detail & Related papers (2024-08-23T11:33:54Z)
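Both the main paper and the entry above rely on retaining only high-confidence predictions on unlabeled data as pseudo-labels. The sketch below shows one generic round of that filtering step; the 0.9 threshold, model interface, and feature-only batches are illustrative assumptions rather than either paper's recipe.
```python
import torch
import torch.nn.functional as F

def pseudo_label_round(model, unlabeled_loader, threshold=0.9, device="cpu"):
    """One generic self-training round: keep only predictions on unlabeled data
    whose confidence exceeds `threshold` (an assumed hyperparameter) and return
    them as pseudo-labeled examples for the next training round."""
    model.eval()
    kept_feats, kept_labels = [], []
    with torch.no_grad():
        for feats in unlabeled_loader:                  # each batch: precomputed features (B, D)
            feats = feats.to(device)
            probs = F.softmax(model(feats), dim=-1)
            conf, pred = probs.max(dim=-1)
            mask = conf >= threshold                    # confidence filter
            kept_feats.append(feats[mask].cpu())
            kept_labels.append(pred[mask].cpu())
    return torch.cat(kept_feats), torch.cat(kept_labels)  # merge with labeled data, then retrain
```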
- SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition [65.19303535139453]
We present our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition.
Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples.
For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V.
arXiv Detail & Related papers (2024-08-20T02:46:03Z)
- MERGE -- A Bimodal Dataset for Static Music Emotion Recognition [0.5339846068056558]
This article proposes three new audio, lyrics, and bimodal Music Emotion Recognition research datasets, collectively called MERGE, created using a semi-automatic approach.
The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
arXiv Detail & Related papers (2024-07-08T16:01:04Z)
- The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023 [11.64675515432159]
We employ a multimodal fusion approach to combine visual and audio features.
High-quality visual features are extracted using a state-of-the-art self-supervised pre-training network.
At the same time, audio features serve as complementary information to help the model better localize the start and end of sounds.
arXiv Detail & Related papers (2024-07-01T12:52:05Z)
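As a hedged illustration of the frame-level audio-visual fusion such a localisation system might use (not the challenge entry's actual code), the sketch below concatenates time-aligned visual and audio features and predicts per-frame start/end scores; the dimensions and two-logit head are assumptions.
```python
import torch
import torch.nn as nn

class FrameLevelLocaliser(nn.Module):
    """Illustrative fusion head: time-aligned visual + audio features -> per-frame start/end scores."""

    def __init__(self, vis_dim=768, aud_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),                  # per-frame logits: [start, end]
        )

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim), already aligned in time
        fused = torch.cat([vis, aud], dim=-1)      # simple late concatenation
        return self.mlp(fused)                     # (B, T, 2)


scores = FrameLevelLocaliser()(torch.randn(1, 100, 768), torch.randn(1, 100, 512))
print(scores.shape)  # torch.Size([1, 100, 2])
```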
- Leveraging Large Language Models for Enhanced NLP Task Performance through Knowledge Distillation and Optimized Training Strategies [0.8704964543257245]
This study explores a three-phase training strategy that harnesses GPT-4's capabilities to enhance the BERT model's performance on NER.
We train BERT using a mix of original and LLM-annotated data, analyzing the efficacy of LLM annotations against traditional methods.
Our results indicate that a strategic mix of distilled and original data markedly elevates the NER capabilities of BERT.
arXiv Detail & Related papers (2024-02-14T16:10:45Z)
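A minimal sketch of the data-mixing idea behind the entry above, combining gold and LLM-annotated NER examples into one training set; the example format and the `llm_fraction` knob are assumptions, not the study's protocol.
```python
import random

def mix_training_data(gold, llm_annotated, llm_fraction=0.5, seed=0):
    """Combine human-labeled and LLM-annotated NER examples into one training set.

    `gold` and `llm_annotated` are lists of (tokens, tags) pairs; `llm_fraction`
    caps how much LLM-annotated data is added relative to the gold set
    (an assumed knob, not a value from the study).
    """
    rng = random.Random(seed)
    budget = min(int(len(gold) * llm_fraction), len(llm_annotated))
    mixed = gold + rng.sample(llm_annotated, budget)
    rng.shuffle(mixed)
    return mixed


# Toy usage with two tiny "datasets".
gold = [(["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"])]
silver = [(["Acme", "hired", "Mary"], ["B-ORG", "O", "B-PER"])]
print(len(mix_training_data(gold, silver, llm_fraction=1.0)))  # 2
```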
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
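A compact, hedged sketch in the spirit of the multi-layer cross-attention fusion described above: the audio and visual streams exchange information after each encoder block instead of only once at the end. Layer counts, dimensions, and the symmetric design are assumptions for illustration.
```python
import torch
import torch.nn as nn

class MultiLayerCrossAttentionFusion(nn.Module):
    """Illustrative multi-level fusion: the audio and visual streams exchange
    information via cross-attention after every encoder block."""

    def __init__(self, dim=256, num_heads=4, num_layers=3):
        super().__init__()
        self.audio_enc = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.visual_enc = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.audio_from_visual = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.visual_from_audio = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])

    def forward(self, audio, visual):
        # audio: (B, Ta, D), visual: (B, Tv, D)
        for enc_a, enc_v, ca_a, ca_v in zip(
                self.audio_enc, self.visual_enc, self.audio_from_visual, self.visual_from_audio):
            audio, visual = enc_a(audio), enc_v(visual)
            audio = audio + ca_a(audio, visual, visual)[0]   # audio queries visual
            visual = visual + ca_v(visual, audio, audio)[0]  # visual queries the updated audio
        return audio, visual


a_out, v_out = MultiLayerCrossAttentionFusion()(torch.randn(2, 40, 256), torch.randn(2, 25, 256))
print(a_out.shape, v_out.shape)  # torch.Size([2, 40, 256]) torch.Size([2, 25, 256])
```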
- Meta-Wrapper: Differentiable Wrapping Operator for User Interest Selection in CTR Prediction [97.99938802797377]
Click-through rate (CTR) prediction, whose goal is to predict the probability that a user will click on an item, has become increasingly significant in recommender systems.
Recent deep learning models that automatically extract user interests from user behaviors have achieved great success.
We propose a novel approach under the framework of the wrapper method, which is named Meta-Wrapper.
arXiv Detail & Related papers (2022-06-28T03:28:15Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to select those that provide the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
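A hedged sketch of the bi-bimodal idea described above: two bimodal pairs (here assumed to be text-audio and text-vision) are fused separately and then combined; the cross-attention branches and regression head are illustrative, not the BBFN architecture.
```python
import torch
import torch.nn as nn

class BimodalPairFusion(nn.Module):
    """One illustrative bimodal branch: modality `a` queries modality `b` via cross-attention."""

    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a, b):
        fused, _ = self.attn(a, b, b)
        return self.norm(a + fused).mean(dim=1)     # pooled pair representation (B, dim)


class TwoPairSentimentModel(nn.Module):
    """Illustrative bi-bimodal setup: text-audio and text-vision pairs fused separately,
    then concatenated for a sentiment score (an assumed head, not the BBFN design)."""

    def __init__(self, dim=128):
        super().__init__()
        self.text_audio = BimodalPairFusion(dim)
        self.text_vision = BimodalPairFusion(dim)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, text, audio, vision):
        pairs = torch.cat([self.text_audio(text, audio), self.text_vision(text, vision)], dim=-1)
        return self.head(pairs)                     # (B, 1) sentiment intensity


score = TwoPairSentimentModel()(torch.randn(2, 20, 128), torch.randn(2, 60, 128), torch.randn(2, 30, 128))
print(score.shape)  # torch.Size([2, 1])
```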
- Optimizing Speech Emotion Recognition using Manta-Ray Based Feature Selection [1.4502611532302039]
We show that concatenating features extracted with different existing feature extraction methods can boost classification accuracy.
We also present a novel application of Manta Ray optimization to speech emotion recognition, achieving state-of-the-art results.
arXiv Detail & Related papers (2020-09-18T16:09:34Z)
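A minimal sketch of the feature-concatenation step the entry above describes, stacking summary statistics from several standard speech descriptors into one vector; the specific descriptors are assumptions, and a selection method such as Manta Ray optimization would then keep only a subset of the resulting dimensions.
```python
import numpy as np
import librosa

def concatenated_features(path, sr=16000):
    """Illustrative feature concatenation for speech emotion recognition:
    stack summary statistics of several standard descriptors into one vector.
    The chosen descriptors are assumptions; a selection step (e.g. Manta Ray
    optimization) would then retain only the most useful dimensions."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # (12, frames)
    zcr = librosa.feature.zero_crossing_rate(y)             # (1, frames)
    parts = [mfcc, chroma, zcr]
    # Mean and standard deviation per coefficient -> a fixed-length 52-dim vector.
    return np.concatenate([np.r_[p.mean(axis=1), p.std(axis=1)] for p in parts])
```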
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for distinguishing real from deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
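A hedged sketch of the modality-agreement idea in the entry above: compare audio and visual (emotion) embeddings from the same video and treat low similarity as a cue for a possible fake. The embeddings and any decision threshold are assumptions, not the paper's trained models.
```python
import torch
import torch.nn.functional as F

def affective_mismatch_score(audio_emb, visual_emb):
    """Illustrative scoring: compare audio and visual (emotion) embeddings from the
    same video; a low cosine similarity suggests a modality mismatch and hence a
    possible fake. Embeddings and thresholds are assumed, not the paper's models."""
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (B,)
    return 1.0 - sim                                          # higher = more suspicious


# Toy usage with random unit-normalised embeddings standing in for learned ones.
a = F.normalize(torch.randn(4, 256), dim=-1)
v = F.normalize(torch.randn(4, 256), dim=-1)
print(affective_mismatch_score(a, v))
```
In practice such a score would be fed to a threshold or classifier calibrated on matched real/fake pairs rather than used directly.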