SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2408.10500v2
- Date: Wed, 21 Aug 2024 18:58:26 GMT
- Title: SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
- Authors: Zebang Cheng, Shuyuan Tu, Dawei Huang, Minghan Li, Xiaojiang Peng, Zhi-Qi Cheng, Alexander G. Hauptmann
- Abstract summary: We present our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition.
Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples.
For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V.
- Score: 65.19303535139453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition. Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples, addressing the challenge of limited labeled data. To enhance multimodal fusion while mitigating modality-specific noise, we introduce Conv-Attention, a lightweight and efficient hybrid framework. Extensive experimentation validates the effectiveness of our approach. In the MER-NOISE track, our system achieves a state-of-the-art weighted average F-score of 85.30%, surpassing the second and third-place teams by 1.47% and 1.65%, respectively. For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V, securing the highest score among all participating large multimodal models. The code and model for Emotion-LLaMA are available at https://github.com/ZebangCheng/Emotion-LLaMA.
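As a rough illustration of the Conv-Attention idea (a lightweight hybrid of convolution and attention for multimodal fusion), the sketch below pairs a depthwise convolutional branch with a self-attention branch over modality feature tokens. The layer sizes, residual-sum fusion, and mean pooling are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class ConvAttentionFusion(nn.Module):
    """Hypothetical hybrid fusion block: a local convolutional branch plus a
    global self-attention branch over a sequence of modality feature tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        # Local branch: depthwise 1-D convolution captures nearby-token patterns.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # Global branch: multi-head self-attention models cross-token interactions.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. one token per modality feature.
        local = self.conv(tokens.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + local + global_)  # residual sum of both branches

# Toy usage: fuse audio, visual, and text feature vectors (one token each).
if __name__ == "__main__":
    feats = torch.randn(2, 3, 256)                    # (batch, modalities, dim)
    fused = ConvAttentionFusion()(feats).mean(dim=1)  # pooled multimodal embedding
    print(fused.shape)                                # torch.Size([2, 256])
```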
Related papers
- Audio-Guided Fusion Techniques for Multimodal Emotion Analysis [2.7013910991626213]
We propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024.
We fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data.
We also propose an Audio-Guided Transformer (AGT) fusion mechanism, showing superior effectiveness in fusing both inter-channel and intra-channel information.
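A minimal sketch of what an audio-guided fusion step could look like, assuming audio tokens serve as attention queries over another channel's tokens (inter-channel) followed by self-attention refinement (intra-channel); the dimensions, layer count, and residual/LayerNorm arrangement are illustrative assumptions, not the AGT architecture itself.

```python
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Hypothetical audio-guided cross-attention block: audio tokens query the
    other channel's tokens, then the result is refined with self-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, dim) queries; other: (B, To, dim) video or text tokens.
        inter, _ = self.cross_attn(audio, other, other)  # inter-channel fusion
        x = self.norm1(audio + inter)
        intra, _ = self.self_attn(x, x, x)               # intra-channel refinement
        return self.norm2(x + intra)
```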
arXiv Detail & Related papers (2024-09-08T07:28:27Z)
- Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples [18.29910296652917]
We present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI)
This challenge tackles the issue of limited annotated data in emotion recognition.
Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard.
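A minimal self-training sketch under the usual pseudo-labeling assumptions (a model trained on labeled data assigns labels to confident unlabeled samples, which are then added to the training pool); the confidence threshold and function names are hypothetical, not the submission's actual pipeline.

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold: float = 0.9):
    """Keep unlabeled samples whose predicted emotion probability exceeds a
    confidence threshold, returning (feature, pseudo-label) pairs for retraining."""
    model.eval()
    selected = []
    for features in unlabeled_loader:               # features: (B, ...) batch tensor
        probs = torch.softmax(model(features), dim=-1)
        conf, labels = probs.max(dim=-1)
        keep = conf >= threshold                    # confidence filter
        selected += list(zip(features[keep], labels[keep]))
    return selected
```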
arXiv Detail & Related papers (2024-08-23T11:33:54Z)
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
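One common way such integration is wired up, shown here as a hedged sketch: modality encoder outputs are projected into the language model's embedding space and prepended to the text embeddings as a multimodal prefix. The feature dimensions and the single-token-per-modality simplification are assumptions, not the Emotion-LLaMA implementation.

```python
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    """Hypothetical bridge from modality encoders to an LLM: audio and visual
    features are linearly projected into the LLM embedding space and prepended
    to the text token embeddings."""

    def __init__(self, audio_dim: int = 1024, visual_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feat, visual_feat, text_embeds):
        # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim),
        # text_embeds: (B, T, llm_dim) from the LLM's embedding layer.
        a = self.audio_proj(audio_feat).unsqueeze(1)    # (B, 1, llm_dim)
        v = self.visual_proj(visual_feat).unsqueeze(1)  # (B, 1, llm_dim)
        return torch.cat([a, v, text_embeds], dim=1)    # multimodal input sequence
```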
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
- MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition [102.76954967225231]
We organize the MER series of competitions to promote the development of this field.
Last year, we launched MER2023, focusing on three interesting topics: multi-label learning, noise robustness, and semi-supervised learning.
This year, besides expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition.
arXiv Detail & Related papers (2024-04-26T02:05:20Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
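A hedged sketch of cross-attention fusion between two modality streams, in the spirit of the cross-attention fusion described above; the bidirectional query/key pairing, pooling, and dimensions are illustrative assumptions rather than the JMT architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Hypothetical cross-attention fusion: each stream queries the other
    stream's keys/values, and the two refined streams are concatenated."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        a, _ = self.a2v(audio, visual, visual)  # audio queries attend to visual
        v, _ = self.v2a(visual, audio, audio)   # visual queries attend to audio
        return torch.cat([a.mean(1), v.mean(1)], dim=-1)  # pooled joint embedding
```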
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 [51.95161901441527]
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
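A hedged sketch of joint decoding of discrete and dimensional emotions: a shared fused representation feeds both a classification head and a regression head trained with a combined loss. The head shapes, the valence target, and the loss weighting are illustrative assumptions, not the paper's decoder.

```python
import torch
import torch.nn as nn

class JointEmotionHead(nn.Module):
    """Hypothetical joint decoder: one head predicts discrete emotion logits,
    the other regresses a dimensional score (e.g. valence)."""

    def __init__(self, dim: int = 512, num_classes: int = 6):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)  # discrete emotion logits
        self.reg_head = nn.Linear(dim, 1)            # dimensional (valence) score

    def forward(self, fused: torch.Tensor):
        return self.cls_head(fused), self.reg_head(fused).squeeze(-1)

def joint_loss(logits, valence_pred, labels, valence_true, alpha: float = 0.5):
    # Weighted sum of classification and regression objectives (alpha is assumed).
    return (nn.functional.cross_entropy(logits, labels)
            + alpha * nn.functional.mse_loss(valence_pred, valence_true))
```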
arXiv Detail & Related papers (2023-09-11T03:19:10Z)
- MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning [90.17500229142755]
The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia.
This paper introduces the motivation behind the challenge, describes the benchmark dataset, and provides statistics about the participants.
We believe this high-quality dataset can become a new benchmark in multimodal emotion recognition, especially for the Chinese research community.
arXiv Detail & Related papers (2023-04-18T13:23:42Z)
- Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition [24.115771176570824]
We propose to use pretrained models as upstream networks: wav2vec 2.0 for the audio modality and BERT for the text modality.
To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as a downstream fusion module.
We achieve improved performance of 78.42% Weighted Accuracy (WA) and 79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset.
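A minimal sketch of a K-layer multi-head attention fusion module over wav2vec 2.0 audio tokens and BERT text tokens, as described above; the concatenation strategy, mean pooling, and all hyperparameters are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class KLayerAttentionFusion(nn.Module):
    """Hypothetical K-layer fusion: concatenated audio and text token sequences
    pass through K Transformer encoder layers, then are pooled for classification."""

    def __init__(self, dim: int = 768, num_heads: int = 8, k: int = 4, num_classes: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=k)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (B, Ta, dim) from wav2vec 2.0; text_tokens: (B, Tt, dim) from BERT.
        fused = self.encoder(torch.cat([audio_tokens, text_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))  # utterance-level emotion logits
```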
arXiv Detail & Related papers (2023-02-27T10:59:08Z)