Curriculum Learning Meets Directed Acyclic Graph for Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2402.17269v2
- Date: Fri, 8 Mar 2024 06:00:12 GMT
- Title: Curriculum Learning Meets Directed Acyclic Graph for Multimodal Emotion Recognition
- Authors: Cam-Van Thi Nguyen, Cao-Bach Nguyen, Quang-Thuy Ha, Duc-Trong Le
- Abstract summary: MultiDAG+CL is a novel approach for Multimodal Emotion Recognition in Conversation (ERC).
The model is enhanced by Curriculum Learning (CL) to address challenges related to emotional shifts and data imbalance.
Experimental results on the IEMOCAP and MELD datasets demonstrate that the MultiDAG+CL models outperform baseline models.
- Score: 2.4660652494309936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition in conversation (ERC) is a crucial task in natural
language processing and affective computing. This paper proposes MultiDAG+CL, a
novel approach for multimodal ERC that employs a Directed Acyclic Graph (DAG) to
integrate textual, acoustic, and visual features within a unified framework. The
model is enhanced by Curriculum
Learning (CL) to address challenges related to emotional shifts and data
imbalance. Curriculum learning facilitates the learning process by gradually
presenting training samples in a meaningful order, thereby improving the
model's performance in handling emotional variations and data imbalance.
Experimental results on the IEMOCAP and MELD datasets demonstrate that the
MultiDAG+CL models outperform baseline models. We release the code for
MultiDAG+CL and experiments: https://github.com/vanntc711/MultiDAG-CL
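The abstract describes curriculum learning only at a high level. As one concrete illustration, a "baby steps" schedule can order conversations from easy to hard and release them gradually; the difficulty heuristic below (emotion-shift count plus emotion-class rarity) and the bucketed schedule are illustrative assumptions, not necessarily the exact criteria used by MultiDAG+CL.

```python
from collections import Counter
from typing import List, Sequence

def difficulty_score(labels: Sequence[int], class_freq: Counter) -> float:
    """Heuristic difficulty of one conversation: more emotion shifts and
    rarer emotion classes -> harder sample (illustrative assumption)."""
    shifts = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    rarity = sum(1.0 / class_freq[l] for l in labels) / max(len(labels), 1)
    return shifts + rarity

def curriculum_schedule(conversations: List[Sequence[int]], n_buckets: int = 5):
    """Sort conversations from easy to hard, then release them bucket by
    bucket ("baby steps"): epoch t trains on buckets 0..t."""
    class_freq = Counter(l for conv in conversations for l in conv)
    order = sorted(range(len(conversations)),
                   key=lambda i: difficulty_score(conversations[i], class_freq))
    bucket_size = max(1, len(order) // n_buckets)
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    for t in range(len(buckets)):
        yield [idx for b in buckets[:t + 1] for idx in b]

# Example: three toy conversations given as per-utterance emotion label ids.
convs = [[0, 0, 0], [0, 1, 0, 2], [2, 2, 1]]
for epoch, pool in enumerate(curriculum_schedule(convs, n_buckets=3)):
    print(f"epoch {epoch}: train on conversation indices {pool}")
```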
Related papers
- EEG-based Multimodal Representation Learning for Emotion Recognition [26.257531037300325]
We introduce a novel multimodal framework that accommodates not only conventional modalities such as video, images, and audio, but also incorporates EEG data.
Our framework is designed to flexibly handle varying input sizes, while dynamically adjusting attention to account for feature importance across modalities.
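The summary above mentions dynamically adjusting attention across modalities. A minimal sketch of attention-weighted fusion over pre-encoded modality embeddings (not the authors' exact architecture; it assumes each modality has already been reduced to a fixed-size vector):

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Fuse per-modality embeddings with learned attention weights
    (illustrative sketch, not the paper's exact model)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance score per modality

    def forward(self, modality_embs: torch.Tensor) -> torch.Tensor:
        # modality_embs: (batch, n_modalities, dim), e.g. [video, audio, EEG]
        weights = torch.softmax(self.score(modality_embs), dim=1)  # (B, M, 1)
        return (weights * modality_embs).sum(dim=1)                # (B, dim)

fusion = ModalityAttentionFusion(dim=128)
embs = torch.randn(4, 3, 128)  # batch of 4 samples, three modalities
print(fusion(embs).shape)      # torch.Size([4, 128])
```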
arXiv Detail & Related papers (2024-10-29T01:35:17Z)
- Textualized and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild [45.29814349246784]
Multimodal large language models (LLMs) rely on explicit non-verbal cues that may be translated from different non-textual modalities into text.
This paper compares the potential of text- and feature-based approaches for compound multimodal ER in videos.
arXiv Detail & Related papers (2024-07-17T18:01:25Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning [13.197551708300345]
We propose a two-stage emotion recognition model based on graph contrastive learning (TS-GCL).
We show that TS-GCL has superior performance on IEMOCAP and MELD datasets compared with previous methods.
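The summary names graph contrastive learning without detail; such setups typically use an InfoNCE-style loss over two augmented views of the same node or utterance embedding. The sketch below is that generic objective, not necessarily TS-GCL's exact loss:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE contrastive loss: z1[i] and z2[i] are embeddings of the same
    node under two graph augmentations (generic objective, not TS-GCL's)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0))  # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
print(info_nce(z1, z2))
```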
arXiv Detail & Related papers (2024-01-03T01:58:31Z)
- Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations [15.705757672984662]
Multimodal Emotion Recognition in Conversations (MERC) is a significant development direction for machine intelligence.
Data in MERC naturally exhibit an imbalanced distribution of emotion categories, yet researchers often overlook the negative impact of imbalanced data on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
We have conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL improves emotion recognition performance.
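CBERL's boundary-enhancement mechanism is not detailed in this summary. A common baseline for the imbalance problem it targets is inverse-frequency loss re-weighting, sketched here as a generic starting point rather than CBERL itself:

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Weight each emotion class by the inverse of its frequency so rare
    emotions contribute more to the loss (standard re-weighting, not CBERL)."""
    counts = torch.bincount(labels, minlength=n_classes).float().clamp(min=1)
    return counts.sum() / (n_classes * counts)

labels = torch.tensor([0, 0, 0, 0, 1, 2])  # heavily skewed toward class 0
criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(labels, 3))
logits = torch.randn(6, 3)
print(criterion(logits, labels))
```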
arXiv Detail & Related papers (2023-12-11T12:35:17Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
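A minimal sketch of the quantize-and-embed step described above, assuming a learned visual codebook and one shared embedding table for text and visual tokens (both are illustrative assumptions, not the paper's exact components):

```python
import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    """Quantize visual features to discrete codes via nearest-neighbour
    codebook lookup, then embed text ids and visual codes with one shared
    table so a decoder-only model can treat them uniformly (illustrative)."""
    def __init__(self, text_vocab: int, n_codes: int, feat_dim: int, emb_dim: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, feat_dim))
        self.embed = nn.Embedding(text_vocab + n_codes, emb_dim)
        self.text_vocab = text_vocab

    def forward(self, text_ids: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        b, t, d = visual_feats.shape
        dists = torch.cdist(visual_feats.reshape(b * t, d), self.codebook)
        visual_ids = dists.argmin(dim=-1).reshape(b, t) + self.text_vocab  # offset into shared vocab
        tokens = torch.cat([text_ids, visual_ids], dim=1)                  # one unified sequence
        return self.embed(tokens)

tok = UnifiedTokenizer(text_vocab=1000, n_codes=256, feat_dim=32, emb_dim=64)
out = tok(torch.randint(0, 1000, (2, 5)), torch.randn(2, 7, 32))
print(out.shape)  # torch.Size([2, 12, 64])
```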
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
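A minimal sketch of the prompt-based reformulation described above, using a generic masked language model from Hugging Face transformers; the prompt template and label words are assumptions, not MEmoBERT's actual templates or pre-trained weights:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Generic masked LM stand-in; MEmoBERT uses its own multimodal pre-training.
name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

utterance = "I can't believe we finally won the game!"
prompt = f"{utterance} I am feeling {tokenizer.mask_token}."  # assumed template
label_words = ["happy", "sad", "angry", "neutral"]            # assumed label words

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary

# Emotion classification becomes: which label word best fills the mask?
label_ids = [tokenizer.encode(" " + w, add_special_tokens=False)[0] for w in label_words]
print(label_words[int(torch.argmax(logits[label_ids]))])
```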
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
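The augmentation referred to above is in the spirit of SpecAugment-style masking; a minimal sketch with illustrative (not the paper's) mask sizes:

```python
import numpy as np

def spec_augment(spec: np.ndarray, n_freq_masks: int = 1, n_time_masks: int = 1,
                 max_f: int = 8, max_t: int = 20, rng=None) -> np.ndarray:
    """Randomly zero out frequency bands and time spans of a spectrogram
    (SpecAugment-style masking; parameter values are illustrative)."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f)))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t)))
        spec[:, t0:t0 + t] = 0.0
    return spec

mel = np.random.rand(64, 300)  # (mel bins, frames)
augmented = spec_augment(mel)
print(augmented.shape, int((augmented == 0).sum()), "masked cells")
```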
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Self-Supervised learning with cross-modal transformers for emotion recognition [20.973999078271483]
Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language.
In this work, we extend self-supervised training to multi-modal applications.
arXiv Detail & Related papers (2020-11-20T21:38:34Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
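Teacher-recommended learning, as summarized above, transfers linguistic knowledge from the external language model into the caption model via soft targets. The standard form of such soft-target distillation is a temperature-scaled KL term between teacher and student token distributions, sketched below as a generic example rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                     temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions
    (generic knowledge distillation; TRL's exact weighting may differ)."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

student = torch.randn(4, 10_000)  # caption model logits over its vocabulary
teacher = torch.randn(4, 10_000)  # external language model logits
print(soft_target_loss(student, teacher))
```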
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.