Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2310.04456v1
- Date: Wed, 4 Oct 2023 13:54:46 GMT
- Title: Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation
- Authors: Shihao Zou, Xianying Huang, Xudong Shen
- Abstract summary: multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cues extraction was performed on modalities with strong representation ability.
Feature filters were designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
- Score: 9.817888267356716
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Emotion Recognition in Conversation (ERC) plays an important role in driving
the development of human-machine interaction. Emotions can exist in multiple
modalities, and multimodal ERC mainly faces two problems: (1) the noise problem
in the cross-modal information fusion process, and (2) the prediction problem
of less sample emotion labels that are semantically similar but different
categories. To address these issues and fully utilize the features of each
modality, we adopted the following strategies: first, deep emotion cues
extraction was performed on modalities with strong representation ability, and
feature filters were designed as multimodal prompt information for modalities
with weak representation ability. Then, we designed a Multimodal Prompt
Transformer (MPT) to perform cross-modal information fusion. MPT embeds
multimodal fusion information into each attention layer of the Transformer,
allowing prompt information to participate in encoding textual features and
being fused with multi-level textual information to obtain better multimodal
fusion features. Finally, we used the Hybrid Contrastive Learning (HCL)
strategy to optimize the model's ability to handle labels with few samples.
This strategy uses unsupervised contrastive learning to improve the
representation ability of multimodal fusion and supervised contrastive learning
to mine the information of labels with few samples. Experimental results show
that our proposed model outperforms state-of-the-art models in ERC on two
benchmark datasets.
Related papers
- CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation [8.874033487493913]
Multimodal emotion recognition in conversation aims to accurately identify emotions in conversational utterances.
We propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components.
Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-15T09:23:02Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer)
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Multilevel Transformer For Multimodal Emotion Recognition [6.0149102420697025]
We introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation.
Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition.
arXiv Detail & Related papers (2022-10-26T10:31:24Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z) - Multistage linguistic conditioning of convolutional layers for speech
emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER)
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN)
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
arXiv Detail & Related papers (2021-10-13T11:28:04Z) - Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.