A Transformer-Based Model With Self-Distillation for Multimodal Emotion
Recognition in Conversations
- URL: http://arxiv.org/abs/2310.20494v1
- Date: Tue, 31 Oct 2023 14:33:30 GMT
- Title: A Transformer-Based Model With Self-Distillation for Multimodal Emotion
Recognition in Conversations
- Authors: Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, Bo Xu
- Abstract summary: We propose a transformer-based model with self-distillation (SDT) for the task.
The proposed model captures intra- and inter-modal interactions by utilizing intra- and inter-modal transformers.
We introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality.
- Score: 15.77747948751497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion recognition in conversations (ERC), the task of recognizing the
emotion of each utterance in a conversation, is crucial for building empathetic
machines. Existing studies focus mainly on capturing context- and
speaker-sensitive dependencies in the textual modality but ignore the
significance of multimodal information. Unlike emotion recognition in textual
conversations, multimodal ERC also requires capturing intra- and inter-modal
interactions between utterances, learning weights between different modalities,
and enhancing modal representations. In this paper, we
propose a transformer-based model with self-distillation (SDT) for the task.
The transformer-based model captures intra- and inter-modal interactions by
utilizing intra- and inter-modal transformers, and learns weights between
modalities dynamically by designing a hierarchical gated fusion strategy.
Furthermore, to learn more expressive modal representations, we treat soft
labels of the proposed model as extra training supervision. Specifically, we
introduce self-distillation to transfer knowledge of hard and soft labels from
the proposed model to each modality. Experiments on IEMOCAP and MELD datasets
demonstrate that SDT outperforms previous state-of-the-art baselines.
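As a rough illustration of two of the mechanisms named in the abstract, the gated fusion that weights modalities per sample and the self-distillation that supervises each modality branch with hard and soft labels, here is a minimal PyTorch-style sketch. The module names, loss weights, and temperature are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch, not the paper's implementation: a simple gated fusion over
# modality features and a hard+soft label self-distillation loss in which the
# fused model's softened predictions act as soft targets for each modality branch.
import torch
import torch.nn.functional as F


class GatedFusion(torch.nn.Module):
    """Learns a per-sample weight for each modality and returns the weighted sum
    (a stand-in for the paper's hierarchical gated fusion; details are assumed)."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.gate = torch.nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats):  # feats: list of (batch, dim) modality representations
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (batch, M)
        stacked = torch.stack(feats, dim=1)                                   # (batch, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                   # (batch, dim)


def self_distillation_loss(fused_logits, modal_logits, labels,
                           temperature=2.0, alpha=1.0, beta=1.0):
    """fused_logits: (batch, classes) from the full multimodal model.
    modal_logits: dict of per-modality (batch, classes) logits.
    labels: (batch,) ground-truth emotion indices ("hard" labels)."""
    loss = F.cross_entropy(fused_logits, labels)  # hard-label loss for the fused prediction
    # "Soft" labels are the fused model's softened predictions; gradients are stopped.
    soft = F.softmax(fused_logits.detach() / temperature, dim=-1)
    for logits in modal_logits.values():
        loss = loss + alpha * F.cross_entropy(logits, labels)  # hard labels for each branch
        log_p = F.log_softmax(logits / temperature, dim=-1)
        loss = loss + beta * temperature ** 2 * F.kl_div(log_p, soft, reduction="batchmean")
    return loss
```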
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
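(A generic cross-attention fusion sketch, for illustration only, appears after this related-papers list.)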
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability.
Feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- DialogueTRM: Exploring the Intra- and Inter-Modal Emotional Behaviors in the Conversation [20.691806885663848]
We propose the DialogueTransformer to explore the differentiated emotional behaviors from the intra- and inter-modal perspectives.
For intra-modal, we construct a novel Hierarchical Transformer that can easily switch between sequential and feed-forward structures.
For inter-modal, we constitute a novel Multi-Grained Interactive Fusion that applies both neuron- and vector-grained feature interactions.
arXiv Detail & Related papers (2020-10-15T10:10:41Z)
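Several of the entries above, such as the inter-modal transformers of SDT and the key-based cross-attention fusion in the joint multimodal transformer, build on cross-modal attention, where one modality's features attend over another's. The sketch below is a generic illustration with assumed shapes and module names, not any specific paper's design.

```python
# Generic cross-modal attention block (illustrative only): one modality supplies
# the queries, another supplies the keys and values.
import torch


class CrossModalAttention(torch.nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats: (batch, Tq, dim), e.g. text utterance features
        # context_feats: (batch, Tk, dim), e.g. audio or visual features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection + layer norm


# Example: enrich text features with audio context (random tensors stand in for real features).
text = torch.randn(2, 10, 256)
audio = torch.randn(2, 50, 256)
fused = CrossModalAttention(256)(text, audio)  # shape (2, 10, 256)
```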