SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention
for Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2305.03506v3
- Date: Tue, 6 Jun 2023 12:19:35 GMT
- Title: SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention
for Emotion Recognition in Conversation
- Authors: Xingwei Liang, You Zou, Ruifeng Xu
- Abstract summary: Emotion Recognition in Conversation(ERC) is of vital importance for a variety of applications, including intelligent healthcare, artificial intelligence for conversation, and opinion mining over chat history.
The crux of ERC is to model both cross-modality and cross-time interactions throughout the conversation.
Previous methods have made progress in learning the time series information of conversation while lacking the ability to trace down the different emotional states of each speaker in a conversation.
- Score: 16.505046191280634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion Recognition in Conversation~(ERC) across modalities is of vital
importance for a variety of applications, including intelligent healthcare,
artificial intelligence for conversation, and opinion mining over chat history.
The crux of ERC is to model both cross-modality and cross-time interactions
throughout the conversation. Previous methods have made progress in learning
the time series information of conversation while lacking the ability to trace
down the different emotional states of each speaker in a conversation. In this
paper, we propose a recurrent structure called Speaker Information Enhanced
Long-Short Term Memory (SI-LSTM) for the ERC task, where the emotional states
of the distinct speaker can be tracked in a sequential way to enhance the
learning of the emotion in conversation. Further, to improve the learning of
multimodal features in ERC, we utilize a cross-modal attention component to
fuse the features between different modalities and model the interaction of the
important information from different modalities. Experimental results on two
benchmark datasets demonstrate the superiority of the proposed SI-LSTM against
the state-of-the-art baseline methods in the ERC task on multimodal data.
Related papers
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations [15.748798247815298]
We propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the Emotion Recognition in Conversations (ERC) task.
MaTAV is with the advantages of aligning unimodal features to ensure consistency across different modalities and handling long input sequences to better capture contextual multimodal information.
arXiv Detail & Related papers (2024-09-08T23:09:22Z) - MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation [0.78452977096722]
TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students.
We then combine multimodal features using a shifting fusion approach in which student networks support the teacher.
arXiv Detail & Related papers (2024-01-16T07:18:41Z) - Conversation Understanding using Relational Temporal Graph Neural
Networks with Auxiliary Cross-Modality Interaction [2.1261712640167856]
Emotion recognition is a crucial task for human conversation understanding.
We propose an input Temporal Graph Neural Network with Cross-Modality Interaction (CORECT)
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
arXiv Detail & Related papers (2023-11-08T07:46:25Z) - Revisiting Disentanglement and Fusion on Modality and Context in
Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism ( CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z) - A Low-rank Matching Attention based Cross-modal Feature Fusion Method for Conversational Emotion Recognition [54.44337276044968]
We introduce a novel and lightweight cross-modal feature fusion method called Low-Rank Matching Attention Method (LMAM)
LMAM effectively captures contextual emotional semantic information in conversations while mitigating the quadratic complexity issue caused by the self-attention mechanism.
Experimental results verify the superiority of LMAM compared with other popular cross-modal fusion methods on the premise of being more lightweight.
arXiv Detail & Related papers (2023-06-16T16:02:44Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.