SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention
for Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2305.03506v3
- Date: Tue, 6 Jun 2023 12:19:35 GMT
- Title: SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention
for Emotion Recognition in Conversation
- Authors: Xingwei Liang, You Zou, Ruifeng Xu
- Abstract summary: Emotion Recognition in Conversation(ERC) is of vital importance for a variety of applications, including intelligent healthcare, artificial intelligence for conversation, and opinion mining over chat history.
The crux of ERC is to model both cross-modality and cross-time interactions throughout the conversation.
Previous methods have made progress in learning the time-series information of a conversation but lack the ability to track the distinct emotional states of each speaker.
- Score: 16.505046191280634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion Recognition in Conversation (ERC) across modalities is of vital
importance for a variety of applications, including intelligent healthcare,
artificial intelligence for conversation, and opinion mining over chat history.
The crux of ERC is to model both cross-modality and cross-time interactions
throughout the conversation. Previous methods have made progress in learning
the time-series information of a conversation but lack the ability to track the
distinct emotional states of each speaker. In this
paper, we propose a recurrent structure called Speaker Information Enhanced
Long-Short Term Memory (SI-LSTM) for the ERC task, where the emotional states
of each distinct speaker are tracked sequentially to enhance the learning of
emotion in conversation. Further, to improve the learning of multimodal
features in ERC, we utilize a cross-modal attention component to fuse features
across modalities and to model the interaction of their salient information.
Experimental results on two
benchmark datasets demonstrate the superiority of the proposed SI-LSTM against
the state-of-the-art baseline methods in the ERC task on multimodal data.
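The abstract describes the architecture only at a high level. As a rough illustration of the two components it names, here is a minimal PyTorch sketch, not the authors' implementation, of (1) tracking a separate recurrent emotional state per speaker and (2) fusing modalities with cross-modal attention; all dimensions, the single LSTM cell, the text-queries-audio attention direction, and the six-class output are illustrative assumptions.

```python
# Minimal sketch (not the SI-LSTM authors' code) of the two ideas in the
# abstract: per-speaker recurrent state tracking and cross-modal attention
# fusion. All sizes and design choices below are illustrative assumptions.
import torch
import torch.nn as nn


class SpeakerTrackedLSTM(nn.Module):
    """Keeps one LSTM (h, c) state per speaker and updates only the state of
    the speaker of the current utterance."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, utterances: torch.Tensor, speaker_ids: list[int]):
        # utterances: (T, input_dim) fused features, one row per utterance
        states = {}  # speaker_id -> (h, c)
        outputs = []
        for t, spk in enumerate(speaker_ids):
            h, c = states.get(
                spk,
                (torch.zeros(1, self.hidden_dim),
                 torch.zeros(1, self.hidden_dim)),
            )
            h, c = self.cell(utterances[t].unsqueeze(0), (h, c))
            states[spk] = (h, c)
            outputs.append(h)
        return torch.cat(outputs, dim=0)  # (T, hidden_dim)


class CrossModalAttentionFusion(nn.Module):
    """Text features attend over audio features; the attended audio is
    concatenated back onto the text features."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, audio: torch.Tensor):
        # text, audio: (T, dim) utterance-level features for one conversation
        attended, _ = self.attn(text.unsqueeze(0), audio.unsqueeze(0),
                                audio.unsqueeze(0))
        return torch.cat([text, attended.squeeze(0)], dim=-1)  # (T, 2*dim)


# Tiny usage example on random features for a 4-utterance, 2-speaker dialogue.
if __name__ == "__main__":
    T, dim, hidden, n_classes = 4, 64, 128, 6
    text_feats, audio_feats = torch.randn(T, dim), torch.randn(T, dim)
    fused = CrossModalAttentionFusion(dim)(text_feats, audio_feats)   # (4, 128)
    tracked = SpeakerTrackedLSTM(2 * dim, hidden)(fused, [0, 1, 0, 1])
    logits = nn.Linear(hidden, n_classes)(tracked)                    # (4, 6)
    print(logits.shape)
```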
Related papers
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations [15.748798247815298]
We propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the Emotion Recognition in Conversations (ERC) task.
MaTAV aligns unimodal features to ensure consistency across different modalities and handles long input sequences to better capture contextual multimodal information.
arXiv Detail & Related papers (2024-09-08T23:09:22Z)
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation [0.78452977096722]
TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students.
We then combine multimodal features using a shifting fusion approach in which student networks support the teacher (a generic distillation sketch appears after this list).
arXiv Detail & Related papers (2024-01-16T07:18:41Z)
- Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction [2.1261712640167856]
Emotion recognition is a crucial task for human conversation understanding.
We propose a Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT).
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
arXiv Detail & Related papers (2023-11-08T07:46:25Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing the system with additional cues from different modalities during information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
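The TelME entry above mentions cross-modal knowledge distillation from a language-model teacher to non-verbal (audio/visual) students. As a generic illustration of that idea, not TelME's actual training code, a response-level distillation loss that blends a softened KL term with hard-label cross-entropy might look like the sketch below; the temperature, loss weighting, and random features standing in for real encoder outputs are all assumptions.

```python
# Generic response-level knowledge distillation between a text "teacher" and a
# non-verbal "student" classifier, as an illustration of the idea described in
# the TelME entry above; hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with a soft KL term that pushes the
    student's emotion distribution toward the (frozen) teacher's."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2          # standard temperature scaling
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    batch, n_classes = 8, 7
    teacher_logits = torch.randn(batch, n_classes)        # e.g. text encoder output
    student_logits = torch.randn(batch, n_classes,
                                 requires_grad=True)      # e.g. audio/visual student
    labels = torch.randint(0, n_classes, (batch,))
    loss = distillation_loss(student_logits, teacher_logits.detach(), labels)
    loss.backward()
    print(float(loss))
```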