TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2401.12987v2
- Date: Sun, 31 Mar 2024 09:55:51 GMT
- Title: TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation
- Authors: Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song
- Abstract summary: TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students.
We then combine multimodal features using a shifting fusion approach in which student networks support the teacher.
- Score: 0.78452977096722
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
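To make the distillation step described above concrete, here is a minimal PyTorch sketch of response-level cross-modal knowledge distillation using a temperature-scaled KL term plus cross-entropy on the emotion labels. The tensor names, temperature, and loss weighting are illustrative assumptions, not TelME's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft targets from the (frozen) text teacher plus hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# toy usage: 8 utterances, 7 emotion classes (as in MELD)
teacher_logits = torch.randn(8, 7)                       # from the language-model teacher
student_logits = torch.randn(8, 7, requires_grad=True)   # from an audio/visual student
labels = torch.randint(0, 7, (8,))
loss = distillation_loss(student_logits, teacher_logits.detach(), labels)
loss.backward()
```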
Related papers
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations [15.748798247815298]
We propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the Emotion Recognition in Conversations (ERC) task.
MaTAV has the advantages of aligning unimodal features to ensure consistency across different modalities and of handling long input sequences to better capture contextual multimodal information.
arXiv Detail & Related papers (2024-09-08T23:09:22Z)
- Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning [40.101313334772016]
The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information.
Previous ERC methods relied on simple connections for cross-modal fusion.
We propose a cross-modal fusion emotion prediction network based on vector connections.
arXiv Detail & Related papers (2024-05-28T07:22:30Z)
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
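As a rough illustration of the contrastive cross-modal alignment that EP-Align is described as performing, the sketch below applies a symmetric InfoNCE loss over paired text/audio emotion embeddings; the function name, temperature, and pairing scheme are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(text_emb: torch.Tensor,
                      audio_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Pull matching text/audio pairs together, push mismatched pairs apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))          # i-th text matches i-th audio clip
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: a batch of 16 paired utterance embeddings of dimension 256
loss = symmetric_infonce(torch.randn(16, 256), torch.randn(16, 256))
```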
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition [34.24557248359872]
We propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for emotion recognition in conversation.
CFN-ESA consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM).
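Purely to illustrate how such components might be wired together (the module roles are guessed from the acronyms and do not reproduce CFN-ESA), the sketch below chains unimodal GRU encoders, cross-modal attention, and a simple emotion-shift gate.

```python
import torch
import torch.nn as nn

class ToyCrossModalPipeline(nn.Module):
    """Unimodal encoding -> cross-modal attention -> emotion-shift gating (illustrative only)."""
    def __init__(self, dim: int = 128, num_classes: int = 7):
        super().__init__()
        self.text_enc = nn.GRU(dim, dim, batch_first=True)    # stand-in unimodal encoders
        self.audio_enc = nn.GRU(dim, dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.shift_gate = nn.Linear(2 * dim, 1)                # crude emotion-shift signal
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_seq, audio_seq):
        # text_seq, audio_seq: (B, T, dim) utterance sequences for one dialogue
        t, _ = self.text_enc(text_seq)
        a, _ = self.audio_enc(audio_seq)
        fused, _ = self.cross_attn(t, a, a)                    # text attends to audio
        # compare consecutive utterances to flag possible emotion shifts
        prev = torch.cat([fused[:, :1], fused[:, :-1]], dim=1)
        gate = torch.sigmoid(self.shift_gate(torch.cat([fused, prev], dim=-1)))
        return self.classifier(gate * fused)                   # (B, T, num_classes)

logits = ToyCrossModalPipeline()(torch.randn(2, 6, 128), torch.randn(2, 6, 128))
```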
arXiv Detail & Related papers (2023-07-28T09:29:42Z)
- SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention for Emotion Recognition in Conversation [16.505046191280634]
Emotion Recognition in Conversation (ERC) is of vital importance for a variety of applications, including intelligent healthcare, artificial intelligence for conversation, and opinion mining over chat history.
The crux of ERC is to model both cross-modality and cross-time interactions throughout the conversation.
Previous methods have made progress in learning the time-series information of a conversation while lacking the ability to track the different emotional states of each speaker in a conversation.
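To make speaker-specific emotion tracking concrete, the toy sketch below keeps a separate LSTM state per speaker so each utterance only updates its own speaker's context; this is an illustrative simplification, not the SI-LSTM architecture.

```python
import torch
import torch.nn as nn

class SpeakerStateTracker(nn.Module):
    """Toy per-speaker context: one shared LSTMCell, separate states per speaker."""
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, utter_feats, speaker_ids):
        # utter_feats: (T, feat_dim) utterance features in dialogue order
        # speaker_ids: list of T hashable speaker identifiers
        states = {}                       # speaker -> (h, c)
        outputs = []
        for feat, spk in zip(utter_feats, speaker_ids):
            h, c = states.get(spk, (feat.new_zeros(1, self.hidden_dim),
                                    feat.new_zeros(1, self.hidden_dim)))
            h, c = self.cell(feat.unsqueeze(0), (h, c))
            states[spk] = (h, c)
            outputs.append(h.squeeze(0))  # speaker-aware context for this utterance
        return torch.stack(outputs)       # (T, hidden_dim)

# toy usage: a 5-utterance dialogue between two speakers
ctx = SpeakerStateTracker()(torch.randn(5, 128), ["A", "B", "A", "A", "B"])
```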
arXiv Detail & Related papers (2023-05-04T10:13:15Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
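As a sketch of a triplet objective on audio/visual features, the snippet below uses a plain fixed-margin triplet loss; M2FNet's adaptive margin is not reproduced here, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Pull same-emotion pairs together, push different-emotion pairs apart."""
    d_pos = F.pairwise_distance(anchor, positive)   # distance to a same-emotion sample
    d_neg = F.pairwise_distance(anchor, negative)   # distance to a different-emotion sample
    return F.relu(d_pos - d_neg + margin).mean()

# toy usage with 32 audio-feature triplets of dimension 512
loss = triplet_loss(torch.randn(32, 512), torch.randn(32, 512), torch.randn(32, 512))
```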
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database [139.08528216461502]
We propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED.
M3ED contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.
To the best of our knowledge, M3ED is the first multimodal emotional dialogue dataset in Chinese.
arXiv Detail & Related papers (2022-05-09T06:52:51Z)
- MMER: Multimodal Multi-task Learning for Speech Emotion Recognition [48.32879363033598]
MMER is a novel Multimodal Multi-task learning approach for Speech Emotion Recognition.
In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark.
arXiv Detail & Related papers (2022-03-31T04:51:32Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
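A minimal sketch of late fusion in this spirit, assuming utterance-level logits from separately fine-tuned speech and text models; the module and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate modality-specific logits and learn a small fusion head."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.fusion = nn.Linear(2 * num_classes, num_classes)

    def forward(self, speech_logits: torch.Tensor, text_logits: torch.Tensor):
        # speech_logits / text_logits: (B, num_classes) from frozen unimodal models
        return self.fusion(torch.cat([speech_logits, text_logits], dim=-1))

# toy usage: 4 emotion classes, batch of 8 utterances
fused = LateFusionClassifier()(torch.randn(8, 4), torch.randn(8, 4))
```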
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
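A rough sketch of the attribute-selection idea, assuming the latent code is simply split into an emotion block and a speaker block so each downstream task keeps only the relevant nodes; this simplification omits the variational machinery of the actual LR-VAE.

```python
import torch
import torch.nn as nn

class AttributePartitionedEncoder(nn.Module):
    """Encode speech features into a latent split into emotion and speaker blocks."""
    def __init__(self, feat_dim: int = 80, emo_dim: int = 32, spk_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, emo_dim + spk_dim))
        self.emo_dim = emo_dim

    def forward(self, feats: torch.Tensor):
        z = self.encoder(feats)
        # attribute selection: keep emotion nodes for SER, speaker nodes for SV
        return z[..., :self.emo_dim], z[..., self.emo_dim:]

# toy usage: a batch of 8 utterance-level acoustic features
z_emotion, z_speaker = AttributePartitionedEncoder()(torch.randn(8, 80))
```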
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.