M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
 - URL: http://arxiv.org/abs/2206.02187v1
 - Date: Sun, 5 Jun 2022 14:18:58 GMT
 - Title: M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
 - Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe
 - Abstract summary: We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
 - Score: 1.3864478040954673
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Emotion Recognition in Conversations (ERC) is crucial in developing
sympathetic human-machine interaction. In conversational videos, emotion can be
present in multiple modalities, i.e., audio, video, and transcript. However,
due to the inherent characteristics of these modalities, multi-modal ERC has
always been considered a challenging undertaking. Existing ERC research focuses
mainly on using text information in a discussion, ignoring the other two
modalities. We anticipate that emotion recognition accuracy can be improved by
employing a multi-modal approach. Thus, in this study, we propose a Multi-modal
Fusion Network (M2FNet) that extracts emotion-relevant features from the visual,
audio, and text modalities. It employs a multi-head attention-based fusion
mechanism to combine emotion-rich latent representations of the input data. We
introduce a new feature extractor to extract latent features from the audio and
visual modalities. The proposed feature extractor is trained with a novel
adaptive margin-based triplet loss function to learn emotion-relevant features
from the audio and visual data. In the domain of ERC, the existing methods
perform well on one benchmark dataset but not on others. Our results show that
the proposed M2FNet architecture outperforms all other methods in terms of
weighted average F1 score on the well-known MELD and IEMOCAP datasets and sets a
new state of the art in ERC.
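The abstract names two mechanisms, a multi-head attention-based fusion and an adaptive margin-based triplet loss, without giving implementation details. The following is a minimal, hypothetical PyTorch sketch of both ideas; the margin-adaptation rule (growing with anchor-negative similarity), the embedding size `d_model`, the head count `n_heads`, and the mean pooling are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch only; M2FNet's exact formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.5):
    """Triplet loss whose margin grows with anchor-negative similarity.

    Assumed adaptation rule: harder negatives (more similar to the anchor)
    demand a larger separation between positive and negative distances.
    """
    d_ap = F.pairwise_distance(anchor, positive)      # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)      # anchor-negative distance
    sim_an = F.cosine_similarity(anchor, negative)    # similarity in [-1, 1]
    margin = base_margin + scale * torch.clamp(sim_an, min=0.0)
    return F.relu(d_ap - d_an + margin).mean()


class AttentionFusion(nn.Module):
    """Multi-head attention fusion over per-modality utterance embeddings."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feat, audio_feat, visual_feat):
        # Treat the three modalities as a length-3 sequence: (batch, 3, d_model)
        x = torch.stack([text_feat, audio_feat, visual_feat], dim=1)
        fused, _ = self.attn(x, x, x)    # modalities attend to one another
        fused = self.norm(fused + x)     # residual connection + layer norm
        return fused.mean(dim=1)         # pooled multimodal representation
```

Calling `AttentionFusion()(t, a, v)` on three (batch, 256) per-modality embeddings returns a single (batch, 256) fused representation that a downstream emotion classifier could consume.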
 
       
      
        Related papers
- Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition [15.077653455298707]
This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition.
We propose a multimodal emotion recognition method that fuses the features of a Vision Transformer (ViT) and a Residual Network (ResNet).
The results show that in scenarios with complex visual and audio cues, the model that fuses the features of ViT and ResNet exhibits superior performance.
arXiv Detail & Related papers (2025-03-21T18:03:44Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations [15.748798247815298]
We propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the Emotion Recognition in Conversations (ERC) task.
MaTAV has the advantages of aligning unimodal features to ensure consistency across different modalities and of handling long input sequences to better capture contextual multimodal information.
arXiv Detail & Related papers (2024-09-08T23:09:22Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset [74.74686464187474]
Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history.
MC-EIU is an enabling technology for many human-computer interfaces.
We propose an MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities (textual, acoustic, and visual), and two languages (English and Mandarin).
arXiv Detail & Related papers (2024-07-03T01:56:00Z)
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted with our AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from the speech and text modalities (a minimal late-fusion sketch appears after this list).
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Shapes of Emotions: Multimodal Emotion Recognition in Conversations via Emotion Shifts [2.443125107575822]
Emotion Recognition in Conversations (ERC) is an important and active research problem.
Recent work has shown the benefits of using multiple modalities for the ERC task.
We propose a multimodal ERC model and augment it with an emotion-shift component.
arXiv Detail & Related papers (2021-12-03T14:39:04Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition [62.48806555665122]
We describe our approaches in EmotiW 2019, which mainly explores emotion features and feature fusion strategies for audio and visual modality.
With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, ranking third in the challenge.
arXiv Detail & Related papers (2020-12-27T10:50:24Z)
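Several entries above fuse modalities at the decision level rather than the feature level; the transfer-learning entry referenced earlier describes a late fusion of speech and text models. Below is a minimal, hypothetical sketch of that late-fusion pattern; the encoder output sizes, class count, and learnable weighting are placeholder assumptions rather than that paper's exact architecture.

```python
# Hypothetical late-fusion sketch; not the referenced paper's actual model.
import torch
import torch.nn as nn


class LateFusionERC(nn.Module):
    """Decision-level (late) fusion of per-modality emotion predictions."""

    def __init__(self, speech_dim=512, text_dim=768, n_classes=6):
        super().__init__()
        # Stand-ins for classification heads on top of a transfer-learned
        # speech encoder and a fine-tuned BERT-style text encoder.
        self.speech_head = nn.Linear(speech_dim, n_classes)
        self.text_head = nn.Linear(text_dim, n_classes)
        # Learnable weights for combining the two modality-level predictions.
        self.fusion_weights = nn.Parameter(torch.ones(2))

    def forward(self, speech_emb, text_emb):
        logits_speech = self.speech_head(speech_emb)
        logits_text = self.text_head(text_emb)
        w = torch.softmax(self.fusion_weights, dim=0)
        return w[0] * logits_speech + w[1] * logits_text  # fused class logits
```

Unlike the attention-based fusion sketched after the M2FNet abstract, this variant combines per-modality logits, so each encoder can be trained or fine-tuned largely independently before fusion.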
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     