MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2201.09828v1
- Date: Mon, 24 Jan 2022 17:48:04 GMT
- Title: MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis
- Authors: Georgios Paraskevopoulos, Efthymios Georgiou, Alexandros Potamianos
- Abstract summary: Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
- Score: 84.7287684402508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current deep learning approaches for multimodal fusion rely on bottom-up
fusion of high and mid-level latent modality representations (late/mid fusion)
or low level sensory inputs (early fusion). Models of human perception
highlight the importance of top-down fusion, where high-level representations
affect the way sensory inputs are perceived, i.e. cognition affects perception.
These top-down interactions are not captured in current deep learning models.
In this work we propose a neural architecture that captures top-down
cross-modal interactions, using a feedback mechanism in the forward pass during
network training. The proposed mechanism extracts high-level representations
for each modality and uses these representations to mask the sensory inputs,
allowing the model to perform top-down feature masking. We apply the proposed
model to multimodal sentiment recognition on CMU-MOSEI. Our method shows
consistent improvements over the well-established MulT and over our strong
late fusion baseline, achieving state-of-the-art results.
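To make the described mechanism concrete, below is a minimal PyTorch sketch of top-down feature masking on top of a late fusion model. It is an illustration based only on the abstract, not the authors' code: the module names, GRU encoders, mean pooling, and the cyclic choice of which modality masks which are all assumptions.

```python
import torch
import torch.nn as nn


class TopDownMask(nn.Module):
    """Projects a high-level summary of one modality into a gate over another
    modality's low-level features (hypothetical module, for illustration)."""

    def __init__(self, high_dim: int, low_dim: int):
        super().__init__()
        self.proj = nn.Linear(high_dim, low_dim)

    def forward(self, high_level: torch.Tensor, low_level: torch.Tensor) -> torch.Tensor:
        # high_level: (batch, high_dim) pooled top-level representation
        # low_level:  (batch, time, low_dim) sensory features to be gated
        gate = torch.sigmoid(self.proj(high_level)).unsqueeze(1)  # (batch, 1, low_dim)
        return low_level * gate  # element-wise top-down feature masking


class LateFusionWithFeedback(nn.Module):
    """Two encoding passes: a bottom-up pass produces high-level summaries,
    which then mask another modality's inputs before a second, top-down-informed pass."""

    def __init__(self, dims: dict, hidden: int = 64):
        super().__init__()
        self.names = list(dims)
        self.encoders = nn.ModuleDict({m: nn.GRU(d, hidden, batch_first=True) for m, d in dims.items()})
        self.masks = nn.ModuleDict({m: TopDownMask(hidden, d) for m, d in dims.items()})
        self.classifier = nn.Linear(hidden * len(dims), 1)

    def encode(self, x: torch.Tensor, m: str) -> torch.Tensor:
        out, _ = self.encoders[m](x)
        return out.mean(dim=1)  # (batch, hidden) pooled high-level representation

    def forward(self, inputs: dict) -> torch.Tensor:
        # Bottom-up pass: high-level representation per modality.
        high = {m: self.encode(x, m) for m, x in inputs.items()}
        # Top-down feedback: each modality's inputs are masked using the high-level
        # representation of the next modality (cyclic pairing is an illustrative
        # simplification, not the paper's exact routing).
        masked = {m: self.masks[m](high[self.names[(i + 1) % len(self.names)]], inputs[m])
                  for i, m in enumerate(self.names)}
        # Second bottom-up pass on the masked inputs, followed by late fusion.
        fused = torch.cat([self.encode(x, m) for m, x in masked.items()], dim=-1)
        return self.classifier(fused)


# Toy usage with assumed text/audio/visual feature sizes:
model = LateFusionWithFeedback({"text": 300, "audio": 74, "visual": 35})
batch = {m: torch.randn(8, 20, d) for m, d in [("text", 300), ("audio", 74), ("visual", 35)]}
print(model(batch).shape)  # torch.Size([8, 1])
```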
Related papers
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, setting a new state of the art.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Latent Emotion Recognition from Micro-expression and Physiological Signals [11.05207353295191]
The paper discusses the benefits of incorporating multimodal data for improving latent emotion recognition accuracy, focusing on micro-expression (ME) and physiological signals (PS).
The proposed approach presents a novel multimodal learning framework that combines ME and PS, including a 1D separable and mixable depthwise inception network.
Experimental results show that the proposed approach outperforms the benchmark method, with the weighted fusion method and guided attention modules both contributing to enhanced performance.
arXiv Detail & Related papers (2023-08-23T14:17:44Z)
- Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and Beyond [50.556961575275345]
We build an image fusion module to fuse complementary characteristics and cascade dual task-related modules.
We develop an efficient first-order approximation to compute corresponding gradients and present dynamic weighted aggregation to balance the gradients for fusion learning.
arXiv Detail & Related papers (2023-05-11T10:55:34Z)
- Progressive Fusion for Multimodal Integration [12.94175198001421]
We present an iterative representation refinement approach, called Progressive Fusion, which mitigates the issues with late fusion representations.
We show that our approach consistently improves performance, for instance attaining a 5% reduction in MSE and 40% improvement in robustness on multimodal time series prediction.
arXiv Detail & Related papers (2022-09-01T09:08:33Z)
- Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
The most effective techniques for emotion recognition leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities (a minimal sketch of this idea appears after this list).
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-11-09T16:01:56Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of the proposed model for more accurate emotion recognition, yielding state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- Low Rank Fusion based Transformers for Multimodal Sequences [9.507869508188266]
We present two fusion methods for multimodal sentiment and emotion recognition, evaluated on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets.
We show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
arXiv Detail & Related papers (2020-07-04T08:05:40Z)
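The cross-attentional audio-visual fusion idea from the entry above (Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition) can be sketched in a few lines. This is a generic, assumed formulation using standard multi-head attention and an assumed valence/arousal regression head, not that paper's implementation:

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Each modality attends over the other; the attended streams are pooled and fused."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)  # assumed valence/arousal regression head

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim) sequences in a shared feature dimension
        a_att, _ = self.a_from_v(audio, visual, visual)  # audio queries attend to visual
        v_att, _ = self.v_from_a(visual, audio, audio)   # visual queries attend to audio
        fused = torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
        return self.head(fused)


# Toy usage: 8 clips, 50 audio frames and 30 video frames, 128-d features each.
fusion = CrossAttentionFusion(dim=128)
out = fusion(torch.randn(8, 50, 128), torch.randn(8, 30, 128))
print(out.shape)  # torch.Size([8, 2])
```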