Related papers: MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

URL: http://arxiv.org/abs/2005.03545v3
Date: Mon, 19 Oct 2020 13:41:27 GMT
Title: MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Authors: Devamanyu Hazarika, Roger Zimmermann, Soujanya Poria
Abstract summary: We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
Score: 48.776247141839875
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach, addressing this task, has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.

Related papers

Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines [63.427721165404634]
This paper investigates an intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M$2$RAG) This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries. We construct a benchmark for M$2$RAG task, equipped with a suite of text-modal metrics and multi-modal metrics to analyze the capabilities of existing foundation models.
arXiv Detail & Related papers (2024-11-25T13:20:19Z)
Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations. We introduce a predictive self-attention module to capture reliable context dynamics within modalities. A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities. A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation [10.44888349041063]
We introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality.
arXiv Detail & Related papers (2023-12-04T02:58:19Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
Multimodal Representations Learning Based on Mutual Information Maximization and Minimization and Identity Embedding for Multimodal Sentiment Analysis [33.73730195500633]
We propose a multimodal representation model based on Mutual information Maximization and Identity Embedding. Experimental results on two public datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-01-10T01:41:39Z)
Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations. Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation. Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
Improving Multimodal fusion via Mutual Dependency Maximisation [5.73995120847626]
Multimodal sentiment analysis is a trending area of research, and the multimodal fusion is one of its most active topic. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to $4.3$ on accuracy) across a large variety of state-of-the-art models.
arXiv Detail & Related papers (2021-08-31T06:26:26Z)
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.