TransModality: An End2End Fusion Method with Transformer for Multimodal
Sentiment Analysis
- URL: http://arxiv.org/abs/2009.02902v2
- Date: Mon, 28 Sep 2020 04:44:54 GMT
- Authors: Zilong Wang, Zhaohong Wan, and Xiaojun Wan
- Abstract summary: We propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis.
We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentiment analysis is an important research area that predicts
a speaker's sentiment tendency from features extracted from the textual, visual,
and acoustic modalities. The central challenge is how to fuse the multimodal
information. A variety of fusion methods have been proposed, but few of them
adopt end-to-end translation models to mine the subtle correlations between
modalities. Inspired by the recent success of the Transformer in machine
translation, we propose a new fusion method, TransModality, to address the task
of multimodal sentiment analysis. We assume that translation between modalities
contributes to a better joint representation of a speaker's utterance. With the
Transformer, the learned features embody information from both the source and
the target modality. We validate our model on multiple multimodal datasets:
CMU-MOSI, MELD, and IEMOCAP. The experiments show that our proposed method
achieves state-of-the-art performance.
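The core idea above, that translating from one modality to another with a Transformer yields features carrying information from both the source and the target modality, can be illustrated with a minimal NumPy sketch of a single cross-modal attention step. This is not the authors' implementation; the function name and the tiny random "text" and "visual" sequences are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(src, tgt, d_k):
    """One attention step of a modality-translation Transformer:
    queries come from the target modality, keys/values from the
    source, so each output vector mixes information from both."""
    scores = tgt @ src.T / np.sqrt(d_k)   # (T_tgt, T_src)
    weights = softmax(scores, axis=-1)    # rows sum to 1
    return weights @ src                  # (T_tgt, d_k)

# toy example: 3 text-step queries attend over 4 visual frames, dim 8
rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))
visual = rng.normal(size=(4, 8))
fused = cross_modal_attention(visual, text, d_k=8)
print(fused.shape)  # (3, 8)
```

Each row of `fused` is a convex combination of visual frames selected by a text query, which is the sense in which the learned representation "embodies" both modalities; the full model stacks such layers with feed-forward blocks and learned projections.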
Related papers
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability.
Feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
arXiv Detail & Related papers (2021-12-02T16:09:33Z)
- Multistage linguistic conditioning of convolutional layers for speech emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
arXiv Detail & Related papers (2021-10-13T11:28:04Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model with pre-trained+fine-tuning and fresh-training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
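MISA's two-subspace idea, projecting each modality once through a shared (modality-invariant) projector and once through its own (modality-specific) projector, can be sketched in a few lines of NumPy. This is a structural illustration only, with random weights standing in for the learned projections and auxiliary losses (similarity, orthogonality, reconstruction) omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_sub = 16, 8

# hypothetical projection weights; in MISA these are learned, with losses
# pulling the shared-space vectors together and keeping each private
# space distinct from the shared one
W_shared = rng.normal(size=(d_in, d_sub))          # one shared projector
W_private = {m: rng.normal(size=(d_in, d_sub))     # one projector per modality
             for m in ("text", "visual", "acoustic")}

# toy per-modality utterance features
features = {m: rng.normal(size=(d_in,)) for m in W_private}

invariant = {m: features[m] @ W_shared for m in features}      # commonalities
specific = {m: features[m] @ W_private[m] for m in features}   # distinct traits

# the fused utterance representation combines all six projected vectors
fused = np.concatenate([*invariant.values(), *specific.values()])
print(fused.shape)  # (48,)
```

Splitting each modality this way lets the downstream sentiment classifier draw on what the modalities share (reducing the modality gap) without discarding what is unique to each.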
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.