Low Rank Fusion based Transformers for Multimodal Sequences
- URL: http://arxiv.org/abs/2007.02038v1
- Date: Sat, 4 Jul 2020 08:05:40 GMT
- Title: Low Rank Fusion based Transformers for Multimodal Sequences
- Authors: Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman
- Abstract summary: We present two methods for Multimodal Sentiment and Emotion Recognition and report results on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets.
We show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
- Score: 9.507869508188266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our senses individually work in a coordinated fashion to express our
emotional intentions. In this work, we experiment with modeling
modality-specific sensory signals to attend to our latent multimodal emotional
intentions, and vice versa, expressed via low-rank multimodal fusion and
multimodal transformers. The low-rank factorization of multimodal fusion
amongst the modalities helps represent approximate multiplicative latent signal
interactions. Motivated by the work of Tsai et al. (2019) and Liu et al. (2018),
we present our transformer-based cross-fusion architecture without any
over-parameterization of the model. The low-rank fusion helps represent the
latent signal interactions while the modality-specific attention helps focus on
relevant parts of the signal. We present two methods for Multimodal Sentiment and Emotion Recognition, report results on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets, and show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
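To make the low-rank fusion idea concrete, below is a minimal PyTorch sketch in the spirit of Liu et al. (2018): each modality vector (with a constant 1 appended) is projected by a modality-specific low-rank factor, the projections are multiplied elementwise to approximate the full tensor-product interaction, and the rank components are summed with learned weights. The class name, feature dimensions, and rank are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Illustrative low-rank multimodal fusion (not the paper's released code)."""
    def __init__(self, dims, out_dim, rank=4):
        super().__init__()
        # One low-rank factor per modality: maps (d_m + 1) -> rank * out_dim.
        # The appended constant 1 lets unimodal terms survive the product.
        self.factors = nn.ModuleList(
            [nn.Linear(d + 1, rank * out_dim, bias=False) for d in dims]
        )
        self.rank, self.out_dim = rank, out_dim
        # Learned weights that sum the rank-specific fusion outputs.
        self.rank_weights = nn.Parameter(torch.randn(rank))

    def forward(self, xs):
        # xs: one (batch, d_m) feature tensor per modality
        fused = None
        for x, factor in zip(xs, self.factors):
            ones = torch.ones(x.size(0), 1, device=x.device)
            h = factor(torch.cat([x, ones], dim=-1))      # (batch, rank * out_dim)
            h = h.view(-1, self.rank, self.out_dim)       # (batch, rank, out_dim)
            fused = h if fused is None else fused * h     # multiplicative interactions
        # Weighted sum over the rank dimension.
        return torch.einsum('r,bro->bo', self.rank_weights, fused)

# Illustrative dimensions for text / acoustic / visual features:
fusion = LowRankFusion(dims=[300, 74, 35], out_dim=64, rank=4)
z = fusion([torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)])
print(z.shape)  # torch.Size([8, 64])
```

Because the rank is small and each factor acts on a single modality, the parameter count stays far below that of materialising the full tensor product, which is what underlies the efficiency claim in the abstract.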
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
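To illustrate the kind of cross-attention such fusion transformers build on, here is a generic PyTorch block in which one modality supplies the queries and another supplies the keys and values; the layer sizes and the residual/feed-forward arrangement are assumptions for illustration, not the JMT authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One modality (queries) attends over another modality (keys/values)."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, query_mod, context_mod):
        # query_mod:   (batch, len_q, dim), e.g. text tokens
        # context_mod: (batch, len_kv, dim), e.g. audio or visual frames
        attended, _ = self.attn(query_mod, context_mod, context_mod)
        x = self.norm1(query_mod + attended)      # residual + norm
        return self.norm2(x + self.ff(x))         # position-wise feed-forward

# Text attends to audio: each text token gathers relevant acoustic context.
block = CrossModalAttention(dim=64, num_heads=4)
out = block(torch.randn(2, 20, 64), torch.randn(2, 50, 64))
print(out.shape)  # torch.Size([2, 20, 64])
```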
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability, while feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
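As a rough illustration of the deep-equilibrium view, the sketch below treats the fused representation as the fixed point of a small fusion cell and finds it by naive forward iteration; the actual method backpropagates through the equilibrium implicitly and uses a proper root-finding solver, so the cell design, tolerance, and iteration scheme here are only assumptions.

```python
import torch
import torch.nn as nn

class DEQFusionCell(nn.Module):
    """One update step of a fused state z given the modality inputs."""
    def __init__(self, dims, hidden=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.mix = nn.Linear(hidden, hidden)

    def forward(self, z, xs):
        injected = sum(p(x) for p, x in zip(self.proj, xs))   # inject modality evidence
        return torch.tanh(self.mix(z) + injected)

def deq_fuse(cell, xs, hidden=64, max_iter=50, tol=1e-4):
    # Naive forward iteration to an (approximate) fixed point z* = cell(z*, xs).
    z = torch.zeros(xs[0].size(0), hidden)
    for _ in range(max_iter):
        z_next = cell(z, xs)
        if (z_next - z).norm() < tol * (z.norm() + 1e-8):
            return z_next
        z = z_next
    return z

cell = DEQFusionCell(dims=[300, 74, 35], hidden=64)
z_star = deq_fuse(cell, [torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)])
print(z_star.shape)  # torch.Size([8, 64])
```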
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- TACOformer: Token-channel compounded Cross Attention for Multimodal Emotion Recognition [0.951828574518325]
We propose a comprehensive perspective of multimodal fusion that integrates channel-level and token-level cross-modal interactions.
Specifically, we introduce a unified cross attention module called Token-chAnnel COmpound (TACO) Cross Attention.
We also propose a 2D position encoding method to preserve information about the spatial distribution of EEG signal channels.
arXiv Detail & Related papers (2023-06-23T16:28:12Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments show the effectiveness of the proposed model for more accurate AER, yielding state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- Improving Multimodal fusion via Mutual Dependency Maximisation [5.73995120847626]
Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics.
In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities.
We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 points of accuracy) across a large variety of state-of-the-art models.
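As a generic example of such a dependency penalty, the sketch below uses an InfoNCE-style contrastive term between paired modality embeddings: minimising it pushes the two modalities to be mutually predictable. The paper studies several mutual-information-based objectives, so this particular form and its temperature are illustrative assumptions, not the authors' exact penalties.

```python
import torch
import torch.nn.functional as F

def dependency_penalty(z_a, z_b, temperature=0.1):
    # z_a, z_b: (batch, dim) embeddings of the same examples from two modalities.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))         # the i-th pair is the positive
    # Lower loss corresponds to higher estimated dependency between modalities.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Typical use: loss = task_loss + lambda_dep * dependency_penalty(z_text, z_audio)
penalty = dependency_penalty(torch.randn(16, 64), torch.randn(16, 64))
print(penalty.item())
```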
arXiv Detail & Related papers (2021-08-31T06:26:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.