Exchanging-based Multimodal Fusion with Transformer
- URL: http://arxiv.org/abs/2309.02190v1
- Date: Tue, 5 Sep 2023 12:48:25 GMT
- Title: Exchanging-based Multimodal Fusion with Transformer
- Authors: Renyu Zhu, Chengcheng Han, Yong Qian, Qiushi Sun, Xiang Li, Ming Gao,
Xuezhi Cao, Yunsen Xian
- Abstract summary: We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other.
We propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer.
- Score: 19.398692598523454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of multimodal fusion in this paper. Recent
exchanging-based methods have been proposed for vision-vision fusion, which aim
to exchange embeddings learned from one modality to the other. However, most of
them project inputs of multimodalities into different low-dimensional spaces
and cannot be applied to the sequential input data. To solve these issues, in
this paper, we propose a novel exchanging-based multimodal fusion model MuSE
for text-vision fusion based on Transformer. We first use two encoders to
separately map multimodal inputs into different low-dimensional spaces. Then we
employ two decoders to regularize the embeddings and pull them into the same
space. The two decoders capture the correlations between texts and images with
the image captioning task and the text-to-image generation task, respectively.
Further, based on the regularized embeddings, we present CrossTransformer,
which uses two Transformer encoders with shared parameters as the backbone
model to exchange knowledge between multimodalities. Specifically,
CrossTransformer first learns the global contextual information of the inputs
in the shallow layers. After that, it performs inter-modal exchange by
selecting a proportion of tokens in one modality and replacing their embeddings
with the average of embeddings in the other modality. We conduct extensive
experiments to evaluate the performance of MuSE on the Multimodal Named Entity
Recognition task and the Multimodal Sentiment Analysis task. Our results show
the superiority of MuSE against other competitors. Our code and data are
provided at https://github.com/RecklessRonan/MuSE.
Related papers
- CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation [8.874033487493913]
Multimodal emotion recognition in conversation aims to accurately identify emotions in conversational utterances.
We propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components.
Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-15T09:23:02Z) - Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
How to classify long documents with hierarchical structure texts and embedding images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features, and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z) - GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer [44.44603063754173]
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities.
We propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations.
We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process.
arXiv Detail & Related papers (2024-06-03T11:24:15Z) - Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer)
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z) - Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation [9.817888267356716]
multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cues extraction was performed on modalities with strong representation ability.
Feature filters were designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z) - Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a $textbffrozen$ encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.