Switch-BERT: Learning to Model Multimodal Interactions by Switching
Attention and Input
- URL: http://arxiv.org/abs/2306.14182v1
- Date: Sun, 25 Jun 2023 09:28:40 GMT
- Title: Switch-BERT: Learning to Model Multimodal Interactions by Switching
Attention and Input
- Authors: Qingpei Guo, Kaisheng Yao and Wei Chu
- Abstract summary: We present Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch.
Switch-BERT extends the BERT architecture by introducing learnable layer-wise and cross-layer interactions.
Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently achieves better or comparable performance.
- Score: 27.102030262319197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to model intra-modal and inter-modal interactions is fundamental
in multimodal machine learning. The current state-of-the-art models usually
adopt deep learning models with fixed structures. They can achieve exceptional
performance on specific tasks, but face a particularly challenging problem of
modality mismatch because of the diversity of input modalities and their fixed
structures. In this paper, we present \textbf{Switch-BERT} for joint vision and
language representation learning to address this problem. Switch-BERT extends
the BERT architecture by introducing learnable layer-wise and cross-layer
interactions. It learns to optimize attention from a set of attention modes
representing these interactions. One specific property of the model is that it
learns to attend to outputs from various depths, and therefore mitigates the
modality mismatch problem. We present extensive experiments on visual question
answering, image-text retrieval and referring expression comprehension.
Results confirm that, whereas alternative architectures including ViLBERT and
UNITER may excel in particular tasks, Switch-BERT consistently achieves
performance better than or comparable to the current state-of-the-art models
on these tasks. Ablation studies indicate that the proposed model achieves
superior performance due to its ability to learn task-specific multimodal
interactions.
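The abstract describes layers that select an attention mode and can read from outputs at different depths. The following is a minimal, hypothetical PyTorch sketch of that idea, assuming a Gumbel-softmax gate over candidate key/value sources (text-only, vision-only, joint, and an optional lower-layer output); the class name, mode set, and gating mechanism are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchAttentionLayer(nn.Module):
    """One layer that picks among candidate attention modes via a learnable gate."""

    def __init__(self, dim=768, heads=12, num_modes=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One logit per attention mode (text-only, vision-only, joint, lower-layer).
        self.mode_logits = nn.Parameter(torch.zeros(num_modes))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, vision, lower=None):
        # text: (B, Lt, D), vision: (B, Lv, D); lower: optional fused output of an
        # earlier layer, offered as an extra key/value source (cross-layer switching).
        joint = torch.cat([text, vision], dim=1)
        sources = [text, vision, joint] + ([lower] if lower is not None else [])
        # Each candidate attends from the joint sequence to one key/value source.
        candidates = [self.attn(joint, src, src)[0] for src in sources]
        # Differentiable mode selection; a hard (one-hot) sample can be used instead.
        gates = F.gumbel_softmax(self.mode_logits[: len(candidates)], tau=1.0)
        mixed = sum(g * c for g, c in zip(gates, candidates))
        out = self.norm1(joint + mixed)
        out = self.norm2(out + self.ffn(out))
        return out[:, : text.size(1)], out[:, text.size(1):], out
```

Stacking such layers and feeding each layer's fused output forward as the optional lower-layer source lets upper layers attend to whichever depth the gate prefers, which is the mechanism the abstract credits with mitigating modality mismatch.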
Related papers
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - Concrete Subspace Learning based Interference Elimination for Multi-task
Model Fusion [86.6191592951269]
Merging models fine-tuned from a common, extensively pretrained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multitask model that performs well across diverse tasks.
We propose the CONtinuous relaxation of discrete (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to tackle the interference problem without sacrificing performance.
arXiv Detail & Related papers (2023-12-11T07:24:54Z) - MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to gain improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - UNIMO-3: Multi-granularity Interaction for Vision-Language
Representation Learning [35.88753097105914]
We propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's ability for multimodal representation.
arXiv Detail & Related papers (2023-05-23T05:11:34Z) - MultiViz: An Analysis Benchmark for Visualizing and Understanding
Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z) - Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal
Sentiment Analysis [18.4364234071951]
We propose a novel framework HyCon for hybrid contrastive learning of tri-modal representation.
Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning.
Our proposed method outperforms existing works.
arXiv Detail & Related papers (2021-09-04T06:04:21Z) - Does my multimodal model learn cross-modal interactions? It's harder to
tell than you might think! [26.215781778606168]
Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
arXiv Detail & Related papers (2020-10-13T17:45:28Z)
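As a rough illustration of the EMAP projection described in the last entry above, here is a minimal NumPy sketch, assuming a paired image+text evaluation set and that the model's logits can be computed on every (text_i, image_j) combination; function and argument names are placeholders, not the authors' code.

```python
import numpy as np

def emap_logits(pairwise_logits: np.ndarray) -> np.ndarray:
    """Project a multimodal predictor onto the closest multimodally-additive one.

    pairwise_logits[i, j] holds the model's logits for (text_i, image_j),
    shape (N, N, num_classes). Returns EMAP logits for the paired examples (i, i):
        f_emap(i) = mean_j f(t_i, v_j) + mean_j f(t_j, v_i) - mean_{j,k} f(t_j, v_k)
    """
    text_term = pairwise_logits.mean(axis=1)        # (N, C): average over images
    image_term = pairwise_logits.mean(axis=0)       # (N, C): average over texts
    grand_mean = pairwise_logits.mean(axis=(0, 1))  # (C,): overall average
    return text_term + image_term - grand_mean

# If accuracy computed from emap_logits(...) matches the original model's accuracy,
# cross-modal interactions are contributing little on that task.
```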