Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
- URL: http://arxiv.org/abs/2010.06572v1
- Date: Tue, 13 Oct 2020 17:45:28 GMT
- Title: Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
- Authors: Jack Hessel and Lillian Lee
- Abstract summary: Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
- Score: 26.215781778606168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling expressive cross-modal interactions seems crucial in multimodal
tasks, such as visual question answering. However, sometimes high-performing
black-box algorithms turn out to be mostly exploiting unimodal signals in the
data. We propose a new diagnostic tool, empirical multimodally-additive
function projection (EMAP), for isolating whether or not cross-modal
interactions improve performance for a given model on a given task. This
function projection modifies model predictions so that cross-modal interactions
are eliminated, isolating the additive, unimodal structure. For seven
image+text classification tasks (on each of which we set new state-of-the-art
benchmarks), we find that, in many cases, removing cross-modal interactions
results in little to no performance degradation. Surprisingly, this holds even
when expressive models, with capacity to consider interactions, otherwise
outperform less expressive models; thus, performance improvements, even when
present, often cannot be attributed to consideration of cross-modal feature
interactions. We hence recommend that researchers in multimodal machine
learning report the performance not only of unimodal baselines, but also the
EMAP of their best-performing model.
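To make the projection concrete: under the empirical distribution, EMAP replaces each prediction for a (text, image) pair with the mean prediction of that text across all images, plus the mean prediction of that image across all texts, minus the grand mean over all pairings. Below is a minimal sketch of that computation (not the authors' released code); the `score` interface and variable names are illustrative assumptions.

```python
import numpy as np

def emap(pairwise_logits: np.ndarray) -> np.ndarray:
    """Empirical multimodally-additive function projection (EMAP), sketched.

    pairwise_logits: shape (N, N, C); entry [i, j] holds the model's logits
    for text i paired with image j, over C classes.
    Returns shape (N, C): projected logits for the original pairs (i, i),
    i.e. predictions with cross-modal interactions averaged away.
    """
    text_means = pairwise_logits.mean(axis=1)       # (N, C): per-text mean over all images
    image_means = pairwise_logits.mean(axis=0)      # (N, C): per-image mean over all texts
    grand_mean = pairwise_logits.mean(axis=(0, 1))  # (C,):   mean over all pairings
    return text_means + image_means - grand_mean    # additive (unimodal-only) predictions

# Hypothetical usage, assuming `score(text, image)` returns a length-C logit
# vector for an arbitrary pairing drawn from the evaluation set:
#   F = np.stack([[score(t, v) for v in images] for t in texts])  # (N, N, C)
#   emap_preds = emap(F).argmax(axis=-1)
```

Comparing the accuracy of these projected predictions with that of the unprojected model indicates how much of the model's performance actually depends on cross-modal interactions.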
Related papers
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to improve their performance.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input [27.102030262319197]
We present Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch.
Switch-BERT extends BERT architecture by introducing learnable layer-wise and cross-layer interactions.
Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently achieves better or comparable performance.
arXiv Detail & Related papers (2023-06-25T09:28:40Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning [35.88753097105914]
We propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's multimodal representation ability.
arXiv Detail & Related papers (2023-05-23T05:11:34Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning [35.25854322376364]
We show that different data modalities are embedded at arm's length in their shared representation in multi-modal models such as CLIP.
Contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function (one way to measure this gap is sketched after this list).
Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness.
arXiv Detail & Related papers (2022-03-03T22:53:54Z)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? [36.67937514793215]
Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task.
arXiv Detail & Related papers (2022-02-18T15:44:14Z)
- Mutual Modality Learning for Video Action Classification [74.83718206963579]
We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results in the Something-Something-v2 benchmark.
arXiv Detail & Related papers (2020-11-04T21:20:08Z)
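As referenced in the "Mind the Gap" entry above, the modality gap is a measurable quantity. The sketch below computes one common notion of it: the Euclidean distance between the centroids of L2-normalized image and text embeddings (for instance, CLIP features). The function and its interface are illustrative assumptions rather than code from the cited paper.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Distance between the centroids of L2-normalized image and text embeddings
    (an illustrative measure of the modality gap, not the cited paper's code)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Example with random features standing in for paired image/text embeddings:
# gap = modality_gap(np.random.randn(1000, 512), np.random.randn(1000, 512))
```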