MLP Architectures for Vision-and-Language Modeling: An Empirical Study
- URL: http://arxiv.org/abs/2112.04453v1
- Date: Wed, 8 Dec 2021 18:26:19 GMT
- Title: MLP Architectures for Vision-and-Language Modeling: An Empirical Study
- Authors: Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael
Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang
- Abstract summary: We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to MLPs is sufficient to achieve comparable performance to transformers.
- Score: 91.6393550858739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the first empirical study on the use of MLP architectures for
vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks
and 5 robust VQA benchmarks, we find that: (i) Without pre-training, using MLPs
for multimodal fusion has a noticeable performance gap compared to
transformers; (ii) However, VL pre-training can help close the performance gap;
(iii) Instead of heavy multi-head attention, adding tiny one-head attention to
MLPs is sufficient to achieve comparable performance to transformers. Moreover,
we also find that the performance gap between MLPs and transformers is not
widened when being evaluated on the harder robust VQA benchmarks, suggesting
using MLPs for VL fusion can generalize roughly to a similar degree as using
transformers. These results hint that MLPs can effectively learn to align
vision and text features extracted from lower-level encoders without heavy
reliance on self-attention. Based on this, we ask an even bolder question: can
we have an all-MLP architecture for VL modeling, where both VL fusion and the
vision encoder are replaced with MLPs? Our result shows that an all-MLP VL
model is sub-optimal compared to state-of-the-art full-featured VL models when
both of them get pre-trained. However, pre-training an all-MLP can surprisingly
achieve a better average score than full-featured transformer models without
pre-training. This indicates the potential of large-scale pre-training of
MLP-like architectures for VL modeling and inspires the future research
direction on simplifying well-established VL modeling with less inductive
design bias. Our code is publicly available at:
https://github.com/easonnie/mlp-vil
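To make the central design point concrete, the sketch below is an illustrative, assumed approximation (not the authors' released implementation, which is in the linked repository): an MLP fusion block applied to concatenated vision and text tokens, with an optional tiny single-head attention branch standing in for heavy multi-head attention.
```python
# Illustrative sketch only: a gated-MLP-style fusion block with an optional
# tiny single-head attention branch, approximating the design space studied
# in the paper. Module names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class MLPFusionBlock(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, use_tiny_attention: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Channel MLP applied to the concatenated vision+text token sequence.
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        # "Tiny" attention: a single head instead of full multi-head attention.
        self.tiny_attn = (
            nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
            if use_tiny_attention
            else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_vision_tokens + num_text_tokens, dim)
        h = self.norm(x)
        out = self.mlp(h)
        if self.tiny_attn is not None:
            attn_out, _ = self.tiny_attn(h, h, h)
            out = out + attn_out
        return x + out  # residual connection


# Usage: fuse projected vision and text features by concatenating them along
# the token dimension and passing the result through a fusion block.
vision_feats = torch.randn(2, 36, 768)  # e.g. region/patch features
text_feats = torch.randn(2, 20, 768)    # e.g. token embeddings
fusion = MLPFusionBlock(dim=768, hidden_dim=3072)
fused = fusion(torch.cat([vision_feats, text_feats], dim=1))
```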
Related papers
- MLPs Learn In-Context on Regression and Classification Tasks [28.13046236900491]
In-context learning (ICL) is often assumed to be a unique hallmark of Transformer models.
We demonstrate that multi-layer perceptrons (MLPs) can also learn in-context.
arXiv Detail & Related papers (2024-05-24T15:04:36Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - ConvMLP: Hierarchical Convolutional MLPs for Vision [7.874749885641495]
We propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, a light-weight, stage-wise co-design of convolution layers and MLPs.
We show that ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters.
arXiv Detail & Related papers (2021-09-09T17:52:57Z) - Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z)
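For reference, the core unit of the Mixer-style all-MLP designs cited above can be sketched in a few lines. This is a minimal, assumed illustration rather than the reference implementation; layer sizes are arbitrary.
```python
# Minimal sketch of a Mixer-style block: one MLP mixes information across
# patch tokens, another mixes across channels. Names and dimensions are
# illustrative assumptions, not the reference code.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, num_tokens: int, dim: int, token_hidden: int, channel_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(      # mixes across the token axis
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(    # mixes across the channel axis
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing + residual
        x = x + self.channel_mlp(self.norm2(x))    # channel mixing + residual
        return x


# Example: 196 patch tokens (14x14 grid) with 512 channels.
block = MixerBlock(num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048)
out = block(torch.randn(2, 196, 512))
```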