An Empirical Study of Training End-to-End Vision-and-Language
Transformers
- URL: http://arxiv.org/abs/2111.02387v1
- Date: Wed, 3 Nov 2021 17:55:36 GMT
- Title: An Empirical Study of Training End-to-End Vision-and-Language
Transformers
- Authors: Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan
Wang, Chenguang Zhu, Nanyun (Violet) Peng, Zicheng Liu, Michael Zeng
- Abstract summary: We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
- Score: 50.23532518166621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language (VL) pre-training has proven to be highly
effective on various VL downstream tasks. While recent work has shown that
fully transformer-based VL models can be more efficient than previous
region-feature-based methods, their performance on downstream tasks is often
significantly degraded. In this paper, we present METER (Multimodal End-to-end
TransformER), through which we systematically investigate how to design and
pre-train a fully transformer-based VL model in an end-to-end manner.
Specifically, we dissect the model designs along multiple dimensions: vision
encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa,
DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention),
architecture design (e.g., encoder-only vs. encoder-decoder), and pre-training
objectives (e.g., masked image modeling). We conduct comprehensive experiments
on a wide range of VL tasks and provide insights on how to train a performant
VL transformer while maintaining fast inference speed. Notably, METER achieves
an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for
pre-training, surpassing the state-of-the-art region-feature-based VinVL model
by +1.04%, and outperforming the previous best fully transformer-based ALBEF
model by +1.6%.
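For readers who want a concrete picture of the two fusion options contrasted in the abstract, below is a minimal PyTorch sketch, not the METER implementation: the module names, hidden sizes, and simplified blocks (layer norms and feed-forward sublayers are mostly omitted) are illustrative assumptions.

import torch
import torch.nn as nn

class MergedAttentionBlock(nn.Module):
    # Merged attention: text and image tokens are concatenated and processed by
    # a single self-attention layer, so every token attends to both modalities.
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        x = torch.cat([text_tokens, image_tokens], dim=1)    # (B, Lt + Lv, D)
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        return x[:, :text_tokens.size(1)], x[:, text_tokens.size(1):]

class CoAttentionBlock(nn.Module):
    # Co-attention: each modality keeps its own stream; queries come from one
    # modality while keys/values come from the other (feed-forwards omitted).
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        t = text_tokens + self.txt_self(text_tokens, text_tokens, text_tokens)[0]
        v = image_tokens + self.img_self(image_tokens, image_tokens, image_tokens)[0]
        t = t + self.txt_cross(t, v, v)[0]   # text queries attend to image keys/values
        v = v + self.img_cross(v, t, t)[0]   # image queries attend to text keys/values
        return t, v

# Example: 2 captions of 16 tokens and 2 images of 196 patch tokens, dim 768.
txt, img = torch.randn(2, 16, 768), torch.randn(2, 196, 768)
merged_t, merged_v = MergedAttentionBlock()(txt, img)
co_t, co_v = CoAttentionBlock()(txt, img)

Merged attention fuses both modalities with a single set of attention parameters, while co-attention keeps separate streams and exchanges information through cross-attention; the paper compares these design choices empirically.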
Related papers
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes.
We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z)
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.
We present a new advanced VLA architecture derived from Vision-Language Models (VLMs).
We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition, based on the signal-to-noise ratio in the data model, that distinguishes the small test error regime from the large test error regime.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
- VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing [7.890230091463883]
Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval.
We propose a novel Vision-Language Transformer Decomposing (VLDeformer) method that converts the VL transformer into an individual encoder for a single image or text.
arXiv Detail & Related papers (2021-10-20T09:00:51Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs (a rough sketch of the idea follows this list).
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
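The following short sketch, written under illustrative assumptions rather than taken from the CvT reference code, shows one way convolutions can enter a ViT-style pipeline: a strided convolution replaces the non-overlapping linear patch projection, so the resulting tokens carry a local spatial prior. It only covers the token-embedding side of the design.

import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    # A strided convolution produces overlapping patch tokens, in contrast to
    # ViT's non-overlapping linear patch projection.
    def __init__(self, in_ch=3, dim=64, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=kernel, stride=stride,
                              padding=kernel // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, dim, H/stride, W/stride)
        tokens = x.flatten(2).transpose(1, 2)    # (B, num_tokens, dim)
        return self.norm(tokens)

tokens = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 3136, 64])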
This list is automatically generated from the titles and abstracts of the papers on this site.