Training Vision-Language Transformers from Captions
- URL: http://arxiv.org/abs/2205.09256v3
- Date: Wed, 14 Jun 2023 17:37:46 GMT
- Title: Training Vision-Language Transformers from Captions
- Authors: Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alex
Hauptmann, Jianfeng Gao, Yonatan Bisk
- Abstract summary: We introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders.
In a head-to-head comparison between ViLT and our model, we find that our approach outperforms ViLT on standard benchmarks.
- Score: 80.00302205584335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Transformers can be learned without low-level human
labels (e.g., class labels, bounding boxes, etc.). Existing work, whether
explicitly utilizing bounding boxes or patches, assumes that the visual backbone
must first be trained on ImageNet class prediction before being integrated into
a multimodal linguistic pipeline. We show that this is not necessary and
introduce a new model, Vision-Language from Captions (VLC), built on top of
Masked Auto-Encoders that does not require this supervision. In fact, in a
head-to-head comparison between ViLT, the current state-of-the-art patch-based
vision-language transformer pretrained with supervised object classification,
and our model, VLC, we find that our approach (1) outperforms ViLT on standard
benchmarks, (2) provides more interpretable and intuitive patch visualizations,
and (3) is competitive with many larger models that utilize ROIs trained on
annotated bounding boxes.
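To make the setup described in the abstract concrete, below is a minimal PyTorch sketch of a patch-based vision-language Transformer trained from captions: raw image patches and caption tokens are projected into one shared encoder, with no region proposals and no ImageNet-supervised backbone. All class names, dimensions, and the two illustrative objective heads (masked language modelling and image-text matching) are assumptions for exposition, not the paper's exact VLC configuration; in particular, the paper initialises its visual stream from Masked Auto-Encoder pretraining, whereas this sketch uses random initialisation to stay self-contained.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them linearly,
    with no ImageNet-supervised CNN backbone involved."""

    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.proj(images)                         # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)


class CaptionVLTransformer(nn.Module):
    """Single multimodal encoder over [caption tokens ; image patches].

    Hypothetical sketch: a real run would load MAE-pretrained ViT weights
    for the visual stream; here everything is randomly initialised."""

    def __init__(self, vocab_size=30522, dim=768, depth=2, heads=12, max_text_len=40):
        super().__init__()
        self.patch_embed = PatchEmbed(dim=dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        self.patch_pos = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches, dim))
        self.type_embed = nn.Embedding(2, dim)        # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlm_head = nn.Linear(dim, vocab_size)    # masked-language-modelling logits
        self.itm_head = nn.Linear(dim, 2)             # image-text matching logits

    def forward(self, token_ids, images):
        B, L = token_ids.shape
        text = self.token_embed(token_ids) + self.text_pos[:, :L]
        text = text + self.type_embed(
            torch.zeros(B, L, dtype=torch.long, device=token_ids.device)
        )
        patches = self.patch_embed(images) + self.patch_pos
        patches = patches + self.type_embed(
            torch.ones(B, patches.size(1), dtype=torch.long, device=images.device)
        )
        h = self.encoder(torch.cat([text, patches], dim=1))
        return self.mlm_head(h[:, :L]), self.itm_head(h[:, 0])


# Smoke test with random inputs (shapes only; no pretrained weights or data).
model = CaptionVLTransformer()
tokens = torch.randint(0, 30522, (2, 40))
images = torch.randn(2, 3, 224, 224)
mlm_logits, itm_logits = model(tokens, images)
print(mlm_logits.shape, itm_logits.shape)             # (2, 40, 30522) and (2, 2)
```

The sketch only checks that caption tokens and image patches flow through one shared encoder; an actual caption-only training run would swap in MAE-pretrained weights for the patch projection and encoder and add the remaining pretraining objectives.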
Related papers
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Masked Vision-Language Transformers for Scene Text Recognition [10.057137581956363]
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes.
Recent STR models benefit from taking linguistic information in addition to visual cues into consideration.
We propose a novel Masked Vision-Language Transformer (MVLT) to capture both explicit and implicit linguistic information.
arXiv Detail & Related papers (2022-11-09T10:28:23Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers [47.581265194864585]
Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
arXiv Detail & Related papers (2022-03-30T05:25:35Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)