SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- URL: http://arxiv.org/abs/2108.10904v1
- Date: Tue, 24 Aug 2021 18:14:00 GMT
- Title: SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- Authors: Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
- Abstract summary: We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
- Score: 48.98275876458666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With recent progress in joint modeling of visual and textual representations,
Vision-Language Pretraining (VLP) has achieved impressive performance on many
multimodal downstream tasks. However, the requirement for expensive annotations
including clean image captions and regional labels limits the scalability of
existing approaches, and complicates the pretraining procedure with the
introduction of multiple dataset-specific objectives. In this work, we relax
these constraints and present a minimalist pretraining framework, named Simple
Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training
complexity by exploiting large-scale weak supervision, and is trained
end-to-end with a single prefix language modeling objective. Without utilizing
extra data or task-specific customization, the resulting model significantly
outperforms previous pretraining methods and achieves new state-of-the-art
results on a wide range of discriminative and generative vision-language
benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE
(+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
Furthermore, we demonstrate that SimVLM acquires strong generalization and
transfer ability, enabling zero-shot behavior including open-ended visual
question answering and cross-modality transfer.
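To make the objective named in the abstract concrete, the sketch below illustrates what a single prefix language modeling (PrefixLM) loss over weakly aligned image-text pairs can look like: image features form a bidirectionally attended prefix, and the noisy alt-text suffix is predicted autoregressively. This is a minimal illustration only; the toy model, layer sizes, and patch-feature inputs are assumptions, not the SimVLM architecture.

```python
# Minimal sketch of a PrefixLM-style objective (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPrefixLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_patches=16):
        super().__init__()
        self.patch_proj = nn.Linear(768, d_model)        # image patch features -> prefix embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # text token embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.n_patches = n_patches

    def forward(self, patches, text_ids):
        # One sequence: image prefix followed by text tokens.
        x = torch.cat([self.patch_proj(patches), self.tok_emb(text_ids)], dim=1)
        L, P = x.size(1), self.n_patches
        # PrefixLM attention mask: the prefix is fully visible (bidirectional),
        # the text part is causal; True entries are blocked.
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        mask[:, :P] = False
        h = self.encoder(x, mask=mask)
        return self.lm_head(h)


def prefix_lm_loss(model, patches, text_ids):
    # Predict each text token from the image prefix and the preceding text tokens.
    logits = model(patches, text_ids)[:, model.n_patches:-1, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           text_ids[:, 1:].reshape(-1))


if __name__ == "__main__":
    model = ToyPrefixLM()
    patches = torch.randn(2, 16, 768)            # stand-in for weakly supervised image features
    text_ids = torch.randint(0, 1000, (2, 12))   # stand-in for noisy alt-text tokens
    print(prefix_lm_loss(model, patches, text_ids).item())
```

The point mirrored here is the abstract's claim of simplicity: no masked-region, image-text matching, or other dataset-specific objectives are involved; the cross-entropy on the text suffix is the only training loss.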
Related papers
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [49.407311947143825]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
We also propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP).
arXiv Detail & Related papers (2024-10-10T17:59:22Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis [25.482853330324748]
Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years.
Previous approaches either (i) use separately pre-trained visual and textual models, which ignore the cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks.
We propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pretraining and downstream tasks.
arXiv Detail & Related papers (2022-04-17T08:44:00Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD)
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning [11.339580074756189]
MAGMA is a simple method for augmenting generative language models with additional modalities using adapter-based finetuning.
We train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input.
arXiv Detail & Related papers (2021-12-09T23:58:45Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)