Prefix Language Models are Unified Modal Learners
- URL: http://arxiv.org/abs/2206.07699v1
- Date: Wed, 15 Jun 2022 17:49:38 GMT
- Title: Prefix Language Models are Unified Modal Learners
- Authors: Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang
- Abstract summary: We show that a unified modal model can be learned with a prefix language modeling objective on text and image sequences.
Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge amounts of data, and adaptable to a variety of downstream tasks.
- Score: 30.666873206462295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of vision-language pre-training, we have witnessed the state of the art being pushed forward in multi-modal understanding and generation. However, the current pre-training paradigm is either incapable of targeting all modalities at once (e.g., text generation and image generation), or requires multiple carefully designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective on text and image sequences. Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge amounts of data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks, and outperforms previous unified vision-language models on most tasks, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%), and COCO image generation (IS +0.9%, FID -1.0%), at comparable model and data scales. Furthermore, we offer a well-defined benchmark for future research by reporting performance across different pre-training dataset scales with heterogeneous and wide distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulties of comparing VLP models more generally.
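To make the pre-training objective concrete, the following is a minimal, illustrative sketch of a prefix language modeling loss over a single concatenated token sequence: positions in the prefix attend to each other bidirectionally, the suffix is predicted autoregressively, and the loss is computed only on suffix tokens. The function names, tensor shapes, and framework choice (PyTorch) are assumptions for illustration, not the DaVinci implementation.

```python
# A minimal sketch of a prefix LM objective; names and shapes are illustrative.
import torch
import torch.nn.functional as F

def prefix_lm_attention_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Bidirectional attention within the prefix, causal attention over the suffix.
    Returns a [total_len, total_len] bool mask where True means 'may attend'."""
    mask = torch.tril(torch.ones(total_len, total_len)).bool()
    mask[:, :prefix_len] = True  # every position may attend to the full prefix
    return mask

def prefix_lm_loss(logits: torch.Tensor, tokens: torch.Tensor, prefix_len: int) -> torch.Tensor:
    """Cross-entropy computed only on the suffix tokens (the generated part)."""
    # logits: [batch, seq_len, vocab_size]; tokens: [batch, seq_len]
    pred = logits[:, prefix_len - 1 : -1, :]  # positions that predict suffix tokens
    target = tokens[:, prefix_len:]           # the suffix tokens themselves
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```

In this spirit, an image-to-text task would place the image tokens in the prefix and compute the loss on the caption tokens, while text-to-image generation would do the reverse, so a single objective can cover both generation directions.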
Related papers
- UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models for pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
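As a rough, illustrative sketch of the recipe described above (freeze the pretrained LM, train a single linear projection from visual features into the LM embedding space, and prepend one learnable token), something like the following could be used; the class, argument names, and the Hugging-Face-style `inputs_embeds` interface are assumptions, not the authors' code.

```python
# Illustrative sketch of a frozen-LM adaptation with one trained projection
# and one trainable prepended token; all names are hypothetical.
import torch
import torch.nn as nn

class FrozenLMWithVisualPrefix(nn.Module):
    def __init__(self, language_model: nn.Module, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = language_model
        for p in self.lm.parameters():           # freeze (nearly) all LM parameters
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, lm_dim)   # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor):
        # visual_feats: [B, N, vis_dim] from a (frozen) vision encoder
        # text_embeds:  [B, T, lm_dim] token embeddings of the text input
        b = visual_feats.size(0)
        prefix = torch.cat(
            [self.soft_token.expand(b, -1, -1), self.proj(visual_feats)], dim=1
        )
        inputs = torch.cat([prefix, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs)     # assumes an HF-style LM interface
```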
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)