VC-GPT: Visual Conditioned GPT for End-to-End Generative
Vision-and-Language Pre-training
- URL: http://arxiv.org/abs/2201.12723v1
- Date: Sun, 30 Jan 2022 04:44:54 GMT
- Title: VC-GPT: Visual Conditioned GPT for End-to-End Generative
Vision-and-Language Pre-training
- Authors: Ziyang Luo, Yadong Xi, Rongsheng Zhang, Jing Ma
- Abstract summary: Vision-and-language pre-training models (VLMs) have achieved tremendous success in the cross-modal area, but most of them require millions of parallel image-caption pairs for pre-training.
In this work, we focus on reducing this need for generative vision-and-language pre-training by taking advantage of a visual pre-trained model (CLIP-ViT) as the encoder and a language pre-trained model (GPT2) as the decoder.
- Score: 9.511101155155957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language pre-training models (VLMs) have achieved tremendous
success in the cross-modal area, but most of them require millions of parallel
image-caption pairs for pre-training. Collecting such data is expensive and
labor-intensive. In this work, we focus on reducing this need for generative
vision-and-language pre-training (G-VLP) by taking advantage of a visual
pre-trained model (CLIP-ViT) as the encoder and a language pre-trained model
(GPT2) as the decoder. Unfortunately, GPT2 lacks the cross-attention module
needed to attend to visual features, which hinders a direct connection between
CLIP-ViT and GPT2. To remedy this, we conduct extensive experiments to
empirically investigate how to design and pre-train our model. Based on the
experimental results, we propose a novel G-VLP framework, Visual Conditioned
GPT (VC-GPT), and pre-train it on a small-scale parallel image-caption corpus
(Visual Genome, only 110k distinct images). Evaluated on image-captioning
downstream tasks (MSCOCO and Flickr30k Captioning), VC-GPT achieves either the
best or second-best performance across all evaluation metrics, compared with
previous works that consume around 30 times more parallel data during
pre-training.
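The abstract describes a generic encoder-decoder wiring: a pre-trained CLIP-ViT as the visual encoder and GPT2 as the language decoder, with the complication that vanilla GPT2 ships without cross-attention. The sketch below illustrates that naive connection with Hugging Face transformers by enabling cross-attention in GPT2 and feeding it CLIP's patch features. The checkpoint names and shapes are illustrative assumptions, and this is not the paper's actual VC-GPT architecture, which the authors design from their own ablation experiments.

```python
# Minimal sketch (not the paper's exact VC-GPT design): connect a CLIP-ViT
# encoder to a GPT2 decoder by enabling cross-attention in GPT2.
# Checkpoint names, shapes, and the dummy inputs are illustrative assumptions.
import torch
from transformers import CLIPVisionModel, GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Visual encoder: CLIP ViT-B/32 vision tower (hidden size 768, same as GPT2).
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Vanilla GPT2 has no cross-attention; enabling it adds randomly initialized
# cross-attention layers that must then be trained on image-caption pairs.
decoder_config = GPT2Config.from_pretrained(
    "gpt2", add_cross_attention=True, is_decoder=True
)
decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=decoder_config)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Encode one (dummy) image: 224x224 with 32x32 patches -> 49 patches + [CLS] = 50 tokens.
pixel_values = torch.randn(1, 3, 224, 224)
visual_feats = encoder(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768)

# Caption tokens attend to the visual features through cross-attention;
# passing the input ids as labels yields the standard captioning LM loss.
caption = tokenizer("a photo of a dog", return_tensors="pt")
outputs = decoder(
    input_ids=caption.input_ids,
    encoder_hidden_states=visual_feats,
    labels=caption.input_ids,
)
print(outputs.loss)
```

Any newly added connection weights in such a setup start randomly initialized, which is why some amount of parallel image-caption pre-training (Visual Genome in the paper) is still required.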
Related papers
- Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations but also improve model performance on standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Unsupervised Prompt Learning for Vision-Language Models [12.259694415428026]
We propose an unsupervised prompt learning (UPL) framework to improve the zero-shot transfer of CLIP-like vision-language models.
An enhanced version of UPL is even on par with the 8-shot CoOp and the 8-shot TIP-Adapter on most datasets.
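(A minimal sketch of the handcrafted-prompt zero-shot CLIP baseline that such prompt-learning methods build on appears after this list.)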
arXiv Detail & Related papers (2022-04-07T17:59:57Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based
Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images
and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
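The unsupervised prompt learning entry above builds on the standard zero-shot transfer recipe for CLIP-like models, in which handcrafted text prompts such as "a photo of a {class}" score each image against every class name. The sketch below shows only that handcrafted-prompt baseline, not UPL's unsupervised prompt optimization; the checkpoint name, label set, and prompt template are illustrative assumptions.

```python
# Handcrafted-prompt zero-shot CLIP baseline (the starting point that
# prompt-learning methods such as UPL aim to improve -- not UPL itself).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "car"]                          # hypothetical label set
prompts = [f"a photo of a {name}" for name in class_names]   # handcrafted template

image = Image.new("RGB", (224, 224))                         # stand-in for a real image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image      # image-text similarity
probs = logits_per_image.softmax(dim=-1)                     # class probabilities
print(dict(zip(class_names, probs[0].tolist())))
```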
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.