Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
- URL: http://arxiv.org/abs/2203.06386v1
- Date: Sat, 12 Mar 2022 09:33:37 GMT
- Title: Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
- Authors: Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
- Abstract summary: We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
- Score: 79.72299298976525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent large-scale vision-language pre-training (VLP) of dual-stream
architectures (e.g., CLIP) on a tremendous amount of image-text pairs has shown
its superiority on various multimodal alignment tasks. Despite this success, the
resulting models cannot perform multimodal generative tasks due to the weak
text encoder. To tackle this problem, we propose to augment the dual-stream VLP
model with a textual pre-trained language model (PLM) via vision-language
knowledge distillation (VLKD), enabling the capability for multimodal
generation. VLKD is highly data- and computation-efficient compared to
pre-training from scratch. Experimental results show that the resulting
model has strong zero-shot performance on multimodal generation tasks, such as
open-ended visual question answering and image captioning. For example, it
achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous
state-of-the-art zero-shot model with $7\times$ fewer parameters. Furthermore,
the original textual language understanding and generation ability of the PLM
is maintained after VLKD, which makes our model versatile for both multimodal
and unimodal tasks.
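As a concrete illustration of the distillation idea in the abstract, the sketch below shows one way a PLM encoder could be aligned with a frozen CLIP embedding space: a projection head maps pooled PLM features into CLIP's space, a distance term pulls them toward CLIP's text embeddings, and a contrastive term aligns them with CLIP's image embeddings. This is a minimal sketch under assumed names and a simplified loss mix (`VLKDAligner`, mean pooling, a single linear projection), not the paper's exact training recipe.

```python
# Minimal sketch of VLKD-style alignment (illustrative; not the paper's exact objectives).
# Assumptions: `plm_encoder` returns token features of shape [B, L, plm_dim]; the CLIP
# text/image embeddings are precomputed by a frozen CLIP model outside this module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLKDAligner(nn.Module):
    def __init__(self, plm_encoder: nn.Module, plm_dim: int, clip_dim: int):
        super().__init__()
        self.plm_encoder = plm_encoder            # trainable PLM encoder (e.g., a BART encoder)
        self.proj = nn.Linear(plm_dim, clip_dim)  # projects PLM features into CLIP's space

    def forward(self, plm_inputs, clip_text_emb, clip_image_emb, tau=0.07):
        # Sentence-level PLM representation via mean pooling over token features.
        plm_feat = self.plm_encoder(plm_inputs).mean(dim=1)        # [B, plm_dim]
        plm_emb = F.normalize(self.proj(plm_feat), dim=-1)         # [B, clip_dim]
        clip_text_emb = F.normalize(clip_text_emb, dim=-1)         # frozen teacher targets
        clip_image_emb = F.normalize(clip_image_emb, dim=-1)

        # (1) Distillation term: pull PLM embeddings toward CLIP's text embeddings.
        distill_loss = (1.0 - (plm_emb * clip_text_emb).sum(dim=-1)).mean()

        # (2) Contrastive term: matching image-text pairs sit on the diagonal.
        logits = plm_emb @ clip_image_emb.t() / tau                # [B, B]
        targets = torch.arange(logits.size(0), device=logits.device)
        contrast_loss = F.cross_entropy(logits, targets)

        return distill_loss + contrast_loss
```

A design choice consistent with the abstract's efficiency claim, and one assumed here, is to keep the CLIP side frozen and update only the PLM and the projection head, so the alignment stage never back-propagates through the large vision tower.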
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model can be learned with a prefix language modeling objective over text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI's CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, BriVL can incorporate many more negative samples with limited GPU resources (a minimal sketch of this queue mechanism follows this list).
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
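For the queue-based dictionary mentioned in the BriVL/WenLan entry above, the sketch below illustrates the general MoCo-style mechanism in a cross-modal form: a fixed-size memory queue stores image embeddings from earlier batches, so each text query is contrasted against far more negatives than a single mini-batch provides. The class and function names, queue size, and temperature are illustrative assumptions, not BriVL's released implementation.

```python
# Illustrative MoCo-style cross-modal contrastive loss with a memory queue.
import torch
import torch.nn.functional as F

class MemoryQueue:
    """Fixed-size FIFO queue of normalized key embeddings used as extra negatives."""
    def __init__(self, dim: int, size: int = 8192):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys: torch.Tensor):
        n = new_keys.size(0)
        idx = (torch.arange(n, device=self.keys.device) + self.ptr) % self.keys.size(0)
        self.keys[idx] = F.normalize(new_keys, dim=-1)
        self.ptr = (self.ptr + n) % self.keys.size(0)

def cross_modal_moco_loss(text_q, image_k, queue: MemoryQueue, tau: float = 0.07):
    """InfoNCE with one positive (the matching image) and all queued embeddings as negatives."""
    q = F.normalize(text_q, dim=-1)                         # text queries    [B, D]
    k = F.normalize(image_k, dim=-1)                        # image keys      [B, D]
    l_pos = (q * k).sum(dim=-1, keepdim=True)               # positive logits [B, 1]
    l_neg = q @ queue.keys.t()                              # queued negatives [B, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive is index 0
    loss = F.cross_entropy(logits, labels)
    queue.enqueue(k.detach())                               # refresh the queue with current image keys
    return loss
```

Because only low-dimensional embeddings are stored, the queue can hold thousands of negatives without keeping the corresponding images in GPU memory, which is the point the BriVL summary makes about limited GPU resources.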
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.