WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training
- URL: http://arxiv.org/abs/2103.06561v2
- Date: Sat, 13 Mar 2021 07:52:50 GMT
- Title: WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training
- Authors: Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing
Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi,
Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang,
Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu
Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng
Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, and Ji-Rong Wen
- Abstract summary: We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources.
- Score: 71.37731379031487
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal pre-training models have been intensively explored to bridge
vision and language in recent years. However, most of them explicitly model the
cross-modal interaction between image-text pairs, assuming that there exists a
strong semantic correlation between the text and image modalities. Since this
strong assumption is often invalid in real-world scenarios, we choose to
implicitly model the cross-modal correlation for large-scale multi-modal
pre-training, which is the focus of the Chinese project 'WenLan' led by our
team. Specifically, with the weak correlation assumption over image-text pairs,
we propose a two-tower pre-training model called BriVL within the cross-modal
contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple
contrastive learning method, we devise a more advanced algorithm by adapting
the latest method, MoCo, to the cross-modal scenario. By building a large
queue-based dictionary, our BriVL can incorporate more negative samples with
limited GPU resources. We further construct a large Chinese multi-source
image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
Extensive experiments demonstrate that the pre-trained BriVL model outperforms
both UNITER and OpenAI CLIP on various downstream tasks.
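The following is a minimal PyTorch-style sketch of the MoCo-style cross-modal contrastive setup the abstract describes: two query towers, two momentum-updated key towers, and queue-based dictionaries of negative embeddings. The encoder stubs, embedding dimension, queue size, momentum, and temperature values are illustrative assumptions, not the released BriVL implementation.

```python
# Minimal sketch (assumed details, not the released BriVL code) of MoCo-style
# cross-modal contrastive learning with queue-based negative dictionaries.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMoCo(nn.Module):
    def __init__(self, img_encoder, txt_encoder, dim=256, queue_size=8192,
                 momentum=0.99, temperature=0.07):
        super().__init__()
        # query towers (updated by back-propagation)
        self.img_q, self.txt_q = img_encoder, txt_encoder
        # key towers (updated by momentum, no gradients)
        self.img_k, self.txt_k = copy.deepcopy(img_encoder), copy.deepcopy(txt_encoder)
        for p in list(self.img_k.parameters()) + list(self.txt_k.parameters()):
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        # queues of L2-normalized key embeddings from past batches: the negative
        # pool is therefore much larger than the current mini-batch
        self.register_buffer("img_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("txt_queue", F.normalize(torch.randn(dim, queue_size), dim=0))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in ((self.img_q, self.img_k), (self.txt_q, self.txt_k)):
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    def forward(self, images, texts):
        self._momentum_update()
        q_img = F.normalize(self.img_q(images), dim=1)   # (B, dim) image queries
        q_txt = F.normalize(self.txt_q(texts), dim=1)    # (B, dim) text queries
        with torch.no_grad():
            k_img = F.normalize(self.img_k(images), dim=1)
            k_txt = F.normalize(self.txt_k(texts), dim=1)
        # image->text: the positive is the paired text key; negatives come from the text queue
        logits_i2t = torch.cat([(q_img * k_txt).sum(1, keepdim=True),
                                q_img @ self.txt_queue], dim=1) / self.t
        # text->image: the symmetric direction, scored against the image queue
        logits_t2i = torch.cat([(q_txt * k_img).sum(1, keepdim=True),
                                q_txt @ self.img_queue], dim=1) / self.t
        labels = torch.zeros(images.size(0), dtype=torch.long, device=q_img.device)
        loss = F.cross_entropy(logits_i2t, labels) + F.cross_entropy(logits_t2i, labels)
        # FIFO queue update (enqueue k_img / k_txt, drop the oldest columns) omitted for brevity
        return loss
```

The queues decouple the number of negatives from the mini-batch size, which is how MoCo-style training typically realizes the abstract's point about incorporating more negatives under limited GPU resources.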
Related papers
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based
Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows that each granularity helps to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model has a strong capability for modeling the interaction between the information flows of different modalities (see the sketch after this list).
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)