Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset
and A Foundation Framework
- URL: http://arxiv.org/abs/2202.06767v1
- Date: Mon, 14 Feb 2022 14:37:15 GMT
- Title: Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset
and A Foundation Framework
- Authors: Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Hang Xu,
Xiaodan Liang, Wei Zhang, Xin Jiang, Chunjing Xu
- Abstract summary: This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
- Score: 99.38817546900405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a large-scale Chinese cross-modal dataset for
benchmarking different multi-modal pre-training methods, in order to facilitate
Vision-Language Pre-training (VLP) research and community development. Recent
dual-stream VLP models such as CLIP, ALIGN, and FILIP have shown remarkable
performance on various downstream tasks, as well as strong zero-shot ability on
open-domain tasks. However, their success heavily relies on the scale of the
pre-training datasets. Although there are small-scale English vision-language
datasets such as Flickr30k and CC12M, as well as the large-scale LAION-400M,
the community still lacks large-scale vision-language benchmarks in Chinese,
hindering the development of broader multilingual applications. Moreover, very
few large-scale Chinese cross-modal pre-training datasets have been publicly
released, making it hard to use pre-trained models as services for downstream
tasks. In this work, we release a large-scale Chinese cross-modal dataset named
Wukong, containing 100 million Chinese image-text pairs collected from the web.
Furthermore, we release a group of large models pre-trained with advanced image
encoders (ResNet/ViT/SwinT) and different pre-training methods (CLIP/FILIP/LiT).
We provide extensive experiments, a deep benchmarking of different downstream
tasks, and some exciting findings. Experiments show that Wukong can serve as a
promising Chinese pre-training dataset and benchmark for different cross-modal
learning methods, yielding superior performance on various downstream tasks
such as zero-shot image classification and image-text retrieval. More
information can be found at https://wukong-dataset.github.io/wukong-dataset/.
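The dual-stream models named above (CLIP, FILIP, LiT) all train an image encoder and a text encoder with a symmetric contrastive objective over in-batch image-text pairs. The following is a minimal PyTorch sketch of that objective; the embedding dimension and temperature are placeholders, not the released Wukong configuration.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_features: torch.Tensor,
                                text_features: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    The i-th image and i-th text are the only positive pair for row/column i;
    every other in-batch combination acts as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged as in CLIP.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

LiT differs mainly in keeping the image encoder frozen during this step, and FILIP replaces the single global dot product with a token-wise (fine-grained) similarity; the overall loss shape is otherwise shared.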
Related papers
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
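Both the Wukong abstract and the Chinese CLIP entry above evaluate zero-shot transfer of dual-encoder models. The usual recipe for zero-shot image classification turns each Chinese class name into a prompt and picks the class whose text embedding is closest to the image embedding. A minimal sketch, assuming a generic dual encoder with hypothetical `encode_image`/`encode_text` methods, a tokenizer callable, and an illustrative prompt template (none of these names come from a specific released API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image: torch.Tensor,
                       class_names: list[str],
                       template: str = "一张{}的照片") -> int:
    """Zero-shot classification with Chinese prompts.

    `model` is assumed to expose encode_image / encode_text returning
    (n, dim) embeddings; `tokenizer` turns strings into model inputs.
    Returns the index of the best-matching class name.
    """
    # Build one prompt per class, e.g. "一张猫的照片" ("a photo of a cat").
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)

    # Encode the image (adding a batch dimension) and normalize.
    img_emb = F.normalize(model.encode_image(image.unsqueeze(0)), dim=-1)

    # Cosine similarity between the image and every class prompt.
    sims = (img_emb @ text_emb.t()).squeeze(0)
    return int(sims.argmax().item())
```

Zero-shot image-text retrieval works the same way, except the candidate set is a gallery of captions or images rather than class prompts.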
- CCMB: A Large-scale Chinese Cross-modal Benchmark [46.349966178044184]
We build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community.
Its pre-training dataset, Zero, contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also the largest ones for Chinese cross-modal downstream tasks.
arXiv Detail & Related papers (2022-05-08T13:19:23Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million of these pairs are collected from webpages in which the image and its caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)
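The WuDaoMM entry above separates weakly correlated web pairs from a strongly correlated subset, but the summary does not say how correlation is measured. Purely as an illustration (not the WuDaoMM pipeline), one common heuristic is to threshold the cosine similarity produced by an already pre-trained dual encoder; the encoder interface and threshold below are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_by_correlation(model, tokenizer, images, captions,
                         threshold: float = 0.3):
    """Illustrative filter: pairs scoring above `threshold` are treated as
    strongly correlated, the rest as weakly correlated. The scoring model
    and threshold are assumptions, not the WuDaoMM construction procedure."""
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)

    # Per-pair cosine similarity (the diagonal of the pairwise matrix).
    scores = (img_emb * txt_emb).sum(dim=-1).tolist()

    strong = [i for i, s in enumerate(scores) if s >= threshold]
    weak = [i for i, s in enumerate(scores) if s < threshold]
    return strong, weak
```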
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
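The VLKD entry above transfers the multimodal alignment of a dual-stream model into a text-only PLM so that the PLM gains multimodal ability while keeping its language skills. The paper defines its own distillation objectives; the sketch below shows only one generic ingredient under assumed module names, in which a trainable PLM's caption embedding is pulled toward the frozen text-encoder embedding of the same caption.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TextFeatureDistiller(nn.Module):
    """Rough sketch of one distillation term: align a trainable PLM's caption
    embedding with a frozen CLIP-style text encoder's embedding. The module
    names and the choice of loss are assumptions for illustration only."""

    def __init__(self, plm: nn.Module, clip_text_encoder: nn.Module,
                 plm_dim: int, clip_dim: int):
        super().__init__()
        self.plm = plm                           # trainable student
        self.teacher = clip_text_encoder.eval()  # frozen teacher
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # Project the student embedding into the teacher's embedding space.
        self.proj = nn.Linear(plm_dim, clip_dim)

    def forward(self, plm_inputs, clip_inputs) -> torch.Tensor:
        # Assume both encoders return pooled (batch, dim) sentence embeddings.
        student = F.normalize(self.proj(self.plm(plm_inputs)), dim=-1)
        with torch.no_grad():
            teacher = F.normalize(self.teacher(clip_inputs), dim=-1)
        # Minimize 1 - cosine similarity between student and teacher.
        return (1 - (student * teacher).sum(dim=-1)).mean()
```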
- ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources [13.30815073857842]
We provide comprehensive training guidance that allows dual-encoder multi-modal representation alignment to be conducted with limited resources.
We collect 100M web data for pre-training and achieve results comparable or superior to state-of-the-art methods.
Our code and pre-trained models will be released to facilitate the research community.
arXiv Detail & Related papers (2021-12-17T05:40:28Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI's CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples under limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
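BriVL's main departure from plain in-batch contrastive learning, as described in the entry above, is a MoCo-style queue that stores features from earlier batches so that far more negatives are available than GPU memory would otherwise allow. A minimal sketch of the queue mechanics follows; the queue size, temperature, and single-direction layout are illustrative, not BriVL's exact configuration.

```python
import torch
import torch.nn.functional as F

class CrossModalQueue:
    """MoCo-style feature queue: keeps the most recent `queue_size` text
    features to serve as extra negatives for images (a mirrored queue would
    do the same for the text-to-image direction). Kept on CPU for simplicity."""

    def __init__(self, dim: int, queue_size: int = 16384):
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor) -> None:
        """Overwrite the oldest entries with the newest batch of features."""
        n = feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.queue.size(0)
        self.queue[idx] = F.normalize(feats, dim=-1)
        self.ptr = int((self.ptr + n) % self.queue.size(0))

    def contrastive_logits(self, image_feats: torch.Tensor,
                           pos_text_feats: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
        """Column 0 holds the positive pair; the rest are queue negatives."""
        image_feats = F.normalize(image_feats, dim=-1)
        pos = (image_feats * F.normalize(pos_text_feats, dim=-1)).sum(-1, keepdim=True)
        neg = image_feats @ self.queue.t()
        return torch.cat([pos, neg], dim=1) / temperature
```

The target for every row is index 0 (the positive column), so a training step amounts to `F.cross_entropy(logits, torch.zeros(batch_size, dtype=torch.long))`; as in MoCo, the queued features come from a momentum-updated encoder rather than the trainable one.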
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)