WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
- URL: http://arxiv.org/abs/2203.11480v1
- Date: Tue, 22 Mar 2022 06:12:20 GMT
- Title: WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
- Authors: Sha Yuan, Shuai Zhao, Jiahong Leng, Zhao Xue, Hanyu Zhao and Jie Tang
- Abstract summary: We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and caption present only weak correlation.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
- Score: 2.603259641572195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer-stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are usually fed with a combination of multiple public datasets to meet the demand for large-scale training data. However, due to the unevenness of data distribution, including size, task type and quality, using a mixture of multiple datasets for model training can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total. Specifically, about 600 million pairs are collected from webpages in which the image and caption present only weak correlation, while the other 50 million strongly correlated image-text pairs are collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the dataset's effectiveness. The results show that WuDaoMM can serve as an efficient dataset for VLPMs, especially for models on the text-to-image generation task. The data is released at https://data.wudaoai.cn
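As a rough illustration of how a corpus organized this way might feed a VLPM training loop, the sketch below iterates over image-text pairs with PyTorch. The TSV layout, file name, and transforms are illustrative assumptions, not the released WuDaoMM format:

```python
# Minimal sketch: iterating over image-text pairs for VL pre-training.
# Assumes a TSV with "image_path<TAB>caption" rows; this layout is an
# illustrative assumption, not the format actually released with WuDaoMM.
import csv
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class ImageTextPairs(Dataset):
    def __init__(self, tsv_path, image_size=224):
        with open(tsv_path, newline="", encoding="utf-8") as f:
            self.rows = [r for r in csv.reader(f, delimiter="\t") if len(r) == 2]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        image_path, caption = self.rows[idx]
        image = self.transform(Image.open(image_path).convert("RGB"))
        return image, caption


# Batches of (image tensor, caption string) ready for a VLPM training loop.
loader = DataLoader(ImageTextPairs("wudaomm_base.tsv"), batch_size=64,
                    shuffle=True, num_workers=4)
```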
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that filtering such data with CLIP similarity scores suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs (a generic sketch of this style of caption-based pruning is given after this list).
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
- RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing [26.71560933421903]
We propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM).
We present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions.
arXiv Detail & Related papers (2023-06-20T05:30:59Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English-language text.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt enables several new capabilities.
Fine-tuning trained models to new samples can be achieved by simply adding them to the retrieval table.
Our diffusion-based model trains on images only, by leveraging a joint Text-Image multi-modal metric.
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
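Returning to the Sieve entry above: the sketch below illustrates the general idea of caption-based pruning, i.e. scoring each web-crawled pair by the agreement between a synthetic caption and the original alt-text and keeping the best-aligned pairs. It is a hedged sketch only, not the paper's exact procedure; `caption_image` and `embed_text` are hypothetical placeholders for a pretrained captioning model and a text encoder, and the keep fraction is arbitrary.

```python
# Hedged sketch of caption-based pruning (in the spirit of Sieve, not its
# exact method). caption_image() and embed_text() are hypothetical stand-ins
# for a pretrained image-captioning model and a text encoder.
from typing import Callable, List, Tuple
import torch
import torch.nn.functional as F


def prune_pairs(
    pairs: List[Tuple[torch.Tensor, str]],          # (image tensor, alt-text)
    caption_image: Callable[[torch.Tensor], str],   # hypothetical captioner
    embed_text: Callable[[str], torch.Tensor],      # hypothetical text encoder
    keep_fraction: float = 0.5,
) -> List[Tuple[torch.Tensor, str]]:
    """Keep the pairs whose alt-text best agrees with a synthetic caption."""
    scores = []
    for image, alt_text in pairs:
        synthetic = caption_image(image)
        sim = F.cosine_similarity(
            embed_text(synthetic), embed_text(alt_text), dim=-1
        ).item()
        scores.append(sim)
    # Rank by similarity and retain the top fraction of pairs.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    keep = order[: max(1, int(len(pairs) * keep_fraction))]
    return [pairs[i] for i in keep]
```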
This list is automatically generated from the titles and abstracts of the papers on this site.