WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
- URL: http://arxiv.org/abs/2203.11480v1
- Date: Tue, 22 Mar 2022 06:12:20 GMT
- Title: WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
- Authors: Sha Yuan, Shuai Zhao, Jiahong Leng, Zhao Xue, Hanyu Zhao and Jie Tang
- Abstract summary: We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
- Score: 2.603259641572195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compared with domain-specific models, vision-language pre-training
models (VLPMs) have shown superior performance on downstream tasks with a fast
fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with
a uniform transformer stack architecture and large amounts of image-text
paired data, achieving remarkable results on downstream tasks such as
image-text retrieval (IR and TR), visual question answering (VQA) and image
captioning (IC). During the training phase, VLPMs are always fed with a
combination of multiple public datasets to meet the demand for large-scale
training data. However, due to the uneven distribution of data in terms of
size, task type and quality, using a mixture of multiple datasets for model
training can be problematic. In this work, we introduce a large-scale
multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs
in total. Specifically, about 600 million pairs are collected from webpages
in which the image and caption are only weakly correlated, and the other 50
million strongly correlated image-text pairs are collected from high-quality
graphic websites. We also release a base version of WuDaoMM with 5 million
strongly correlated image-text pairs, which is sufficient to support common
cross-modal model pre-training. In addition, we trained both an understanding
and a generation vision-language (VL) model to test the dataset's
effectiveness. The results show that WuDaoMM can serve as an efficient dataset
for VLPM training, especially for models on the text-to-image generation task.
The data is released at https://data.wudaoai.cn
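The abstract describes WuDaoMM purely in terms of image-text pairs, so a short data-loading sketch can make the intended pre-training usage concrete. The metadata layout below (a hypothetical JSONL file named wudaomm_base.jsonl mapping local image paths to captions) is an illustrative assumption; the actual release format at https://data.wudaoai.cn may differ.

```python
# Minimal sketch: iterate over image-caption pairs for VL pre-training.
# Assumes a hypothetical "wudaomm_base.jsonl" with one
# {"image": "<relative path>", "caption": "<text>"} record per line.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class ImageTextPairs(Dataset):
    def __init__(self, metadata_path: str, image_root: str, image_size: int = 224):
        lines = Path(metadata_path).read_text().splitlines()
        self.records = [json.loads(line) for line in lines if line.strip()]
        self.image_root = Path(image_root)
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.image_root / rec["image"]).convert("RGB")
        return self.transform(image), rec["caption"]


if __name__ == "__main__":
    loader = DataLoader(ImageTextPairs("wudaomm_base.jsonl", "images/"),
                        batch_size=32, shuffle=True, num_workers=4)
    for images, captions in loader:
        # images: (B, 3, 224, 224) float tensor; captions: list of B strings,
        # to be tokenized by whichever text encoder the VLPM uses.
        break
```

In a real VLPM pipeline the caption batch would additionally be tokenized and passed to the text encoder alongside the image batch, whether the objective is understanding (e.g. retrieval, VQA) or text-to-image generation.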
Related papers
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images.
Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that this approach suffers from multiple limitations including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
- RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing [26.71560933421903]
We propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM).
We present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions.
arXiv Detail & Related papers (2023-06-20T05:30:59Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language; a minimal sketch of this kind of CLIP-similarity filtering appears after this list.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt enables several new capabilities.
Fine-tuning trained models to new samples can be achieved by simply adding them to the table.
Our diffusion-based model trains on images only, by leveraging a joint Text-Image multi-modal metric.
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
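Several entries above, such as WuDaoMM's split between weakly and strongly correlated pairs, LAION-5B's CLIP filtering, and Sieve's pruning signal, hinge on scoring how well a caption matches its image. The sketch below (as noted in the LAION-5B entry) shows one common way to do this with an off-the-shelf CLIP model through the open_clip library; the checkpoint name, the 0.28 threshold, and the (image_path, caption) input format are illustrative assumptions, not the exact recipe of any paper listed here.

```python
# Hedged sketch: score image-caption correlation with a pretrained CLIP model
# and split pairs into strongly / weakly correlated sets by a fixed threshold.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()


@torch.no_grad()
def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = tokenizer([caption]).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()


def split_by_correlation(pairs, threshold: float = 0.28):
    """pairs: iterable of (image_path, caption); returns (strong, weak) lists."""
    strong, weak = [], []
    for path, caption in pairs:
        bucket = strong if clip_similarity(path, caption) >= threshold else weak
        bucket.append((path, caption))
    return strong, weak
```

Thresholding a raw CLIP cosine score is the simplest possible criterion; the Sieve entry above argues that such scores can produce false positives and negatives, which is why it instead relies on synthetic captions from dedicated image-captioning models.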
This list is automatically generated from the titles and abstracts of the papers on this site.