LAION-5B: An open large-scale dataset for training next generation
image-text models
- URL: http://arxiv.org/abs/2210.08402v1
- Date: Sun, 16 Oct 2022 00:08:18 GMT
- Title: LAION-5B: An open large-scale dataset for training next generation
image-text models
- Authors: Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross
Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell
Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig
Schmidt, Robert Kaczmarczyk and Jenia Jitsev
- Abstract summary: We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
- Score: 16.129935376579326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Groundbreaking language-vision architectures like CLIP and DALL-E proved the
utility of training on large amounts of noisy image-text data, without relying
on expensive accurate labels used in standard vision unimodal supervised
learning. The resulting models showed capabilities of strong text-guided image
generation and transfer to downstream tasks, while performing remarkably at
zero-shot classification with noteworthy out-of-distribution robustness. Since
then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and
Imagen made further improvements. Studying the training and capabilities of
such models requires datasets containing billions of image-text pairs. Until
now, no datasets of this size have been made openly available for the broader
research community. To address this problem and democratize research on
large-scale multi-modal models, we present LAION-5B - a dataset consisting of
5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English
language. We show successful replication and fine-tuning of foundational models
like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further
experiments enabled with an openly available dataset of this scale.
Additionally, we provide several nearest neighbor indices, an improved
web-interface for dataset exploration and subset generation, and detection
scores for watermark, NSFW, and toxic content. Announcement page:
https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
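The core construction step behind the "CLIP-filtered" pairs is a similarity check: a candidate image and its crawled caption are kept only if the cosine similarity of their CLIP embeddings clears a threshold. Below is a minimal sketch of that check using the public OpenAI `clip` package; the model variant, the `SIM_THRESHOLD` value, and the helper names are illustrative assumptions rather than the exact LAION-5B pipeline, which runs over precomputed embeddings in large batches.

```python
# Minimal sketch of CLIP-based image-text filtering (illustrative, not the
# exact LAION-5B pipeline). Assumes PyTorch, Pillow and the OpenAI `clip`
# package are installed.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative cutoff; the real pipeline tunes this per language subset.
SIM_THRESHOLD = 0.28


def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(text)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()


def keep_pair(image_path: str, caption: str) -> bool:
    """Keep an image-text pair only if CLIP judges image and caption related."""
    return clip_similarity(image_path, caption) >= SIM_THRESHOLD
```

The same image embeddings also back the released nearest neighbor indices mentioned above, which allow kNN queries for dataset exploration and subset generation instead of scanning billions of pairs.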
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale than existing counterparts while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several different, relatively small open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- NLLB-CLIP -- train performant multilingual image retrieval model on a budget [65.268245109828]
We present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
arXiv Detail & Related papers (2023-09-04T23:26:11Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- CCMB: A Large-scale Chinese Cross-modal Benchmark [46.349966178044184]
We build a large-scale high-quality Chinese Cross-Modal Benchmark named CCMB for the research community.
Zero, the pre-training dataset, contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also the largest of their kind for Chinese cross-modal downstream tasks.
arXiv Detail & Related papers (2022-05-08T13:19:23Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt enables several new capabilities.
Adapting trained models to new samples can be achieved by simply adding them to the retrieval table.
Our diffusion-based model trains on images only, by leveraging a joint Text-Image multi-modal metric.
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million of these pairs are collected from webpages where the image and its caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.