OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text
Documents
- URL: http://arxiv.org/abs/2306.16527v2
- Date: Mon, 21 Aug 2023 09:35:52 GMT
- Title: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text
Documents
- Authors: Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman,
Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander
M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
- Abstract summary: We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large multimodal models trained on natural documents, which interleave images
and text, outperform models trained on image-text pairs on various multimodal
benchmarks. However, the datasets used to train these models have not been
released, and the collection process has not been fully specified. We introduce
the OBELICS dataset, an open web-scale filtered dataset of interleaved
image-text documents comprising 141 million web pages extracted from Common
Crawl, 353 million associated images, and 115 billion text tokens. We describe
the dataset creation process, present comprehensive filtering rules, and
provide an analysis of the dataset's content. To show the viability of OBELICS,
we train vision and language models of 9 and 80 billion parameters named
IDEFICS, and obtain competitive performance on different multimodal benchmarks.
We release our dataset, models and code.
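
Since the dataset is released, the interleaved structure described above can be inspected directly. Below is a minimal sketch of streaming it with the Hugging Face `datasets` library; the dataset identifier `HuggingFaceM4/OBELICS` and the `images`/`texts` column layout (position-aligned lists where exactly one of the two is set per slot) are assumptions based on the release described in the abstract, not details stated there.

```python
# Minimal sketch: streaming interleaved OBELICS documents.
# Assumes the dataset is published on the Hugging Face Hub as
# "HuggingFaceM4/OBELICS" with per-document "images" (URLs or None)
# and "texts" (strings or None) columns aligned by position.
from datasets import load_dataset

# Streaming avoids downloading the full web-scale corpus up front.
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

for i, doc in enumerate(obelics):
    # Each document interleaves image slots and text slots in reading order.
    for image_url, text in zip(doc["images"], doc["texts"]):
        if image_url is not None:
            print(f"[image] {image_url}")
        elif text is not None:
            print(f"[text]  {text[:80]}")
    if i == 2:  # inspect only the first few documents
        break
```

Streaming is a natural fit here: with 141 million documents, materializing the corpus locally is rarely necessary for inspection or on-the-fly training.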
Related papers
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z) - OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
The dataset is 15 times larger than comparable interleaved datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce a contrastive loss into text-generation models, partitioning the language model into a dedicated unimodal text-processing component and a multimodal data-handling component (see the sketch after this list).
To bridge the gap in video-text data, this work also introduces VideoDatasetName, an interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z) - GLAMI-1M: A Multilingual Image-Text Fashion Dataset [0.0]
GLAMI-1M is the largest multilingual image-text classification dataset and benchmark.
The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages.
arXiv Detail & Related papers (2022-11-17T13:19:07Z)
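
For the COSMO entry above, the central idea is combining a contrastive objective with autoregressive text generation. The sketch below shows one generic way to add an InfoNCE-style contrastive loss over pooled image and text embeddings alongside a language-modeling loss; the function, argument names, and fixed loss weight are illustrative assumptions, not COSMO's actual architecture or hyperparameters.

```python
# Illustrative sketch: joint contrastive + language-modeling objective,
# loosely following the idea described in the COSMO summary above.
# All names and weights here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def joint_loss(image_emb, text_emb, lm_logits, target_ids,
               temperature=0.07, contrastive_weight=0.5):
    """image_emb, text_emb: (batch, dim) pooled embeddings.
    lm_logits: (batch, seq_len, vocab) next-token predictions.
    target_ids: (batch, seq_len) gold token ids (-100 = ignore)."""
    # Contrastive (InfoNCE) term: matching image/text pairs lie on the diagonal.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Standard language-modeling term over the text tokens.
    lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                         target_ids.reshape(-1), ignore_index=-100)

    return lm + contrastive_weight * contrastive
```

In practice the two embeddings would come from the unimodal text branch and the visual encoder respectively, and the weighting between the two terms would be tuned; the summary above does not specify those details.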