Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with
Text
- URL: http://arxiv.org/abs/2304.06939v3
- Date: Sat, 28 Oct 2023 04:19:41 GMT
- Title: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with
Text
- Authors: Wanrong Zhu and Jack Hessel and Anas Awadalla and Samir Yitzhak Gadre
and Jesse Dodge and Alex Fang and Youngjae Yu and Ludwig Schmidt and William
Yang Wang and Yejin Choi
- Abstract summary: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input.
To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved.
- Score: 130.89493542553151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context vision and language models like Flamingo support arbitrarily
interleaved sequences of images and text as input. This format not only enables
few-shot learning via interleaving independent supervised (image, text)
examples, but also, more complex prompts involving interaction between images,
e.g., "What do image A and image B have in common?" To support this interface,
pretraining occurs over web corpora that similarly contain interleaved
images+text. To date, however, large-scale data of this form have not been
publicly available.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus
with images interleaved. We use a linear assignment algorithm to place images
into longer bodies of text using CLIP features, a process that we show
outperforms alternatives. Multimodal C4 spans everyday topics like cooking,
travel, technology, etc. A manual inspection of a random sample of documents
shows that a vast majority (88%) of images are topically relevant, and that
linear assignment frequently selects individual sentences specifically
well-aligned with each image (80%). After filtering NSFW images, ads, etc., the
resulting corpus consists of 101.2M documents with 571M images interleaved in
43B English tokens.
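Below is a minimal sketch of the image-placement step described above: compute CLIP features for a document's images and sentences, then solve a linear assignment that maximizes total image-sentence similarity. It is illustrative only; the open_clip model, the ViT-B-32 checkpoint, and the use of scipy's linear_sum_assignment are assumptions, and the released MMC4 pipeline may differ in model choice, sentence segmentation, and thresholds.
```python
# Minimal sketch of CLIP-based image-to-sentence linear assignment
# (illustrative; not the released MMC4 pipeline).
import torch
import open_clip
from PIL import Image
from scipy.optimize import linear_sum_assignment

# Any CLIP checkpoint works for the sketch; MMC4's exact choice may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def assign_images_to_sentences(image_paths, sentences):
    """Return (image_index, sentence_index, similarity) triples that maximize
    total CLIP similarity, with each image placed at a distinct sentence."""
    with torch.no_grad():
        images = torch.stack([preprocess(Image.open(p).convert("RGB"))
                              for p in image_paths])
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(tokenizer(sentences))
    # Cosine similarity between every image and every sentence.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    sim = (img_feats @ txt_feats.T).cpu().numpy()  # [n_images, n_sentences]
    # Linear assignment maximizes total similarity, so minimize its negation.
    rows, cols = linear_sum_assignment(-sim)
    return [(int(r), int(c), float(sim[r, c])) for r, c in zip(rows, cols)]
```
Solving the assignment jointly over the whole document places each image at a distinct sentence, which is consistent with the abstract's observation that linear assignment often selects an individual sentence well aligned with each image.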
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger than existing counterparts while maintaining good data quality.
We hope it can provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on image-level tasks that rely on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
- Linking Representations with Multimodal Contrastive Learning [1.6874375111244329]
In historical record linkage applications, documents are typically noisily transcribed by optical character recognition (OCR).
To leverage multimodal learning, this study develops CLIPPINGS (Contrastively LInking Pooled Pre-trained Embeddings).
arXiv Detail & Related papers (2023-04-07T03:39:08Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Sequence-aware multimodal page classification of Brazilian legal documents [0.21204495827342434]
We train and evaluate our methods on a novel multimodal dataset of 6,510 lawsuits.
Each lawsuit is an ordered sequence of pages, which are stored both as an image and as a corresponding text.
We use pre-trained models as extractors of visual and textual features, which are then combined through our proposed Fusion Module.
arXiv Detail & Related papers (2022-07-02T06:23:25Z)
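As a concrete illustration of the prompting step in the LLaVAR entry above, the sketch below feeds an image caption plus OCR output for a text-rich image to a text-only chat model and asks for question-answer pairs. The prompt wording, the generate_conversation helper, and the use of the OpenAI chat-completions client are illustrative assumptions; LLaVAR's exact prompts and setup may differ.
```python
# Illustrative sketch of LLaVAR-style instruction generation: give a text-only
# chat model the caption and OCR text of an image and request QA pairs.
# Prompt, helper name, and model choice are assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are given the caption of an image and the text recognized in it by OCR. "
    "Write a short conversation of question-answer pairs that a user could ask "
    "about the text in the image, answerable only from the OCR results."
)

def generate_conversation(caption: str, ocr_text: str, model: str = "gpt-4") -> str:
    user_msg = f"Caption: {caption}\nOCR results: {ocr_text}"
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(generate_conversation("a poster for a jazz concert",
#                             "BLUE NOTE FESTIVAL / JULY 14 / CITY PARK"))
```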