From Pixels to Prose: A Large Dataset of Dense Image Captions
- URL: http://arxiv.org/abs/2406.10328v1
- Date: Fri, 14 Jun 2024 17:59:53 GMT
- Title: From Pixels to Prose: A Large Dataset of Dense Image Captions
- Authors: Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
- Abstract summary: PixelProse is a comprehensive dataset of over 16M (million) synthetically generated captions.
To ensure data integrity, we rigorously analyze our dataset for problematic content.
We also provide valuable metadata such as watermark presence and aesthetic scores.
- Score: 76.97493750144812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose
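Since the release ships with per-image metadata (watermark presence, aesthetic scores) intended for downstream filtering, a natural usage pattern is to stream the dataset from the Hugging Face Hub and filter on those fields. The sketch below is illustrative only: it assumes the standard `datasets` library, and the split and column names (`url`, `vlm_caption`, `watermark_class_score`, `aesthetic_score`) are assumptions, so consult the dataset card for the actual schema before relying on them.

```python
# Minimal sketch: stream PixelProse from the Hugging Face Hub and keep
# low-watermark, high-aesthetic samples. Column names ("url", "vlm_caption",
# "watermark_class_score", "aesthetic_score") are assumptions -- check the
# dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset(
    "tomg-group-umd/pixelprose",  # repo from the paper's dataset URL
    split="train",                # assumed split name
    streaming=True,               # avoid downloading all ~16M rows up front
)

def keep(example):
    # Drop likely-watermarked images and keep visually pleasing ones
    # (thresholds here are arbitrary placeholders).
    return (
        example["watermark_class_score"] < 0.5
        and example["aesthetic_score"] > 5.0
    )

filtered = ds.filter(keep)

for row in filtered.take(3):
    print(row["url"], row["vlm_caption"][:80])
```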
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger than comparable datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- A Dense Material Segmentation Dataset for Indoor and Outdoor Scene Parsing [1.7404865362620798]
We propose a large-scale dataset of 3.2 million dense segments on 44,560 indoor and outdoor images.
Our data covers a more diverse set of scenes, objects, viewpoints and materials.
We show that a model trained on our data outperforms a state-of-the-art model across datasets and viewpoints.
arXiv Detail & Related papers (2022-07-21T17:15:41Z)
- RedCaps: web-curated image-text data created by the people, for the people [12.58157541985447]
We introduce RedCaps -- a large-scale dataset of 12M image-text pairs collected from Reddit.
Images and captions from Reddit depict and describe a wide variety of objects and scenes.
We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.
arXiv Detail & Related papers (2021-11-22T18:59:34Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
- Understanding Mobile GUI: from Pixel-Words to Screen-Sentences [48.97215653702567]
We propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS).
Pixel-Words are defined as atomic visual components, which are visually consistent and semantically clear across screenshots.
We are able to make use of metadata available in training data to auto-generate high-quality annotations for Pixel-Words.
arXiv Detail & Related papers (2021-05-25T13:45:54Z)
- DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [117.41383937100751]
Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets.
We show how the GAN latent code can be decoded to produce a semantic segmentation of the image.
These generated datasets can then be used for training any computer vision architecture just as real datasets are.
arXiv Detail & Related papers (2021-04-13T20:08:29Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
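For context on the dual-encoder approach mentioned in the entry above, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss that aligns image and text embeddings. It illustrates the general technique, not the paper's actual implementation; the embedding size, temperature, and stand-in inputs are placeholders.

```python
# Minimal sketch of a dual-encoder contrastive objective: matched image-text
# pairs sit on the diagonal of the similarity matrix and are pushed above all
# mismatched pairs in both directions. Values here are placeholders, not the
# paper's settings.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) outputs of the image and text encoders."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in embeddings:
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```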
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.