RedCaps: web-curated image-text data created by the people, for the people
- URL: http://arxiv.org/abs/2111.11431v1
- Date: Mon, 22 Nov 2021 18:59:34 GMT
- Title: RedCaps: web-curated image-text data created by the people, for the people
- Authors: Karan Desai, Gaurav Kaul, Zubin Aysola, Justin Johnson
- Abstract summary: We introduce RedCaps -- a large-scale dataset of 12M image-text pairs collected from Reddit.
Images and captions from Reddit depict and describe a wide variety of objects and scenes.
We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.
- Score: 12.58157541985447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large datasets of paired images and text have become increasingly popular for
learning generic representations for vision and vision-and-language tasks. Such
datasets have been built by querying search engines or collecting HTML alt-text
-- since web data is noisy, they require complex filtering pipelines to
maintain quality. We explore alternate data sources to collect high quality
data with minimal filtering. We introduce RedCaps -- a large-scale dataset of
12M image-text pairs collected from Reddit. Images and captions from Reddit
depict and describe a wide variety of objects and scenes. We collect data from
a manually curated set of subreddits, which give coarse image labels and allow
us to steer the dataset composition without labeling individual instances. We
show that captioning models trained on RedCaps produce rich and varied captions
preferred by humans, and learn visual representations that transfer to many
downstream tasks.
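For intuition, here is a minimal sketch of what subreddit-curated collection could look like. It is not the authors' pipeline: it assumes the `praw` Reddit API wrapper, placeholder credentials, and a hypothetical three-subreddit list standing in for the paper's manually curated set.

```python
# Minimal sketch of subreddit-curated image-caption collection, in the
# spirit of the RedCaps idea (not the authors' actual code).
# Requires the `praw` Reddit API wrapper and real API credentials to run.
import praw

# Hypothetical curated subreddit list; the real curated set ships with
# the dataset and is far larger than this placeholder.
CURATED_SUBREDDITS = ["itookapicture", "birdpics", "foodporn"]
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="redcaps-style-collector",
)

pairs = []
for name in CURATED_SUBREDDITS:
    for post in reddit.subreddit(name).top(limit=100):
        # Keep only direct image links; the post title serves as the
        # caption, and the subreddit name doubles as a coarse label.
        if post.url.lower().endswith(IMAGE_EXTENSIONS):
            pairs.append({
                "image_url": post.url,
                "caption": post.title,
                "label": name,
            })

print(f"collected {len(pairs)} image-text pairs")
```

Because every post comes from a chosen subreddit, the subreddit name acts as a coarse label without any per-instance annotation, which is the dataset-steering mechanism the abstract describes.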
Related papers
- From Pixels to Prose: A Large Dataset of Dense Image Captions [76.97493750144812]
PixelProse is a comprehensive dataset of over 16M synthetically generated captions.
To ensure data integrity, we rigorously analyze our dataset for problematic content.
We also provide valuable metadata such as watermark presence and aesthetic scores.
arXiv Detail & Related papers (2024-06-14T17:59:53Z)
- Satellite Captioning: Large Language Models to Augment Labeling [0.0]
Caption datasets present a much more difficult challenge due to language differences, grammar, and the time it takes humans to generate them.
Current datasets provide many instances to work with, but problems arise when a captioner has a limited vocabulary.
This paper addresses these potential information and communication shortcomings in caption datasets.
arXiv Detail & Related papers (2023-12-18T03:21:58Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [5.010418546872244]
We present a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions.
We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically.
arXiv Detail & Related papers (2023-02-23T17:30:18Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
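As a hypothetical illustration of the style-token idea (the function, vocabulary, and token layout are invented for this sketch, not taken from the paper), conditioning could amount to prepending a style token and retrieved keywords to the decoder's input sequence:

```python
# Hypothetical sketch of style-token + keyword conditioning for a
# captioning decoder; names and token layout are illustrative only.
def build_decoder_input(style, keywords, caption_tokens, vocab):
    # A dedicated token such as <coco_style> tells the decoder which
    # descriptive style to produce.
    style_token = vocab[f"<{style}_style>"]
    # Keywords retrieved for the image steer the caption's semantics.
    keyword_tokens = [vocab[k] for k in keywords]
    return [style_token] + keyword_tokens + caption_tokens

vocab = {"<coco_style>": 0, "<web_style>": 1, "dog": 2, "park": 3,
         "a": 4, "runs": 5}
print(build_decoder_input("coco", ["dog", "park"], [4, 2, 5], vocab))
# -> [0, 2, 3, 4, 2, 5]
```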
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- MultiSubs: A Large-scale Multimodal and Multilingual Dataset [32.48454703822847]
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language.
The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles.
We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation.
arXiv Detail & Related papers (2021-03-02T18:09:07Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
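The dual-encoder objective lends itself to a compact sketch. The following NumPy version of a symmetric contrastive (InfoNCE-style) loss is illustrative only; the function name, temperature value, and implementation details are chosen here rather than taken from the paper.

```python
# Illustrative NumPy sketch of a dual-encoder contrastive objective
# (symmetric InfoNCE over a batch); not the paper's implementation.
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Align paired embeddings: matching pairs sit on the diagonal."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])
    # Cross-entropy in both directions: image->text and text->image.
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_p_i2t[idx, idx].mean() + log_p_t2i[idx, idx].mean()) / 2

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
print(f"batch loss: {loss:.3f}")
```

The appeal of this scheme is that it needs only paired data and no per-instance labels, which is why scale can compensate for noise.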
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.