Captioning Images Taken by People Who Are Blind
- URL: http://arxiv.org/abs/2002.08565v2
- Date: Wed, 15 Jul 2020 15:48:35 GMT
- Title: Captioning Images Taken by People Who Are Blind
- Authors: Danna Gurari, Yinan Zhao, Meng Zhang, Nilavra Bhattacharya
- Abstract summary: VizWiz-Captions consists of over 39,000 images taken by people who are blind, each paired with five captions.
We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets.
- Score: 25.263950448575923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While an important problem in the vision community is to design algorithms
that can automatically caption images, few publicly-available datasets for
algorithm development directly address the interests of real users. Observing
that people who are blind have relied on (human-based) image captioning
services to learn about images they take for nearly a decade, we introduce the
first image captioning dataset to represent this real use case. This new
dataset, which we call VizWiz-Captions, consists of over 39,000 images taken
by people who are blind, each paired with five captions.
We analyze this dataset to (1) characterize the typical captions, (2)
characterize the diversity of content found in the images, and (3) compare its
content to that found in eight popular vision datasets. We also analyze modern
image captioning algorithms to identify what makes this new dataset challenging
for the vision community. We publicly share the dataset with captioning
challenge instructions at https://vizwiz.org
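For readers who want to work with a dataset of this shape (many images, each paired with several crowdsourced captions), a minimal loading sketch is shown below. The JSON field names are illustrative assumptions in the style of COCO-format annotations, not necessarily the exact VizWiz-Captions schema.

```python
import json
from collections import defaultdict

def load_image_captions(annotation_path):
    """Group captions by image for a COCO-style annotation file.

    Assumes a hypothetical schema with top-level "images" and "annotations"
    lists; the real VizWiz-Captions files may use different field names.
    """
    with open(annotation_path) as f:
        data = json.load(f)

    captions_by_image = defaultdict(list)
    for ann in data["annotations"]:
        captions_by_image[ann["image_id"]].append(ann["caption"])

    # Pair each image's file name with its (typically five) captions.
    return {
        img["file_name"]: captions_by_image[img["id"]]
        for img in data["images"]
    }

# Example usage (path is a placeholder):
# pairs = load_image_captions("annotations/train.json")
# for file_name, captions in list(pairs.items())[:3]:
#     print(file_name, captions)
```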
Related papers
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- LineCap: Line Charts for Data Visualization Captioning Models [6.3596637237946725]
We introduce LineCap, a novel figure captioning dataset of 3,528 figures.
We provide insights from curating this dataset and using end-to-end deep learning models for automated figure captioning.
arXiv Detail & Related papers (2022-07-15T00:35:59Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
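As an illustration of the OCR-supplemented input mentioned above, here is a minimal sketch assuming the pytesseract and Pillow libraries are available; the cited work's actual pipeline is not specified here and may differ.

```python
from PIL import Image          # Pillow
import pytesseract             # requires the Tesseract OCR binary

def build_model_input(image_path, base_prompt="Describe the image."):
    """Append OCR-extracted text to the textual input of a captioning model.

    Illustrative only: how the OCR text is fused with visual features is
    model-specific and not shown here.
    """
    ocr_text = pytesseract.image_to_string(Image.open(image_path)).strip()
    if ocr_text:
        return f"{base_prompt} Text found in the image: {ocr_text}"
    return base_prompt

# Example usage (path is a placeholder):
# print(build_model_input("photo_of_a_receipt.jpg"))
```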
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- Who's Waldo? Linking People Across Text and Images [56.40556801773923]
We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
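A minimal sketch of the name-masking step described above, assuming the person names for each caption are already known (e.g. from metadata or a named-entity recognizer); the mask token and matching rule are illustrative, not the paper's exact preprocessing.

```python
import re

MASK_TOKEN = "[NAME]"

def mask_names(caption, person_names):
    """Replace each known person name in a caption with a mask token."""
    # Mask longer names first so "Mary Ann" is not split by masking "Ann".
    for name in sorted(person_names, key=len, reverse=True):
        caption = re.sub(rf"\b{re.escape(name)}\b", MASK_TOKEN, caption)
    return caption

print(mask_names("Alice hands Bob a trophy on stage.", ["Alice", "Bob"]))
# -> "[NAME] hands [NAME] a trophy on stage."
```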
arXiv Detail & Related papers (2021-08-16T17:36:49Z)
- #PraCegoVer: A Large Dataset for Image Captioning in Portuguese [6.890235464357029]
#PraCegoVer is the first large dataset for image captioning in Portuguese with freely annotated images.
A movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them #PraCegoVer, and add a short description of their content.
arXiv Detail & Related papers (2021-03-21T19:55:46Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
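A generic sketch of such a dual-encoder contrastive objective is shown below, using a symmetric InfoNCE-style loss over a batch of paired embeddings; the temperature value and implementation details are assumptions, not the paper's exact formulation.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    image_emb and text_emb are (batch, dim) arrays whose matching rows
    are positive pairs; every other row in the batch is a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature   # (batch, batch)
    targets = np.arange(logits.shape[0])

    def cross_entropy(rows):
        # Row-wise log-softmax; the diagonal entries are the targets.
        rows = rows - rows.max(axis=1, keepdims=True)
        log_probs = rows - np.log(np.exp(rows).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```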
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
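A minimal sketch of the multi-task objective this describes: a caption word-prediction loss combined with a weighted auxiliary object/predicate tag-prediction loss. The loss form and weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sequence_nll(logits, target_ids):
    """Mean negative log-likelihood of target token ids under per-step logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    steps = np.arange(len(target_ids))
    return -log_probs[steps, target_ids].mean()

def multitask_loss(word_logits, word_ids, tag_logits, tag_ids, tag_weight=0.5):
    """Caption word loss plus a weighted auxiliary object/predicate tag loss."""
    return sequence_nll(word_logits, word_ids) + tag_weight * sequence_nll(tag_logits, tag_ids)

# Toy shapes: a 6-step caption over a 100-word vocab, 6 tags over 20 classes.
rng = np.random.default_rng(0)
print(multitask_loss(rng.normal(size=(6, 100)), rng.integers(0, 100, 6),
                     rng.normal(size=(6, 20)), rng.integers(0, 20, 6)))
```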
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.