TextCaps: a Dataset for Image Captioning with Reading Comprehension
- URL: http://arxiv.org/abs/2003.12462v2
- Date: Tue, 4 Aug 2020 04:08:02 GMT
- Title: TextCaps: a Dataset for Image Captioning with Reading Comprehension
- Authors: Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
- Abstract summary: Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
- Score: 56.89608505010651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image descriptions can help visually impaired people to quickly understand
the image content. While we made significant progress in automatically
describing images and optical character recognition, current approaches are
unable to include written text in their descriptions, although text is
omnipresent in human environments and frequently critical to understand our
surroundings. To study how to comprehend text in the context of an image we
collect a novel dataset, TextCaps, with 145k captions for 28k images. Our
dataset challenges a model to recognize text, relate it to its visual context,
and decide what part of the text to copy or paraphrase, requiring spatial,
semantic, and visual reasoning between multiple text tokens and visual
entities, such as objects. We study baselines and adapt existing approaches to
this new task, which we refer to as image captioning with reading
comprehension. Our analysis with automatic and human studies shows that our new
TextCaps dataset provides many new technical challenges over previous datasets.
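The copy-or-paraphrase decision described in the abstract is commonly handled with a pointer-style mechanism that scores detected OCR tokens alongside a fixed vocabulary. The snippet below is a minimal sketch of that idea in PyTorch; the class name, feature shapes, and dimensions are illustrative assumptions, not the baselines studied in the paper.

```python
# Hypothetical illustration (not the paper's baselines): a single decoding step
# that scores both fixed-vocabulary words and OCR tokens found in the image,
# so the model can either generate a word or copy a piece of scene text.
import torch
import torch.nn as nn


class CopyOrGenerateStep(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, ocr_feat_dim: int):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)  # scores for the fixed vocabulary
        self.ocr_proj = nn.Linear(ocr_feat_dim, hidden_dim)   # projects OCR-token features

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, hidden_dim) decoder state at the current step
        # ocr_feats: (batch, num_ocr, ocr_feat_dim) features of OCR tokens detected in the image
        vocab_scores = self.vocab_head(dec_state)                     # (batch, vocab_size)
        copy_scores = torch.bmm(self.ocr_proj(ocr_feats),
                                dec_state.unsqueeze(-1)).squeeze(-1)  # (batch, num_ocr)
        # A single softmax over vocabulary words and OCR tokens: the most likely
        # index decides whether the next token is generated or copied.
        return torch.log_softmax(torch.cat([vocab_scores, copy_scores], dim=-1), dim=-1)


step = CopyOrGenerateStep(hidden_dim=256, vocab_size=5000, ocr_feat_dim=300)
log_probs = step(torch.randn(2, 256), torch.randn(2, 12, 300))  # shape (2, 5012)
```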
Related papers
- Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing [4.057550183467041]
The field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models.
We present a comprehensive, multi-perspective analysis of recent advancements in this field.
arXiv Detail & Related papers (2024-02-05T15:13:20Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework to explore the power of words in sentences.
The framework employs the pre-trained full CLIP model as a dual encoder for the images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences; a sketch of such a loss appears after this list.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss; a sketch of such an objective appears after this list.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
The objects, attributes, and relations extracted from the scene graph, together with the corresponding geometric relation information, are encoded with Graph Convolutional Networks.
This graph representation is aggregated with word-level and sentence-level embeddings to generate visual contextual word and sentence representations.
arXiv Detail & Related papers (2020-10-07T05:25:30Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
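The cross-modal triplet loss mentioned in the "Unleashing the Imagination of Text" entry can be illustrated with an in-batch hard-negative variant. The code below is a hedged sketch under assumed normalized embeddings and margin, not the paper's exact formulation.

```python
# Illustrative cross-modal triplet loss with in-batch hard-negative mining
# (a sketch, not the exact loss from the paper): for each image, the most
# similar non-matching caption in the batch serves as the hard negative.
import torch
import torch.nn.functional as F


def hard_triplet_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                          # (batch, batch) cosine similarities
    pos = sim.diag()                                     # matching image-caption pairs
    diag_mask = torch.eye(sim.size(0), dtype=torch.bool)
    hardest_neg = sim.masked_fill(diag_mask, float('-inf')).max(dim=1).values
    return F.relu(margin + hardest_neg - pos).mean()     # hinge on the hardest negative


loss = hard_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
```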
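Similarly, the contrastive dual-encoder training described in the "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" entry typically reduces to a symmetric cross-entropy over in-batch similarities. The sketch below assumes batched embeddings and an illustrative temperature value.

```python
# Minimal sketch of a symmetric contrastive (InfoNCE-style) loss for a
# dual-encoder image-text model; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matching pairs lie on the diagonal
    # Average the image-to-text and text-to-image retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```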