TextCaps: a Dataset for Image Captioning with Reading Comprehension
- URL: http://arxiv.org/abs/2003.12462v2
- Date: Tue, 4 Aug 2020 04:08:02 GMT
- Title: TextCaps: a Dataset for Image Captioning with Reading Comprehension
- Authors: Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
- Abstract summary: Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
- Score: 56.89608505010651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image descriptions can help visually impaired people to quickly understand
the image content. While we have made significant progress in automatically
describing images and in optical character recognition, current approaches are
unable to include written text in their descriptions, although text is
omnipresent in human environments and frequently critical to understand our
surroundings. To study how to comprehend text in the context of an image we
collect a novel dataset, TextCaps, with 145k captions for 28k images. Our
dataset challenges a model to recognize text, relate it to its visual context,
and decide what part of the text to copy or paraphrase, requiring spatial,
semantic, and visual reasoning between multiple text tokens and visual
entities, such as objects. We study baselines and adapt existing approaches to
this new task, which we refer to as image captioning with reading
comprehension. Our analysis with automatic and human studies shows that our new
TextCaps dataset provides many new technical challenges over previous datasets.
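A model for this task must decide, at each decoding step, whether to generate a word from a fixed vocabulary or copy one of the OCR tokens detected in the image. The snippet below is a minimal, illustrative sketch of such a copy-or-generate scoring head; the class name, shapes, and scoring scheme are assumptions for illustration, not the paper's baseline models.

```python
# Minimal sketch (not the TextCaps baselines) of a decoder step that can either
# generate a vocabulary word or copy one of the OCR tokens detected in the image.
# All names and shapes are illustrative.
import torch
import torch.nn as nn


class CopyOrGenerateHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, ocr_feat_dim: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)   # scores for fixed-vocabulary words
        self.ocr_query = nn.Linear(hidden_dim, ocr_feat_dim)  # query used to point at OCR tokens

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        """dec_state: (batch, hidden_dim); ocr_feats: (batch, num_ocr, ocr_feat_dim).
        Returns scores over [vocabulary words ++ OCR tokens]; the argmax picks generate vs. copy."""
        vocab_scores = self.vocab_proj(dec_state)        # (batch, vocab_size)
        query = self.ocr_query(dec_state).unsqueeze(1)   # (batch, 1, ocr_feat_dim)
        ocr_scores = (query * ocr_feats).sum(dim=-1)     # (batch, num_ocr)
        return torch.cat([vocab_scores, ocr_scores], dim=-1)


# Toy usage: 2 images, a 30k-word vocabulary, 5 OCR tokens per image.
head = CopyOrGenerateHead(hidden_dim=512, vocab_size=30000, ocr_feat_dim=512)
scores = head(torch.randn(2, 512), torch.randn(2, 5, 512))
print(scores.shape)  # torch.Size([2, 30005])
```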
Related papers
- Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing [4.057550183467041]
The field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models.
We present a comprehensive, multi-perspective analysis of recent advancements in this field.
arXiv Detail & Related papers (2024-02-05T15:13:20Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
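The dual-encoder alignment described above can be sketched as a symmetric contrastive loss over a batch of image-text pairs; the function name, temperature value, and shapes below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a symmetric image-text contrastive objective: matching pairs are
# pulled together, all other pairs in the batch act as negatives. Encoders omitted.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings from the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))            # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings for a batch of 8 pairs.
print(contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```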
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
Graph Convolutional Networks are used to encode the extracted objects, attributes, and relations of the scene graph, together with the corresponding geometric relation information.
This representation is then aggregated with word-level and sentence-level embeddings to generate visually contextual word and sentence representations.
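The scene-graph aggregation above can be illustrated with a single graph-convolution step over node features and an adjacency matrix; the layer, normalization, and toy graph below are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of one graph-convolution step over a small scene graph:
# each node averages its neighbors' features (self-loops included) and applies
# a learned projection. Sizes and the toy adjacency matrix are illustrative.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """node_feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes) with self-loops."""
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)  # node degrees for normalization
        agg = (adj @ node_feats) / deg                      # mean over neighboring node features
        return torch.relu(self.linear(agg))


# Toy scene graph: 4 nodes (e.g., two objects, an attribute, a relation).
adj = torch.tensor([[1., 1., 1., 0.],
                    [1., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.]])
layer = SimpleGCNLayer(in_dim=300, out_dim=128)
print(layer(torch.randn(4, 300), adj).shape)  # torch.Size([4, 128])
```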
arXiv Detail & Related papers (2020-10-07T05:25:30Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Textual Visual Semantic Dataset for Text Spotting [27.788077963411624]
Text Spotting in the wild consists of detecting and recognizing text appearing in images.
This is a challenging problem due to the complexity of the context where texts appear.
We propose a visual context dataset for Text Spotting in the wild.
arXiv Detail & Related papers (2020-04-21T23:58:16Z)