Textual Visual Semantic Dataset for Text Spotting
- URL: http://arxiv.org/abs/2004.10349v1
- Date: Tue, 21 Apr 2020 23:58:16 GMT
- Title: Textual Visual Semantic Dataset for Text Spotting
- Authors: Ahmed Sabir, Francesc Moreno-Noguer and Lluís Padró
- Abstract summary: Text Spotting in the wild consists of detecting and recognizing text appearing in images.
This is a challenging problem due to the complexity of the context where texts appear.
We propose a visual context dataset for Text Spotting in the wild.
- Score: 27.788077963411624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text Spotting in the wild consists of detecting and recognizing text
appearing in images (e.g. signboards, traffic signals or brands in clothing or
objects). This is a challenging problem due to the complexity of the context
where texts appear (uneven backgrounds, shading, occlusions, perspective
distortions, etc.). Only a few approaches try to exploit the relation between
text and its surrounding environment to better recognize text in the scene. In
this paper, we propose a visual context dataset for Text Spotting in the wild,
where the publicly available dataset COCO-text [Veit et al. 2016] has been
extended with information about the scene (such as objects and places appearing
in the image) to enable researchers to include semantic relations between texts
and scene in their Text Spotting systems, and to offer a common framework for
such approaches. For each text in an image, we extract three kinds of context
information: objects in the scene, image location label and a textual image
description (caption). We use state-of-the-art out-of-the-box available tools
to extract this additional information. Since this information has textual
form, it can be used to leverage text similarity or semantic relation methods
into Text Spotting systems, either as a post-processing or in an end-to-end
training strategy. Our data is publicly available at https://git.io/JeZTb.
Related papers
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms several state-of-the-art image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP on two benchmark datasets, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve all text instances in an image gallery that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
- SwapText: Image Based Texts Transfer in Scenes [13.475726959175057]
We present SwapText, a framework to transfer texts across scene images.
A novel text swapping network is proposed to replace text labels only in the foreground image.
The generated foreground and background images are then combined by a fusion network to produce the final word image.
arXiv Detail & Related papers (2020-03-18T11:02:17Z)