TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
- URL: http://arxiv.org/abs/2105.05486v1
- Date: Wed, 12 May 2021 07:50:42 GMT
- Title: TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
- Authors: Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner
- Abstract summary: We propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to build PixelM4C, a model that can perform scene-text-based reasoning on an image in an end-to-end fashion.
- Score: 23.04601165885908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets is detecting and recognizing the text present in images with an optical character recognition (OCR) system. Current systems are held back by the unavailability of ground-truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images, which prevents progress in OCR and the evaluation of scene-text-based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text recognition (OCR) models fail to perform well on TextOCR, and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to build PixelM4C, a model that can perform scene-text-based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices and achieve new state-of-the-art performance on the TextVQA dataset.
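To make the end-to-end setup concrete, here is a minimal, hypothetical sketch of an OCR-plus-reasoning model in the spirit of PixelM4C: a recognition stage produces region features and word logits directly from pixels, and a reasoning stage fuses them with a question so the whole pipeline can be trained jointly. All module names, shapes, and the toy backbone are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical end-to-end scene-text reasoning sketch (PixelM4C-like spirit):
# OCR stage and reasoning stage in one module, so gradients from the answer
# loss can flow all the way back to the pixels.
import torch
import torch.nn as nn

class EndToEndSceneTextReasoner(nn.Module):
    def __init__(self, vocab_size: int = 5000, hidden: int = 768):
        super().__init__()
        # Toy stand-in for a text detection/recognition backbone.
        self.ocr_backbone = nn.Conv2d(3, hidden, kernel_size=16, stride=16)
        self.ocr_head = nn.Linear(hidden, vocab_size)  # per-region word logits
        # Toy stand-in for a multimodal reasoning module (M4C-like).
        self.question_embed = nn.Embedding(vocab_size, hidden)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.answer_head = nn.Linear(hidden, vocab_size)

    def forward(self, image, question_ids):
        # OCR stage: dense features stand in for detected word regions.
        regions = self.ocr_backbone(image).flatten(2).transpose(1, 2)  # (B, N, H)
        ocr_logits = self.ocr_head(regions)
        # Reasoning stage: fuse OCR region features with question tokens.
        question = self.question_embed(question_ids)                   # (B, T, H)
        fused = self.fusion(torch.cat([regions, question], dim=1))
        return ocr_logits, self.answer_head(fused[:, 0])               # pooled answer

model = EndToEndSceneTextReasoner()
ocr_logits, answer_logits = model(torch.randn(1, 3, 224, 224),
                                  torch.randint(0, 5000, (1, 12)))
```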
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder; a toy sketch follows this entry.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
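A minimal sketch of the DPTR idea summarized above, assuming a CLIP-style text embedding is available as input: the text embedding serves as pseudo visual memory for a recognition decoder. Dimensions and module names are illustrative assumptions, and CLIP loading is omitted.

```python
# Illustrative sketch: pre-train a recognition decoder on pseudo visual
# embeddings taken from a text encoder (per DPTR, CLIP's text encoder).
import torch
import torch.nn as nn

class PseudoVisualDecoder(nn.Module):
    def __init__(self, clip_dim: int = 512, hidden: int = 256, charset: int = 97):
        super().__init__()
        self.proj = nn.Linear(clip_dim, hidden)  # text-embedding space -> decoder space
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=3,
        )
        self.char_embed = nn.Embedding(charset, hidden)
        self.out = nn.Linear(hidden, charset)

    def forward(self, pseudo_visual, target_chars):
        # pseudo_visual: (B, 1, clip_dim), e.g. clip_model.encode_text(word_tokens)
        memory = self.proj(pseudo_visual)
        tgt = self.char_embed(target_chars)  # (causal masking omitted in this sketch)
        return self.out(self.decoder(tgt, memory))

decoder = PseudoVisualDecoder()
pseudo = torch.randn(8, 1, 512)              # stand-in for CLIP text features
chars = torch.randint(0, 97, (8, 10))        # target character ids
logits = decoder(pseudo, chars)              # (8, 10, 97)
```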
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose DLoRA-TrOCR, a parameter-efficient mixed-text recognition method based on a pre-trained OCR Transformer; a hedged fine-tuning sketch follows this entry.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
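The entry above describes LoRA-style parameter-efficient tuning of a pre-trained OCR Transformer. A hedged sketch using the Hugging Face transformers and peft libraries follows; the LoRA hyperparameters and target module names are assumptions and may differ from the paper's adapter placement.

```python
# Hedged sketch: attach low-rank adapters to a pre-trained TrOCR model so
# only a small fraction of parameters is trained.
from transformers import VisionEncoderDecoderModel
from peft import LoraConfig, get_peft_model

base = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # decoder attention projections (assumed)
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # only the low-rank adapters are trainable
```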
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [8.397246652127793]
We propose a new pre-training method called OCR-Text Destylization Modeling (ODM).
ODM transfers diverse styles of text found in images to a uniform style based on the text prompt.
Our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks.
arXiv Detail & Related papers (2024-03-01T06:13:53Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- STEP -- Towards Structured Scene-Text Spotting [9.339184886724812]
We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression.
We propose the Structured TExt sPotter (STEP), a model that exploits the provided text structure to guide the OCR process.
Our approach enables accurate zero-shot structured text spotting in a wide variety of real-world reading scenarios.
arXiv Detail & Related papers (2023-09-05T16:11:54Z)
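STEP conditions the OCR process itself on the query regular expression. As a point of reference, the naive alternative it improves on can be sketched as post-hoc filtering of ordinary OCR output; the helper below is hypothetical.

```python
# Hypothetical post-hoc baseline: run a normal OCR system, then keep only the
# words whose transcription fully matches the query regular expression.
import re

def filter_spotted_text(ocr_results, query_pattern):
    """Keep (box, text) pairs whose text fully matches the query regex."""
    pattern = re.compile(query_pattern)
    return [(box, text) for box, text in ocr_results if pattern.fullmatch(text)]

# Example: spot license-plate-like strings (three letters then three digits).
detections = [((10, 20, 80, 40), "ABC123"), ((5, 60, 90, 80), "EXIT")]
print(filter_spotted_text(detections, r"[A-Z]{3}[0-9]{3}"))
# [((10, 20, 80, 40), 'ABC123')]
```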
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
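The mutual training of classification, segmentation, and recognition branches described above amounts to a weighted multi-task objective over shared features. A minimal sketch follows; the branch losses and weights are illustrative assumptions, not the paper's.

```python
# Toy weighted multi-task objective for a text spotter with three branches.
import torch
import torch.nn as nn

cls_loss = nn.BCEWithLogitsLoss()  # text / non-text classification branch
seg_loss = nn.BCEWithLogitsLoss()  # mask branch (dice loss is also common)
rec_loss = nn.CrossEntropyLoss()   # per-character recognition branch

def spotter_loss(cls_logits, cls_gt, seg_logits, seg_gt, rec_logits, rec_gt,
                 weights=(1.0, 1.0, 0.5)):
    # rec_logits: (B, T, C) character logits; rec_gt: (B, T) character ids
    return (weights[0] * cls_loss(cls_logits, cls_gt) +
            weights[1] * seg_loss(seg_logits, seg_gt) +
            weights[2] * rec_loss(rec_logits.flatten(0, 1), rec_gt.flatten()))

# Example shapes: 2 queries, 32x32 masks, 10 characters over a 97-way charset.
loss = spotter_loss(torch.randn(2, 1), torch.rand(2, 1).round(),
                    torch.randn(2, 32, 32), torch.rand(2, 32, 32).round(),
                    torch.randn(2, 10, 97), torch.randint(0, 97, (2, 10)))
```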
- Text Detection Forgot About Document OCR [0.0]
This paper compares several methods designed for in-the-wild text recognition and for document text recognition.
The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve excellent results on document text detection.
arXiv Detail & Related papers (2022-10-14T15:37:54Z)
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
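One concrete way to realize an OCR-aware objective like the one summarized above is to serialize OCR tokens into a generation target for an image-to-text model. The reading-order heuristic and prompt format below are assumptions for illustration, not PreSTU's exact recipe.

```python
# Toy data-side sketch: build a text-generation target from OCR tokens so an
# image-to-text model can be pre-trained to emit the scene text.
def build_ocr_aware_target(ocr_tokens, prompt="text:"):
    """Serialize OCR tokens (top-to-bottom, left-to-right) into one target string."""
    ordered = sorted(ocr_tokens, key=lambda t: (t["y"], t["x"]))
    return prompt + " " + " ".join(t["word"] for t in ordered)

tokens = [{"word": "OPEN", "x": 40, "y": 10}, {"word": "24H", "x": 42, "y": 30}]
print(build_ocr_aware_target(tokens))  # "text: OPEN 24H"
```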
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
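A common recipe for weakly supervised alignment of an image encoder and a text encoder is a symmetric contrastive (InfoNCE) loss over matched pairs. The sketch below shows that general recipe as an assumption; it is not necessarily the paper's exact objective.

```python
# Symmetric contrastive alignment between image and text features: matched
# pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(img.size(0))        # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = contrastive_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
```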
- Text Prior Guided Scene Text Image Super-resolution [11.396781380648756]
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images.
We attempt to embed a categorical text prior into STISR model training.
We present a multi-stage text prior guided super-resolution framework for STISR.
arXiv Detail & Related papers (2021-06-29T12:52:33Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
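TAP's key move, per the summary above, is injecting OCR-engine output into the pre-training input. A minimal, hypothetical sketch of such input construction, with segment ids distinguishing question, OCR, and object tokens; the token and segment conventions are assumptions, not the paper's exact format.

```python
# Toy multimodal input construction: concatenate question tokens, OCR tokens,
# and object tags into one sequence with segment ids so a transformer can
# attend across modalities during pre-training.
def build_text_aware_input(question_tokens, ocr_tokens, object_tags):
    tokens = (["[CLS]"] + question_tokens + ["[SEP]"] +
              ocr_tokens + ["[SEP]"] + object_tags)
    segments = ([0] * (len(question_tokens) + 2) +   # question span (+ CLS/SEP)
                [1] * (len(ocr_tokens) + 1) +        # OCR span (+ SEP)
                [2] * len(object_tags))              # object-tag span
    return tokens, segments

toks, segs = build_text_aware_input(["what", "does", "the", "sign", "say"],
                                    ["STOP"], ["sign", "pole"])
```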