TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model
- URL: http://arxiv.org/abs/2403.10047v1
- Date: Fri, 15 Mar 2024 06:38:25 GMT
- Title: TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model
- Authors: Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou
- Abstract summary: Existing scene text spotters are designed to locate and transcribe texts from images.
Our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection.
Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios.
- Score: 17.77384627944455
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model on scene recognition benchmarks and the paradigm of text block detection, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even Large Language Models (LLMs).
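The two-stage pipeline described in the abstract (rough block-level detection followed by PLM-based recognition of each block) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `spot_text`, `crop`, `SpottedBlock`, `toy_detector`, and `toy_recognizer`, the axis-aligned `(x0, y0, x1, y1)` box format, and the nested-list image are all assumptions made for the example, and the detector and recognizer are stand-in stubs rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumptions for illustration only: a toy grayscale image stored as rows of
# pixel values, and axis-aligned block boxes given as (x0, y0, x1, y1).
Image = List[List[int]]
Box = Tuple[int, int, int, int]


@dataclass
class SpottedBlock:
    box: Box   # rough block-level location
    text: str  # transcription of the whole block


def crop(image: Image, box: Box) -> Image:
    """Cut the block region out of the image (x1 and y1 are exclusive)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def spot_text(
    image: Image,
    detect_blocks: Callable[[Image], List[Box]],
    recognize_block: Callable[[Image], str],
) -> List[SpottedBlock]:
    """Stage 1: coarse block detection. Stage 2: recognition per block."""
    results = []
    for box in detect_blocks(image):
        text = recognize_block(crop(image, box))
        results.append(SpottedBlock(box=box, text=text))
    return results


# Hypothetical stand-ins for the simple detector and the fine-tuned PLM.
def toy_detector(image: Image) -> List[Box]:
    # Pretend the whole image is a single text block.
    height, width = len(image), len(image[0])
    return [(0, 0, width, height)]


def toy_recognizer(block: Image) -> str:
    # A real recognizer would decode text; this stub only reports the size.
    return f"<{len(block[0])}x{len(block)} block>"


image = [[0] * 8 for _ in range(4)]
blocks = spot_text(image, toy_detector, toy_recognizer)
```

Passing the detector and recognizer as plain callables keeps the two stages decoupled, which mirrors the point made in the abstract: the recognizer only needs a rough block crop, not precise word-level or character-level boundaries.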
Related papers
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis [52.34110239735265]
We present Text Grouping Adapter (TGA), a module that can enable the utilization of various pre-trained text detectors to learn layout analysis.
Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance.
arXiv Detail & Related papers (2024-05-13T05:48:35Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Contextual Text Block Detection towards Scene Text Understanding [85.40898487745272]
This paper presents contextual text detection, a new setup that detects contextual text blocks (CTBs) for better understanding of texts in scenes.
We formulate the new setup as a dual detection task that first detects integral text units and then groups them into a CTB.
To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence.
arXiv Detail & Related papers (2022-07-26T14:59:25Z)
- Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter [34.09162878714425]
We propose the single shot Self-Reliant Scene Text Spotter (SRSTS).
We conduct text detection and recognition in parallel and bridge them by the shared positive anchor point.
Our method is able to recognize the text instances correctly even though the precise text boundaries are challenging to detect.
arXiv Detail & Related papers (2022-07-15T01:59:14Z)
- Text Detection & Recognition in the Wild for Robot Localization [1.52292571922932]
We propose an end-to-end scene text spotting model that simultaneously outputs the text string and bounding boxes.
Our central contribution is the use of an end-to-end scene text spotting framework to adequately capture irregular and occluded text regions.
arXiv Detail & Related papers (2022-05-17T18:16:34Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting [11.705454066278898]
We propose DEER, a novel Detection-agnostic End-to-End Recognizer framework.
The proposed method reduces the tight dependency between detection and recognition modules.
It achieves competitive results on regular and arbitrarily-shaped text spotting benchmarks.
arXiv Detail & Related papers (2022-03-10T02:41:05Z)
- SPTS: Single-Point Text Spotting [128.52900104146028]
We show that scene text spotting models can be trained with extremely low-cost annotation: a single point for each instance.
We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task.
arXiv Detail & Related papers (2021-12-15T06:44:21Z)
- Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition [8.480710920894547]
We propose a text-spotting method, named Context-Free TextSpotter, that consists of simple convolutions and a few post-processing steps.
Experiments using standard benchmarks show that Context-Free TextSpotter achieves real-time text spotting on a GPU with only three million parameters, which is the smallest and fastest among existing deep text spotters.
Our text spotter can run on a smartphone with affordable latency, which is valuable for building stand-alone OCR applications.
arXiv Detail & Related papers (2021-06-10T09:32:52Z)
- AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [98.08853679310603]
This work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter).
AE TextSpotter learns both visual and linguistic features to significantly reduce ambiguity in text detection.
To our knowledge, this is the first work to improve text detection by using a language model.
arXiv Detail & Related papers (2020-08-03T08:40:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.