SPTS: Single-Point Text Spotting
- URL: http://arxiv.org/abs/2112.07917v1
- Date: Wed, 15 Dec 2021 06:44:21 GMT
- Title: SPTS: Single-Point Text Spotting
- Authors: Dezhi Peng, Xinyu Wang, Yuliang Liu, Jiaxin Zhang, Mingxin Huang,
Songxuan Lai, Shenggao Zhu, Jing Li, Dahua Lin, Chunhua Shen, Lianwen Jin
- Abstract summary: We show that training scene text spotting models can be achieved with an extremely low-cost annotation of a single point for each instance.
We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task.
- Score: 128.52900104146028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Almost all scene text spotting (detection and recognition) methods rely on
costly box annotation (e.g., text-line box, word-level box, and character-level
box). For the first time, we demonstrate that training scene text spotting
models can be achieved with an extremely low-cost annotation of a single point
for each instance. We propose an end-to-end scene text spotting method that
tackles scene text spotting as a sequence prediction task, like language
modeling. Given an image as input, we formulate the desired detection and
recognition results as a sequence of discrete tokens and use an auto-regressive
transformer to predict the sequence. We achieve promising results on several
horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most
significantly, we show that the performance is not very sensitive to the
position of the point annotation, meaning that points can be annotated and
automatically generated far more easily than bounding boxes, which require
precise positions. We believe that such a pioneering attempt indicates a
significant opportunity for scene text spotting applications of a much larger
scale than previously possible.
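To make the sequence formulation concrete, here is a minimal sketch of how detection and recognition results might be serialized into discrete tokens. The bin count, character set, and end-of-sequence token id are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of serializing single-point annotations into one discrete
# token sequence, in the spirit of SPTS. NUM_BINS, CHARSET, and the EOS id
# are illustrative assumptions, not the paper's exact configuration.

NUM_BINS = 1000                      # coordinate quantization bins (assumed)
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_OFFSET = NUM_BINS               # character tokens follow coordinate bins
EOS = CHAR_OFFSET + len(CHARSET)     # end-of-sequence token id (assumed)

def quantize(coord: float, size: int) -> int:
    """Map a pixel coordinate to a discrete bin in [0, NUM_BINS - 1]."""
    return min(int(coord / size * NUM_BINS), NUM_BINS - 1)

def build_sequence(instances, img_w: int, img_h: int) -> list[int]:
    """Flatten (point, transcription) pairs into one token sequence."""
    tokens = []
    for (x, y), text in instances:
        tokens.append(quantize(x, img_w))               # x token
        tokens.append(quantize(y, img_h))               # y token
        tokens.extend(CHAR_OFFSET + CHARSET.index(c)    # transcription tokens
                      for c in text.lower() if c in CHARSET)
    tokens.append(EOS)
    return tokens

# Example: two instances in a 640x480 image.
print(build_sequence([((120.0, 60.0), "STOP"), ((400.0, 300.0), "exit")], 640, 480))
```

An auto-regressive transformer trained to emit such sequences then performs detection and recognition jointly, with decoding stopped at the EOS token.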
Related papers
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [17.77384627944455]
Existing scene text spotters are designed to locate and transcribe texts from images.
Our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection.
Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios.
arXiv Detail & Related papers (2024-03-15T06:38:25Z)
- DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting [112.45423990924283]
DeepSolo++ is a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously.
Our method not only performs well in English scenes but also masters transcription in scripts with complex font structures and thousands of character classes, such as Chinese.
arXiv Detail & Related papers (2023-05-31T15:44:00Z)
- Towards Unified Scene Text Spotting based on Sequence Generation [4.437335677401287]
We propose a UNIfied scene Text Spotter, called UNITS.
Our model unifies various detection formats, including quadrilaterals and polygons.
We apply starting-point prompting to enable the model to extract texts from an arbitrary starting point.
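As a rough illustration of the prompting idea (the token layout and the generate call below are assumptions, not the UNITS implementation), the starting point can be quantized into coordinate tokens and prepended as a decoder prefix:

```python
# Hypothetical sketch of starting-point prompting: quantize a chosen start
# point into coordinate tokens and use them as the decoder prefix, so
# generation extracts texts from that location onward.

NUM_BINS = 1000  # coordinate quantization bins (assumed)

def make_prompt(x: float, y: float, img_w: int, img_h: int) -> list[int]:
    to_bin = lambda v, size: min(int(v / size * NUM_BINS), NUM_BINS - 1)
    return [to_bin(x, img_w), to_bin(y, img_h)]

prompt = make_prompt(320, 240, 640, 480)   # start extraction mid-image
print(prompt)                              # [500, 500]
# tokens = model.generate(image_features, prefix=prompt)  # hypothetical API
```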
arXiv Detail & Related papers (2023-04-07T01:28:08Z)
- SPTS v2: Single-Point Scene Text Spotting [146.98118405786445]
A new framework, SPTS v2, allows us to train high-performing text-spotting models using a single-point annotation.
Tests show SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters.
Experiments suggest the significant potential of the single-point representation for scene text spotting.
arXiv Detail & Related papers (2023-01-04T14:20:14Z)
- DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting [129.73247700864385]
DeepSolo is a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously.
We introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training.
arXiv Detail & Related papers (2022-11-19T19:06:22Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection.
However, the point label form used implies a human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
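A minimal sketch of this idea, using a standard PyTorch decoder layer with assumed dimensions (a stand-in for illustration, not the DPText-DETR code):

```python
import torch
import torch.nn as nn

class PointQueryDecoder(nn.Module):
    """Point coordinates act as queries and are refined between layers."""

    def __init__(self, d_model=256, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.pos_embed = nn.Linear(2, d_model)   # encode (x, y) as query position
        self.refine = nn.Linear(d_model, 2)      # per-layer coordinate offset

    def forward(self, points, content, memory):
        # points: (B, N, 2) normalized coords; content: (B, N, d); memory: (B, HW, d)
        for layer in self.layers:
            queries = content + self.pos_embed(points)    # point-derived queries
            content = layer(queries, memory)
            # dynamically update the point coordinates from the decoded content
            points = (points + self.refine(content)).clamp(0.0, 1.0)
        return points, content

dec = PointQueryDecoder()
pts = torch.rand(2, 16, 2)          # 16 candidate points per image (assumed)
ctx = torch.zeros(2, 16, 256)       # content queries (learned in practice)
mem = torch.randn(2, 400, 256)      # flattened encoder features
refined, feats = dec(pts, ctx, mem)
```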
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection [47.820683360286786]
We present a transformer-based architecture for scene text detection.
We first select a few representative features at all scales that are highly relevant to foreground text.
As each feature group corresponds to a text instance, its bounding box can be easily obtained without any post-processing operation.
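The sketch below illustrates the sampling step with a hypothetical scoring head and assumed dimensions; it is not the paper's implementation, and the grouping of sampled features into instances would follow:

```python
import torch
import torch.nn as nn

def sample_text_features(feat_map, score_head, k=256):
    """Keep only the k pixel features scored most text-relevant.

    feat_map: (B, C, H, W) backbone features; returns (B, k, C).
    """
    b, c, h, w = feat_map.shape
    flat = feat_map.flatten(2).transpose(1, 2)     # (B, H*W, C)
    scores = score_head(flat).squeeze(-1)          # (B, H*W) text-ness score
    topk = scores.topk(k, dim=1).indices           # indices of the best k
    idx = topk.unsqueeze(-1).expand(-1, -1, c)
    return flat.gather(1, idx)                     # sampled features only

score_head = nn.Linear(256, 1)                     # hypothetical scoring head
feats = torch.randn(2, 256, 32, 32)                # assumed feature map
sampled = sample_text_features(feats, score_head)  # (2, 256, 256)
```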
arXiv Detail & Related papers (2022-03-29T04:02:31Z)
- Scene Text Detection with Scribble Lines [59.698806258671105]
We propose to annotate texts by scribble lines instead of polygons for text detection.
It is a general labeling method for texts of various shapes and has a low labeling cost.
Experiments show that the proposed method bridges the performance gap between this weak labeling method and the original polygon-based labeling methods.
arXiv Detail & Related papers (2020-12-09T13:14:53Z)
- MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [41.66707532607276]
We propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO.
The proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks.
arXiv Detail & Related papers (2020-12-08T10:47:49Z)