DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text
Spotting
- URL: http://arxiv.org/abs/2211.10772v2
- Date: Wed, 23 Nov 2022 07:36:17 GMT
- Authors: Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo
Du, Dacheng Tao
- Abstract summary: DeepSolo is a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously.
We introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training.
- Score: 129.73247700864385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end text spotting aims to integrate scene text detection and
recognition into a unified framework. Dealing with the relationship between the
two sub-tasks plays a pivotal role in designing effective spotters. Although
transformer-based methods eliminate the heuristic post-processing, they still
suffer from the synergy issue between the sub-tasks and low training
efficiency. In this paper, we present DeepSolo, a simple detection transformer
baseline that lets a single Decoder with Explicit Points Solo for text
detection and recognition simultaneously. Technically, for each text instance,
we represent the character sequence as ordered points and model them with
learnable explicit point queries. After passing a single decoder, the point
queries have encoded requisite text semantics and locations and thus can be
further decoded to the center line, boundary, script, and confidence of text
via very simple prediction heads in parallel, solving the sub-tasks in text
spotting in a unified framework. Besides, we also introduce a text-matching
criterion to deliver more accurate supervisory signals, thus enabling more
efficient training. Quantitative experiments on public benchmarks demonstrate
that DeepSolo outperforms previous state-of-the-art methods and achieves better
training efficiency. In addition, DeepSolo is also compatible with line
annotations, which require much less annotation cost than polygons. The code
will be released.
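The core idea in the abstract, ordered point queries that a single decoder refines and that simple parallel heads then read off, can be illustrated with a minimal, dependency-free sketch. This is not the authors' released code: the query count, embedding size, head shapes, and names below are illustrative assumptions, and the "decoder output" is stubbed with random vectors.

```python
# Toy sketch (NOT the DeepSolo implementation): each text instance is a
# sequence of N ordered point queries; after the decoder, parallel one-layer
# heads predict center line, boundary, character class, and confidence
# per point. All sizes here are illustrative assumptions.
import random

random.seed(0)

N_POINTS = 25    # ordered points modeling the character sequence
D_MODEL = 8      # toy query-embedding size
N_CLASSES = 5    # toy character-class count

def linear_head(q, w, b):
    """A one-layer 'prediction head': y[j] = sum_i q[i] * w[j][i] + b[j]."""
    return [sum(qi * wij for qi, wij in zip(q, row)) + bj
            for row, bj in zip(w, b)]

def make_weights(d_in, d_out):
    """Random weights standing in for a trained head."""
    return ([[random.uniform(-1, 1) for _ in range(d_in)]
             for _ in range(d_out)],
            [0.0] * d_out)

# Stand-in for the decoder output: one refined embedding per point query.
queries = [[random.uniform(-1, 1) for _ in range(D_MODEL)]
           for _ in range(N_POINTS)]

# Parallel heads, one per sub-task; detection and recognition share queries.
center_head = make_weights(D_MODEL, 2)         # (x, y) on the center line
boundary_head = make_weights(D_MODEL, 4)       # offsets to top/bottom boundary
class_head = make_weights(D_MODEL, N_CLASSES)  # character logits (script)
conf_head = make_weights(D_MODEL, 1)           # per-point confidence

decoded = [{
    "center": linear_head(q, *center_head),
    "boundary": linear_head(q, *boundary_head),
    "char_logits": linear_head(q, *class_head),
    "confidence": linear_head(q, *conf_head)[0],
} for q in queries]

print(len(decoded), len(decoded[0]["char_logits"]))  # 25 5
```

Because every head reads the same shared point queries, detection and recognition are solved by one decoder in parallel rather than by cascaded sub-modules, which is the "synergy" point the abstract makes.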
Related papers
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting [112.45423990924283]
DeepSolo++ is a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously.
Our method not only performs well in English scenes but also masters transcription of scripts with complex font structures and thousands of character classes, such as Chinese.
arXiv Detail & Related papers (2023-05-31T15:44:00Z)
- SPTS v2: Single-Point Scene Text Spotting [146.98118405786445]
The new framework, SPTS v2, allows training high-performing text-spotting models using a single-point annotation.
Tests show SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters.
Experiments suggest a potential preference for single-point representation in scene text spotting.
arXiv Detail & Related papers (2023-01-04T14:20:14Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection.
However, the point label form used implies a human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS is trained in both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter achieves state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- SPTS: Single-Point Text Spotting [128.52900104146028]
We show that scene text spotting models can be trained with an extremely low-cost annotation: a single point per instance.
We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task.
arXiv Detail & Related papers (2021-12-15T06:44:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.