DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting
- URL: http://arxiv.org/abs/2305.19957v2
- Date: Mon, 18 Mar 2024 13:30:03 GMT
- Title: DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting
- Authors: Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao
- Abstract summary: DeepSolo++ is a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously.
Our method not only performs well in English scenes but also masters transcription of scripts with complex font structures and thousand-level character sets, such as Chinese.
- Score: 112.45423990924283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and from low training efficiency. Moreover, they overlook multilingual text spotting, which requires an extra script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations, and can thus be decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel. Furthermore, we show the surprisingly good extensibility of our method in terms of character class, language type, and task. On the one hand, our method not only performs well in English scenes but also masters transcription of scripts with complex font structures and thousand-level character sets, such as Chinese. On the other hand, our DeepSolo++ achieves better performance on the additionally introduced script identification task with a simpler training pipeline than previous methods. In addition, our models are compatible with line annotations, which require much less annotation cost than polygons. The code is available at \url{https://github.com/ViTAE-Transformer/DeepSolo}.
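The abstract describes the core mechanism: ordered point queries for a text instance pass through a single Transformer decoder and are then read out by simple parallel prediction heads for center line, boundary, character class, script, and confidence. The PyTorch sketch below is a minimal illustration of that idea, not the released DeepSolo++ implementation; the module names, dimensions, number of points, character/script counts, and the single-instance simplification are all assumptions for illustration.

```python
# Illustrative sketch (not the official code): one text instance is modeled as a
# fixed number of ordered point queries; a shared decoder attends them to image
# features, and lightweight parallel heads decode geometry, characters, script,
# and confidence. All sizes below are assumed values.
import torch
import torch.nn as nn

class PointQuerySpottingHead(nn.Module):
    def __init__(self, d_model=256, num_points=25, num_chars=5462, num_scripts=5):
        super().__init__()
        # One learnable query per sampled point along the text center line.
        self.point_queries = nn.Embedding(num_points, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Very simple prediction heads applied in parallel to the decoded queries.
        self.center_head = nn.Linear(d_model, 2)            # (x, y) per center-line point
        self.boundary_head = nn.Linear(d_model, 4)           # top/bottom boundary offsets
        self.char_head = nn.Linear(d_model, num_chars)       # character class per point
        self.score_head = nn.Linear(d_model, 1)              # instance confidence (pooled)
        self.script_head = nn.Linear(d_model, num_scripts)   # script id (pooled)

    def forward(self, image_memory):
        # image_memory: (batch, num_tokens, d_model) features from an image encoder.
        b = image_memory.size(0)
        queries = self.point_queries.weight.unsqueeze(0).expand(b, -1, -1)
        points = self.decoder(queries, image_memory)          # (batch, num_points, d_model)
        pooled = points.mean(dim=1)                           # instance-level summary
        return {
            "center": self.center_head(points).sigmoid(),
            "boundary": self.boundary_head(points),
            "chars": self.char_head(points),
            "score": self.score_head(pooled).sigmoid(),
            "script": self.script_head(pooled),
        }

# Toy usage with dummy encoder features (batch=2, 32x32 tokens, 256-dim).
heads = PointQuerySpottingHead()
out = heads(torch.randn(2, 1024, 256))
print(out["center"].shape, out["chars"].shape)  # (2, 25, 2) (2, 25, 5462)
```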
Related papers
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [17.77384627944455]
Existing scene text spotters are designed to locate and transcribe texts from images.
Our proposed scene text spotter leverages advanced pre-trained language models (PLMs) to enhance performance without fine-grained detection.
Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios.
arXiv Detail & Related papers (2024-03-15T06:38:25Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- SPTS v2: Single-Point Scene Text Spotting [146.98118405786445]
The new framework, SPTS v2, allows us to train high-performing text-spotting models using a single-point annotation.
Tests show SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters.
Experiments suggest a potential preference for single-point representation in scene text spotting.
arXiv Detail & Related papers (2023-01-04T14:20:14Z)
- DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting [129.73247700864385]
DeepSolo is a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously.
We introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training.
arXiv Detail & Related papers (2022-11-19T19:06:22Z)
- SPTS: Single-Point Text Spotting [128.52900104146028]
We show that scene text spotting models can be trained with an extremely low-cost annotation: a single point for each instance.
We propose an end-to-end method that tackles scene text spotting as a sequence prediction task; a toy serialization sketch follows this entry.
arXiv Detail & Related papers (2021-12-15T06:44:21Z)
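For the single-point, sequence-prediction formulation summarized in the SPTS entry above, the snippet below is a hedged sketch of how one instance (a point plus its transcript) could be serialized into a discrete token sequence for an auto-regressive decoder. The bin count, character set, and special tokens are assumptions for illustration, not the paper's exact vocabulary.

```python
# Hedged sketch: serialize a single-point annotation and its transcript into one
# token sequence. Coordinates are quantized into bins; character tokens follow
# the coordinate tokens; EOS terminates the instance. Values are assumed.
NUM_BINS = 1000                      # coordinates quantized into 1000 bins
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_OFFSET = NUM_BINS               # character tokens start after coordinate tokens
EOS = CHAR_OFFSET + len(CHARSET)     # end-of-sequence token

def instance_to_tokens(x, y, transcript, img_w, img_h):
    """Serialize one (point, transcript) annotation into a token sequence."""
    tokens = [
        int(x / img_w * (NUM_BINS - 1)),   # quantized x coordinate
        int(y / img_h * (NUM_BINS - 1)),   # quantized y coordinate
    ]
    tokens += [CHAR_OFFSET + CHARSET.index(c) for c in transcript.lower() if c in CHARSET]
    tokens.append(EOS)
    return tokens

# Example: the word "STOP" whose annotated point sits at (320, 240) in a 640x480 image.
print(instance_to_tokens(320, 240, "STOP", 640, 480))
```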