ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer
- URL: http://arxiv.org/abs/2308.10147v1
- Date: Sun, 20 Aug 2023 03:22:23 GMT
- Authors: Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang
Liu, Xiang Bai, Lianwen Jin
- Abstract summary: We introduce Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter)
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, end-to-end scene text spotting approaches have been
evolving toward Transformer-based frameworks. While previous studies have shown
the crucial importance of the intrinsic synergy between text detection and
recognition, recent Transformer-based methods usually adopt an implicit synergy
strategy with a shared query, which cannot fully realize the potential of these
two interactive tasks. In this paper, we argue that an explicit synergy that
considers the distinct characteristics of text detection and recognition can
significantly improve the performance of text spotting. To this end, we introduce
a new model named Explicit Synergy-based Text Spotting Transformer framework
(ESTextSpotter), which achieves explicit synergy by modeling discriminative and
interactive features for text detection and recognition within a single
decoder. Specifically, we decompose the conventional shared query into
task-aware queries for text polygon and content, respectively. Through the
decoder with the proposed vision-language communication module, the queries
interact with each other in an explicit manner while preserving discriminative
patterns of text detection and recognition, thus improving performance
significantly. Additionally, we propose a task-aware query initialization
scheme to ensure stable training. Experimental results demonstrate that our
model significantly outperforms previous state-of-the-art methods. Code is
available at https://github.com/mxin262/ESTextSpotter.
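The abstract describes decomposing the conventional shared query into separate task-aware queries (one for the text polygon, one for the text content) that interact explicitly inside a single decoder. The toy NumPy sketch below illustrates that idea only; it is not the authors' implementation, and the function names, residual combination, and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, d):
    # Scaled dot-product attention: `queries` attend to `keys_values`.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 16       # embedding dimension (illustrative)
n_inst = 4   # number of text instances

# Instead of one shared query per text instance, keep two task-aware
# queries: one for the text polygon (detection) and one for the
# transcription content (recognition).
det_q = rng.standard_normal((n_inst, d))
rec_q = rng.standard_normal((n_inst, d))

# Toy stand-in for the vision-language communication step: each task's
# queries attend to the other task's queries and are combined
# residually, so detection and recognition features interact explicitly
# while remaining separate streams.
det_q_new = det_q + cross_attend(det_q, rec_q, d)
rec_q_new = rec_q + cross_attend(rec_q, det_q, d)

print(det_q_new.shape, rec_q_new.shape)  # (4, 16) (4, 16)
```

In the paper the interaction happens inside each decoder layer over image features as well; this sketch isolates only the query-to-query communication to show why the two streams stay discriminative yet interactive.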
Related papers
- SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting [126.01629300244001]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2.
We enhance the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules.
SwinTextSpotter v2 achieved state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks.
arXiv Detail & Related papers (2024-01-15T12:33:00Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition [73.61592015908353]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a transformer with dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither additional rectification module nor character-level annotation.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS can be trained in both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding features into text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.