AE TextSpotter: Learning Visual and Linguistic Representation for
Ambiguous Text Spotting
- URL: http://arxiv.org/abs/2008.00714v5
- Date: Tue, 6 Jul 2021 14:06:06 GMT
- Title: AE TextSpotter: Learning Visual and Linguistic Representation for
Ambiguous Text Spotting
- Authors: Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, Zhibo
Yang, Tong Lu, Chunhua Shen, Ping Luo
- Abstract summary: This work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter).
AE TextSpotter learns both visual and linguistic features to significantly reduce ambiguity in text detection.
To our knowledge, this is the first work to improve text detection by using a language model.
- Score: 98.08853679310603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text spotting aims to detect and recognize the entire word or sentence
with multiple characters in natural images. It is still challenging because
ambiguity often occurs when the spacing between characters is large or the
characters are evenly spread across multiple rows and columns, which makes many
groupings of the characters visually plausible (e.g., "BERLIN" is incorrectly detected as
"BERL" and "IN" in Fig. 1(c)). Unlike previous works that merely employed
visual features for text detection, this work proposes a novel text spotter,
named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both
visual and linguistic features to significantly reduce ambiguity in text
detection. The proposed AE TextSpotter has three important benefits. 1) The
linguistic representation is learned together with the visual representation in a
single framework. To our knowledge, this is the first work to improve text detection
by using a language model. 2) A carefully designed language module is utilized
to reduce the detection confidence of incorrect text lines, so that they can be easily
pruned in the detection stage. 3) Extensive experiments show that AE
TextSpotter outperforms other state-of-the-art methods by a large margin. For
example, we carefully select a validation set of extremely ambiguous samples
from the IC19-ReCTS dataset, where our approach surpasses other methods by more
than 4%. The code has been released at
https://github.com/whai362/AE_TextSpotter. The image list and evaluation
scripts of the validation set have been released at
https://github.com/whai362/TDA-ReCTS.
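To make benefit 2) concrete, below is a minimal, self-contained sketch of the re-ranking idea: score each candidate text line with a character-level language model and mix that score into the detector's confidence, so that linguistically unlikely groupings such as "BERL" + "IN" fall below "BERLIN". The toy bigram model, the mixing weight, and all function names are illustrative assumptions, not the paper's language module; the authors' actual implementation is in the repository linked above.

```python
# Sketch (not the authors' implementation): re-score candidate text lines
# with a character-level language model so that visually plausible but
# linguistically unlikely groupings can be pruned. The bigram model and
# the mixing weight are illustrative assumptions.
import math
from collections import defaultdict

def train_char_bigram(corpus):
    """Count character bigrams over a (here tiny) text corpus."""
    counts, totals = defaultdict(int), defaultdict(int)
    for word in corpus:
        padded = "^" + word + "$"  # mark word start/end
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def lm_log_prob(text, counts, totals, vocab_size=60):
    """Add-one-smoothed log-probability of a candidate text line."""
    padded = "^" + text + "$"
    return sum(
        math.log((counts[(a, b)] + 1) / (totals[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )

def rescore(candidates, counts, totals, weight=0.5):
    """Mix visual detection confidence with a per-character LM score."""
    out = []
    for text, visual_score in candidates:
        lm_score = lm_log_prob(text, counts, totals) / max(len(text), 1)
        out.append((text, visual_score + weight * lm_score))
    return sorted(out, key=lambda t: -t[1])

counts, totals = train_char_bigram(["berlin", "berliner", "in", "berth"])
# Two visually plausible groupings of the same characters:
candidates = [("berlin", 0.90), ("berl", 0.91), ("in", 0.92)]
for text, score in rescore(candidates, counts, totals):
    print(f"{text!r}: {score:.3f}")
```

Running this prints the re-ranked candidates: with the tiny corpus above, "berlin" overtakes the fragment "berl" because the word-ending bigram "l$" is unseen. In the full system a trained language model plays the role of the bigram table.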
Related papers
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [17.77384627944455]
Existing scene text spotters are designed to locate and transcribe text in images.
Our proposed scene text spotter leverages advanced pre-trained language models (PLMs) to enhance performance without fine-grained detection.
Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios.
arXiv Detail & Related papers (2024-03-15T06:38:25Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
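As a hedged sketch of the replacement described in the entry above, the snippet below swaps the one-hot next-character target of an auto-regressive recognizer for a mixture of the one-hot label and a character prior estimated from a corpus. This amounts to label smoothing with a corpus-derived prior, standing in for the paper's learned text distributions; the unigram estimate, the mixing coefficient alpha, and all names are assumptions for illustration.

```python
# Sketch of corpus-prior soft labels replacing one-hot recognition targets.
# The unigram prior and alpha are illustrative, not the paper's method.
import torch
import torch.nn.functional as F

def corpus_char_distribution(corpus, charset):
    """Character frequencies from a text corpus as a probability vector."""
    freq = torch.zeros(len(charset))
    index = {c: i for i, c in enumerate(charset)}
    for word in corpus:
        for ch in word:
            if ch in index:
                freq[index[ch]] += 1
    return freq / freq.sum()

def soft_target_loss(logits, target_ids, prior, alpha=0.9):
    """Cross-entropy against a mix of one-hot labels and a corpus prior."""
    one_hot = F.one_hot(target_ids, num_classes=logits.size(-1)).float()
    soft = alpha * one_hot + (1 - alpha) * prior  # rows still sum to 1
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

charset = "abcdefghijklmnopqrstuvwxyz"
prior = corpus_char_distribution(["street", "exit", "cafe"], charset)
logits = torch.randn(4, len(charset))   # 4 decoding steps
targets = torch.tensor([4, 23, 8, 19])  # "exit"
print(soft_target_loss(logits, targets, prior).item())
```

A richer prior (e.g. one conditioned on the decoded prefix) would track the paper's auto-regressive setting more closely; the unigram version keeps the sketch short.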
- SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting [126.01629300244001]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2.
We enhance the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules.
SwinTextSpotter v2 achieved state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks.
arXiv Detail & Related papers (2024-01-15T12:33:00Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- A3S: Adversarial learning of semantic representations for Scene-Text Spotting [0.0]
Scene-text spotting is a task that predicts text regions in natural scene images and simultaneously recognizes their characters.
We propose adversarial learning of semantic representations for scene text spotting (A3S) to improve end-to-end accuracy, including text recognition.
A3S simultaneously predicts semantic features in the detected text area instead of only performing text recognition based on existing visual features.
arXiv Detail & Related papers (2023-02-21T12:59:18Z)
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition [73.61592015908353]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a Transformer with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither an additional rectification module nor character-level annotation.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting [11.705454066278898]
We propose DEER, a novel Detection-agnostic End-to-End Recognizer framework.
The proposed method reduces the tight dependency between detection and recognition modules.
It achieves competitive results on regular and arbitrarily-shaped text spotting benchmarks.
arXiv Detail & Related papers (2022-03-10T02:41:05Z)
- CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning [65.57338873921168]
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision.
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module.
We integrate the CORE module into a two-stage Mask R-CNN text detector and devise our text detector, CORE-Text.
arXiv Detail & Related papers (2021-12-14T16:22:25Z)
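The sub-text problem CORE-Text targets (a detector keeping fragments of one text instance as separate detections) lends itself to a contrastive formulation over region proposals. The sketch below is a generic InfoNCE-style loss that pulls together embeddings of proposals covering the same text instance and pushes apart the rest; it is not the paper's exact CORE module, and the embedding size, temperature, and function names are assumptions.

```python
# Generic InfoNCE-style contrastive loss over region-proposal embeddings:
# proposals from the same text instance are positives. Illustrative only,
# not the CORE module from the paper.
import torch
import torch.nn.functional as F

def relation_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """InfoNCE over proposal embeddings grouped by text-instance id."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                    # pairwise similarities
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool)
    pos = instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)
    pos = pos & ~mask_self                           # positives: same instance
    sim = sim.masked_fill(mask_self, float("-inf"))  # drop self-similarity
    log_prob = F.log_softmax(sim, dim=1)
    has_pos = pos.any(dim=1)                         # rows with >= 1 positive
    p = pos[has_pos].float()
    loss = -(log_prob[has_pos] * p).sum(1) / p.sum(1)
    return loss.mean()

# Six region proposals; proposals 0-2 cover one text line, 3-5 another.
emb = torch.randn(6, 128)
ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(relation_contrastive_loss(emb, ids).item())
```

In a two-stage detector such as Mask R-CNN, the embeddings would come from the proposals' RoI features, and a loss of this shape would be added to the standard detection losses.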
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.