Few Could Be Better Than All: Feature Sampling and Grouping for Scene
Text Detection
- URL: http://arxiv.org/abs/2203.15221v2
- Date: Wed, 30 Mar 2022 08:28:07 GMT
- Title: Few Could Be Better Than All: Feature Sampling and Grouping for Scene
Text Detection
- Authors: Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang,
Guanglong Hu, Xiang Bai
- Abstract summary: We present a transformer-based architecture for scene text detection.
We first select a few representative features at all scales that are highly relevant to foreground text.
As each feature group corresponds to a text instance, its bounding box can be easily obtained without any post-processing operation.
- Score: 47.820683360286786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, transformer-based methods have achieved promising progress in
object detection, as they can eliminate post-processing steps like NMS and enrich
the deep representations. However, these methods cannot cope well with scene
text due to its extreme variation in scales and aspect ratios. In this paper, we
present a simple yet effective transformer-based architecture for scene text
detection. Different from previous approaches that learn robust deep
representations of scene text in a holistic manner, our method performs scene
text detection based on a few representative features, which avoids the
disturbance by background and reduces the computational cost. Specifically, we
first select a few representative features at all scales that are highly
relevant to foreground text. Then, we adopt a transformer for modeling the
relationship of the sampled features, which effectively divides them into
reasonable groups. As each feature group corresponds to a text instance, its
bounding box can be easily obtained without any post-processing operation.
Using the basic feature pyramid network for feature extraction, our method
consistently achieves state-of-the-art results on several popular datasets for
scene text detection.
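The sampling-and-grouping pipeline described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the score maps are assumed inputs, and a simple proximity-based clustering stands in for the transformer grouping.

```python
import numpy as np

def sample_top_features(score_maps, k):
    # score_maps: list of (H, W) foreground-probability maps, one per pyramid level
    # (assumed stride doubling per level); keep only the k highest-scoring locations
    coords, scores = [], []
    for level, smap in enumerate(score_maps):
        stride = 2 ** level
        for (y, x), s in np.ndenumerate(smap):
            coords.append((x * stride, y * stride))
            scores.append(s)
    order = np.argsort(scores)[::-1][:k]
    return [coords[i] for i in order]

def group_and_box(points, dist_thresh):
    # stand-in for the transformer grouping: cluster sampled points by
    # Manhattan proximity, then read one box per group with no post-processing
    groups = []
    for p in points:
        for g in groups:
            if any(abs(p[0] - q[0]) + abs(p[1] - q[1]) <= dist_thresh for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return [(min(x for x, _ in g), min(y for _, y in g),
             max(x for x, _ in g), max(y for _, y in g)) for g in groups]
```

Because each group directly yields a bounding box, there is no NMS step; the quality of the result rests entirely on how well the sampling and grouping stages work.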
Related papers
- Real-Time Text Detection with Similar Mask in Traffic, Industrial, and Natural Scenes [31.180352896153682]
We propose an efficient multi-scene text detector that contains an effective text representation, Similar Mask (SM), and a feature correction module (FCM).
To validate SM-Net across scenes, we conduct experiments on traffic, industrial, and natural scene datasets.
arXiv Detail & Related papers (2024-11-05T04:08:59Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
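One way to read "low-rank approximation" of text shapes is as a shared linear shape subspace: each contour is encoded as a few coefficients over a learned basis. Below is a hedged, PCA-style sketch via SVD; it is not LRANet's exact formulation, and the function names are placeholders.

```python
import numpy as np

def fit_shape_basis(contours, rank):
    # contours: (N, 2K) array of flattened contour point sets
    mean = contours.mean(axis=0)
    _, _, vt = np.linalg.svd(contours - mean, full_matrices=False)
    return mean, vt[:rank]           # (2K,), (rank, 2K)

def encode(contour, mean, basis):
    return basis @ (contour - mean)  # a few coefficients per contour

def decode(coeffs, mean, basis):
    return mean + basis.T @ coeffs   # reconstruct the contour
```

The compactness claim follows directly: a contour of 2K coordinates is reduced to `rank` coefficients, and shapes reconstructed from the basis inherit the regularity of the training contours.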
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- Aggregated Text Transformer for Scene Text Detection [5.387121933662753]
We present the Aggregated Text TRansformer(ATTR), which is designed to represent texts in scene images with a multi-scale self-attention mechanism.
The multi-scale image representations are robust and contain rich information on text contents of various sizes.
The proposed method detects scene texts by representing each text instance as an individual binary mask, which is tolerant of curved text and regions with dense instances.
arXiv Detail & Related papers (2022-11-25T09:47:34Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection.
However, the point label form used implies a human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
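The dynamic-update idea can be sketched as iterative coordinate refinement, with each decoder layer abstracted as an offset predictor. This is a toy stand-in, not the DPText-DETR architecture; the layer callables are assumed placeholders for learned modules.

```python
import numpy as np

def refine_point_queries(points, layers):
    # points: (N, 2) normalized point-coordinate queries in [0, 1]
    # layers: one callable per decoder layer, predicting coordinate offsets
    for layer in layers:
        points = np.clip(points + layer(points), 0.0, 1.0)
    return points
```

Using the coordinates themselves as queries means each layer reasons about explicit positions rather than abstract embeddings, which is what makes the between-layer update well defined.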
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Arbitrary Shape Text Detection using Transformers [2.294014185517203]
We propose an end-to-end trainable architecture for arbitrary-shaped text detection using Transformers (DETR).
At its core, our proposed method leverages a bounding box loss function that accurately measures the arbitrary detected text regions' changes in scale and aspect ratio.
We evaluate our proposed model using Total-Text and CTW-1500 datasets for curved text, and MSRA-TD500 and ICDAR15 datasets for multi-oriented text.
arXiv Detail & Related papers (2022-02-22T22:36:29Z)
- Comprehensive Studies for Arbitrary-shape Scene Text Detection [78.50639779134944]
We propose a unified framework for the bottom-up based scene text detection methods.
Under the unified framework, we ensure the consistent settings for non-core modules.
Through comprehensive investigations and detailed analyses, the framework reveals the advantages and disadvantages of previous models.
arXiv Detail & Related papers (2021-07-25T13:18:55Z)
- CentripetalText: An Efficient Text Instance Representation for Scene Text Detection [19.69057252363207]
We propose an efficient text instance representation named CentripetalText (CT).
CT decomposes text instances into the combination of text kernels and centripetal shifts.
For the task of scene text detection, our approach achieves superior or competitive performance compared to other existing methods.
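The kernel-plus-centripetal-shift decomposition can be illustrated by the reconstruction step: each pixel follows its predicted shift, and if it lands inside a labeled kernel it joins that kernel's instance. This is a simplified sketch under the assumption that kernel labels and shifts are already predicted; in the paper both are network outputs.

```python
import numpy as np

def reconstruct_instances(kernel_labels, shift_map):
    # kernel_labels: (H, W) int map, 0 = background, k > 0 = kernel of instance k
    # shift_map: (H, W, 2) per-pixel centripetal shift (dy, dx) toward the kernel
    H, W = kernel_labels.shape
    instance = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            dy, dx = shift_map[y, x]
            ty, tx = int(y + dy), int(x + dx)
            if 0 <= ty < H and 0 <= tx < W and kernel_labels[ty, tx] > 0:
                instance[y, x] = kernel_labels[ty, tx]
    return instance
```

The representation is efficient because grouping reduces to a single per-pixel lookup rather than a search over candidate regions.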
arXiv Detail & Related papers (2021-07-13T09:34:18Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search all text instances from an image gallery, which are the same or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
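Once cross-modal embeddings are learned, the ranking step itself is plain cosine similarity. The sketch below shows only that generic retrieval step, with placeholder embeddings; it says nothing about how the paper trains the similarity.

```python
import numpy as np

def rank_instances(query_emb, instance_embs, top_k):
    # cosine similarity between the query text and each detected text instance
    q = query_emb / np.linalg.norm(query_emb)
    e = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    sims = e @ q
    order = np.argsort(sims)[::-1][:top_k]   # best-matching instances first
    return order, sims[order]
```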
arXiv Detail & Related papers (2021-04-04T07:18:38Z)