SRFormer: Text Detection Transformer with Incorporated Segmentation and
Regression
- URL: http://arxiv.org/abs/2308.10531v2
- Date: Sun, 24 Dec 2023 17:43:48 GMT
- Title: SRFormer: Text Detection Transformer with Incorporated Segmentation and
Regression
- Authors: Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng
- Abstract summary: We propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression.
Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers.
Extensive experiments highlight our method's exceptional robustness, superior training and data efficiency, and state-of-the-art performance.
- Score: 6.74412860849373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing techniques for text detection can be broadly classified into two
primary groups: segmentation-based and regression-based methods. Segmentation
models offer enhanced robustness to font variations but require intricate
post-processing, leading to high computational overhead. Regression-based
methods undertake instance-aware prediction but face limitations in robustness
and data efficiency due to their reliance on high-level representations. In our
academic pursuit, we propose SRFormer, a unified DETR-based model with
amalgamated Segmentation and Regression, aiming at the synergistic harnessing
of the inherent robustness in segmentation representations, along with the
straightforward post-processing of instance-level regression. Our empirical
analysis indicates that favorable segmentation predictions can be obtained at
the initial decoder layers. In light of this, we constrain the incorporation of
segmentation branches to the first few decoder layers and employ progressive
regression refinement in subsequent layers, achieving performance gains while
minimizing computational load from the mask. Furthermore, we propose a
Mask-informed Query Enhancement module. We take the segmentation result as a
natural soft-ROI to pool and extract robust pixel representations, which are
then employed to enhance and diversify instance queries. Extensive
experimentation across multiple benchmarks has yielded compelling findings,
highlighting our method's exceptional robustness, superior training and data
efficiency, as well as its state-of-the-art performance. Our code is available
at https://github.com/retsuh-bqw/SRFormer-Text-Det.
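The Mask-informed Query Enhancement idea above, using the segmentation prediction as a soft ROI to pool robust pixel features for the instance queries, can be sketched roughly as follows. The function name `soft_roi_pool`, the tensor shapes, and the toy inputs are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def soft_roi_pool(features, mask_probs):
    """Pool pixel features weighted by a soft segmentation mask.

    features:   (C, H, W) pixel representation map
    mask_probs: (H, W) per-instance segmentation probabilities in [0, 1]
    Returns a (C,) vector: the mask-weighted average of pixel features,
    which could then be used to enhance an instance query.
    """
    weights = mask_probs / (mask_probs.sum() + 1e-6)  # normalize mask weights
    return (features * weights[None, :, :]).sum(axis=(1, 2))

# Toy example: a 4-channel feature map and a mask covering the left half
feats = np.random.randn(4, 8, 8)
mask = np.zeros((8, 8))
mask[:, :4] = 1.0
query_update = soft_roi_pool(feats, mask)  # (4,) mask-pooled feature vector
```

With a mask concentrated on a single pixel, the pooled vector reduces to that pixel's feature, so the mask genuinely acts as a soft region of interest.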
Related papers
- Early Fusion of Features for Semantic Segmentation [10.362589129094975]
This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation.
Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012.
The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis.
arXiv Detail & Related papers (2024-02-08T22:58:06Z) - Target Variable Engineering [0.0]
We compare the predictive performance of regression models trained to predict numeric targets vs. classifiers trained to predict their binarized counterparts.
We find that regression requires significantly more computational effort to converge upon the optimal performance.
arXiv Detail & Related papers (2023-10-13T23:12:21Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segmentation segments an image region according to a natural language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Learning from Mistakes: Self-Regularizing Hierarchical Representations
in Point Cloud Semantic Segmentation [15.353256018248103]
LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding.
We present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model.
Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture.
arXiv Detail & Related papers (2023-01-26T14:52:30Z) - Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
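Making the binarization step differentiable is commonly formulated as a steep sigmoid applied to the gap between the probability map and a learned per-pixel threshold map. The sketch below illustrates that formulation; the amplification factor `k` and the toy inputs are assumptions for illustration:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate hard thresholding with a steep sigmoid so the
    binarization step remains differentiable during training."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

P = np.array([[0.9, 0.4],
              [0.6, 0.1]])        # predicted text probability map
T = np.full((2, 2), 0.5)         # learned per-pixel threshold map
B = differentiable_binarization(P, T)
# Pixels well above their threshold approach 1; those below approach 0.
```

At inference time the soft map can simply be thresholded, while during training gradients flow through both the probability and threshold branches.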
arXiv Detail & Related papers (2022-02-21T15:30:14Z) - LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single
Image Super-Resolution and Beyond [75.37541439447314]
Single image super-resolution (SISR) deals with a fundamental problem of upsampling a low-resolution (LR) image to its high-resolution (HR) version.
This paper proposes a linearly-assembled pixel-adaptive regression network (LAPAR) to strike a sweet spot of deep model complexity and resulting SISR quality.
arXiv Detail & Related papers (2021-05-21T15:47:18Z) - ISTR: End-to-End Instance Segmentation with Transformers [147.14073165997846]
We propose an instance segmentation Transformer, termed ISTR, which is the first end-to-end framework of its kind.
ISTR predicts low-dimensional mask embeddings, and matches them with ground truth mask embeddings for the set loss.
Benefiting from the proposed end-to-end mechanism, ISTR demonstrates state-of-the-art performance even with approximation-based suboptimal embeddings.
arXiv Detail & Related papers (2021-05-03T06:00:09Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Improving Pixel Embedding Learning through Intermediate Distance
Regression Supervision for Instance Segmentation [8.870513218826083]
We propose a simple, yet highly effective, architecture for object-aware embedding learning.
A distance regression module is incorporated into our architecture to generate seeds for fast clustering.
We show that the features learned by the distance regression module are able to promote the accuracy of learned object-aware embeddings significantly.
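A minimal sketch of how a regressed distance map could supply seeds for fast clustering of pixel embeddings: pixels far from any boundary become seeds, and remaining pixels are assigned to the seed with the closest embedding. The function names, threshold `tau`, and brute-force nearest-seed assignment are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def seeds_from_distance(dist_map, tau=0.7):
    """Pixels whose regressed distance-to-boundary exceeds tau act as
    cluster seeds: they lie deep inside instances, far from boundaries."""
    ys, xs = np.where(dist_map > tau)
    return list(zip(ys.tolist(), xs.tolist()))

def cluster_by_seed(embeddings, seeds):
    """Assign every pixel to the seed whose embedding vector is closest."""
    H, W, D = embeddings.shape
    seed_vecs = np.stack([embeddings[y, x] for y, x in seeds])  # (S, D)
    flat = embeddings.reshape(-1, D)                            # (H*W, D)
    d = np.linalg.norm(flat[:, None, :] - seed_vecs[None, :, :], axis=2)
    return d.argmin(axis=1).reshape(H, W)                       # seed index per pixel

# Toy example: two instances with distinct embeddings, one seed inside each
emb = np.zeros((4, 4, 2))
emb[:, 2:, :] = 1.0
dist = np.zeros((4, 4))
dist[1, 0] = 0.9
dist[1, 3] = 0.9
labels = cluster_by_seed(emb, seeds_from_distance(dist))
```

Because each seed sits well inside an instance, a single nearest-seed pass replaces iterative clustering, which is where the claimed speedup would come from.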
arXiv Detail & Related papers (2020-07-13T20:03:30Z) - Unsupervised Learning Consensus Model for Dynamic Texture Videos
Segmentation [12.462608802359936]
We present an effective unsupervised learning consensus model (ULCM) for the segmentation of dynamic texture videos.
In the proposed model, the set of values of the requantized local binary patterns (LBP) histogram around the pixel to be classified are used as features.
Experiments conducted on the challenging SynthDB dataset show that ULCM is significantly faster, simpler to implement, and has fewer parameters.
arXiv Detail & Related papers (2020-06-29T16:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.