SPIN: Structure-Preserving Inner Offset Network for Scene Text
Recognition
- URL: http://arxiv.org/abs/2005.13117v4
- Date: Mon, 25 Oct 2021 09:33:59 GMT
- Title: SPIN: Structure-Preserving Inner Offset Network for Scene Text
Recognition
- Authors: Chengwei Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Fei Wu
and Futai Zou
- Abstract summary: Arbitrary text appearance poses a great challenge in scene text recognition tasks.
We introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN).
SPIN allows the color manipulation of source data within the network.
- Score: 48.676064155070556
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Arbitrary text appearance poses a great challenge in scene text recognition
tasks. Existing works mostly address the problem from the perspective of shape
distortion, including perspective distortion, line curvature, and other style
variations; methods based on spatial transformers have therefore been studied
extensively. However, chromatic difficulties in complex scenes have received
little attention. In this work, we introduce a new learnable
geometric-unrelated module, the Structure-Preserving Inner Offset Network
(SPIN), which allows the color manipulation of source data within the network.
This differentiable module can be inserted before any recognition architecture
to ease downstream tasks, giving neural networks the ability to actively
transform input intensity, in contrast to existing spatial rectification. It
can also serve as a complement to known spatial transformations, working with
them both independently and collaboratively. Extensive experiments show that
SPIN yields significant improvements on multiple text recognition benchmarks
compared to state-of-the-art methods.
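The abstract describes SPIN as a differentiable module, inserted before any recognizer, that remaps input pixel intensities rather than spatial coordinates. As a rough illustration only (not the paper's actual parameterization), the sketch below combines a few fixed gamma curves with per-image weights that a small network would predict; the function name, gamma set, and weighting scheme are all hypothetical simplifications:

```python
def intensity_transform(pixels, weights, gammas=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pointwise intensity remapping: a weighted sum of fixed gamma curves.

    pixels:  flattened grayscale values in [0, 1].
    weights: one weight per gamma curve (in SPIN these would be predicted
             by a lightweight network from the input image); assumed to
             sum to 1 here.
    """
    assert len(weights) == len(gammas)
    return [sum(w * (p ** g) for w, g in zip(weights, gammas)) for p in pixels]

# With only the gamma = 1.0 curve active, the transform is the identity;
# shifting weight onto other curves brightens or darkens the image.
identity = [0.0, 0.0, 1.0, 0.0, 0.0]
print(intensity_transform([0.0, 0.5, 1.0], identity))  # → [0.0, 0.5, 1.0]
```

Because each gamma curve is differentiable in both the pixel value and the weights, gradients from the downstream recognition loss can flow back into the weight-predicting network, which is what makes such a module trainable end to end.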
Related papers
- Latent Space Translation via Semantic Alignment [29.2401314068038]
We show how representations learned from different neural modules can be translated between different pre-trained networks.
Our method directly estimates a transformation between two given latent spaces, thereby enabling effective stitching of encoders and decoders without additional training.
Notably, we show how it is possible to zero-shot stitch text encoders and vision decoders, or vice-versa, yielding surprisingly good classification performance in this multimodal setting.
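The blurb above hinges on estimating a transformation between two latent spaces from paired anchor points. As a deliberately simplified illustration (a per-dimension affine fit, not the paper's actual estimator), the closed-form least-squares sketch below recovers such a map from anchors; the function name and example data are hypothetical:

```python
def fit_affine_per_dim(xs, ys):
    """Least-squares fit y ≈ a*x + b for one latent dimension,
    given paired anchor coordinates xs (source space) and ys (target space)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var          # slope of the fitted map
    b = my - a * mx        # intercept
    return a, b

# Anchors from a source space and a target space related by y = 2x + 1;
# with no noise, the least-squares fit recovers the map exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_affine_per_dim(xs, ys)
print(a, b)  # → 2.0 1.0
```

Once such a map is estimated, a source encoder's outputs can be pushed through it and fed to a target decoder, which is the "stitching without additional training" idea the summary describes.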
arXiv Detail & Related papers (2023-11-01T17:12:00Z) - Exploring Geometry of Blind Spots in Vision Models [56.47644447201878]
We study the phenomenon of under-sensitivity in vision models such as CNNs and Transformers.
We propose a Level Set Traversal algorithm that iteratively explores regions of high confidence with respect to the input space.
We estimate the extent of these connected higher-dimensional regions over which the model maintains a high degree of confidence.
arXiv Detail & Related papers (2023-10-30T18:00:33Z) - FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z) - ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy
in Transformer [88.61312640540902]
We introduce the Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter).
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z) - Point-aware Interaction and CNN-induced Refinement Network for RGB-D
Salient Object Detection [95.84616822805664]
We introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement.
To alleviate the block effect and detail destruction problems naturally brought by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation.
arXiv Detail & Related papers (2023-08-17T11:57:49Z) - Enhancing Deformable Local Features by Jointly Learning to Detect and
Describe Keypoints [8.390939268280235]
Local feature extraction is a standard approach in computer vision for tackling important tasks such as image matching and retrieval.
We propose DALF, a novel deformation-aware network for jointly detecting and describing keypoints.
Our approach also enhances the performance of two real-world applications: deformable object retrieval and non-rigid 3D surface registration.
arXiv Detail & Related papers (2023-04-02T18:01:51Z) - Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Context Decoupling Augmentation for Weakly Supervised Semantic
Segmentation [53.49821324597837]
Weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years.
We present a Context Decoupling Augmentation (CDA) method to change the inherent context in which the objects appear.
To validate the effectiveness of the proposed method, extensive experiments on PASCAL VOC 2012 dataset with several alternative network architectures demonstrate that CDA can boost various popular WSSS methods to the new state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-03-02T15:05:09Z) - Multi-Subspace Neural Network for Image Recognition [33.61205842747625]
In image classification tasks, feature extraction is always a major issue, and intra-class variability increases the difficulty of designing extractors.
Recently, deep learning has drawn much attention for automatically learning features from data.
In this study, we propose the multi-subspace neural network (MSNN), which integrates a key component of convolutional neural networks (CNNs), the receptive field, with the subspace concept.
arXiv Detail & Related papers (2020-06-17T02:55:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.