Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition
- URL: http://arxiv.org/abs/2106.06960v1
- Date: Sun, 13 Jun 2021 10:36:56 GMT
- Title: Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition
- Authors: Mengmeng Cui, Wei Wang, Jinjin Zhang, Liang Wang
- Abstract summary: We propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck.
In the encoder module, local visual features, global context features, and position information are aligned and fused to generate a small-size comprehensive feature map.
In the decoder module, two methods are used to enhance the correlation between the scene and text feature spaces.
- Score: 10.496558786568672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention-based encoder-decoder framework is widely used in the scene
text recognition task. However, for the current state-of-the-art (SOTA) methods,
there is room for improvement in the efficient use of local visual and global
context information of the input text image, as well as in the robust
correlation between the scene processing module (encoder) and the text
processing module (decoder). In this paper, we propose a Representation and
Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these
deficiencies and break the performance bottleneck. In the encoder module, local
visual features, global context features, and position information are aligned
and fused to generate a small-size comprehensive feature map. In the decoder
module, two methods are used to enhance the correlation between the scene and
text feature spaces. 1) The decoder initialization is guided by the holistic
feature and the global glimpse vector exported from the encoder. 2) The
feature-enriched glimpse vector produced by the Multi-Head General Attention is
used to assist the RNN iteration and the character prediction at each time
step. Meanwhile, we also design a Layernorm-Dropout LSTM cell to improve the
model's generalization to varied texts. Extensive experiments on the
benchmarks demonstrate the advantageous performance of RCEED in scene text
recognition tasks, especially irregular ones.
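The decoder description above is concrete enough to sketch in code. Below is a minimal PyTorch sketch, assuming a standard LSTM-cell formulation with layer normalization on the gate pre-activations and dropout on the emitted hidden state; the exact placement of LayerNorm and Dropout in RCEED, and all names here, are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LayerNormDropoutLSTMCell(nn.Module):
    """Hedged sketch of a Layernorm-Dropout LSTM cell: a vanilla LSTM
    cell with LayerNorm on the gate pre-activations and the cell state,
    plus dropout on the emitted hidden state (placement is assumed)."""

    def __init__(self, input_size: int, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        # One linear map produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.ln_gates = nn.LayerNorm(4 * hidden_size)
        self.ln_cell = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, state):
        h, c = state
        z = self.ln_gates(self.gates(torch.cat([x, h], dim=-1)))
        i, f, g, o = z.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.ln_cell(c))
        return self.dropout(h), c

def decode_step(cell, classifier, char_emb, glimpse, state):
    """One decoding step: the glimpse vector (assumed here to come from
    the Multi-Head General Attention over the encoder feature map) is
    concatenated with the previous character embedding before the
    recurrent update, and again before the character prediction."""
    h, c = cell(torch.cat([char_emb, glimpse], dim=-1), state)
    logits = classifier(torch.cat([h, glimpse], dim=-1))
    return logits, (h, c)
```

Per the abstract, the initial decoder state would be derived from the encoder's holistic feature and global glimpse vector (e.g., a linear projection of their concatenation) rather than zeros; that projection is likewise an assumption here.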
Related papers
- FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning [0.15346678870160887]
This paper introduces a novel approach that integrates features from two distinct CNN-based encoders.
We also propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder (a sketch of this idea appears after this list).
The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
arXiv Detail & Related papers (2025-02-13T12:54:13Z)
- Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition [82.88856416080331]
Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications.
Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders.
We propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process.
arXiv Detail & Related papers (2025-02-10T02:12:24Z)
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Adjacent Context Coordination Network for Salient Object Detection in Optical Remote Sensing Images [102.75699068451166]
We propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for optical RSI-SOD.
The proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU.
arXiv Detail & Related papers (2022-03-25T14:14:55Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition [31.62436356768889]
We show that a character-level sequence decoder utilizes not only context information but also positional information.
We propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition.
Our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between context and positional clues.
arXiv Detail & Related papers (2020-07-15T08:37:40Z)
- SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2020-05-22T03:02:46Z)
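The weighted averaging mentioned in the FE-LWS entry above can also be sketched. A minimal sketch, assuming learnable softmax-normalized weights over the output sequences of a stack of GRU layers; the class and parameter names are hypothetical, and the actual FE-LWS formulation may differ.

```python
import torch
import torch.nn as nn

class WeightedStackedGRUDecoder(nn.Module):
    """Hypothetical sketch: combine the outputs of all GRU layers in a
    stacked decoder by a learned weighted average (per the FE-LWS
    summary); the paper's exact weighting scheme is not specified here."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(input_size if i == 0 else hidden_size, hidden_size,
                    batch_first=True) for i in range(num_layers)]
        )
        # One learnable scalar weight per GRU layer, softmax-normalized.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, x):
        outputs, h = [], x
        for gru in self.layers:
            h, _ = gru(h)
            outputs.append(h)
        w = torch.softmax(self.layer_weights, dim=0)
        # Weighted average over the per-layer output sequences.
        return sum(wi * oi for wi, oi in zip(w, outputs))
```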