SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition
- URL: http://arxiv.org/abs/2005.10977v1
- Date: Fri, 22 May 2020 03:02:46 GMT
- Title: SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition
- Authors: Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, Weiping Wang
- Abstract summary: We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
- Score: 17.191496890376197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition is a hot research topic in computer vision. Recently,
many recognition methods based on the encoder-decoder framework have been
proposed, and they can handle scene texts of perspective distortion and curve
shape. Nevertheless, they still face lots of challenges like image blur, uneven
illumination, and incomplete characters. We argue that most encoder-decoder
methods are based on local visual features without explicit global semantic
information. In this work, we propose a semantics enhanced encoder-decoder
framework to robustly recognize low-quality scene texts. The semantic
information is used both in the encoder module for supervision and in the
decoder module for initializing. In particular, the state-of-the-art ASTER
method is integrated into the proposed framework as an exemplar. Extensive
experiments demonstrate that the proposed framework is more robust for
low-quality text images, and achieves state-of-the-art results on several
benchmark datasets.
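
The core mechanism can be illustrated with a short sketch: a semantic head predicts a global semantic vector from the encoded visual features, that vector is supervised against a pretrained word embedding of the transcription, and it also initializes the decoder's hidden state. The PyTorch code below is a minimal illustration under assumed module sizes and a cosine-based semantic loss; it is not the authors' implementation, and the stand-in GRUs replace ASTER's backbone and attention decoder to keep the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsEnhancedRecognizer(nn.Module):
    """Illustrative SEED-style model (not the official code)."""

    def __init__(self, feat_dim=512, sem_dim=300, hidden_dim=256, vocab_size=97):
        super().__init__()
        # Stand-in visual encoder: any CNN/BiLSTM backbone would go here.
        self.encoder = nn.GRU(feat_dim, feat_dim // 2, bidirectional=True, batch_first=True)
        # Semantic head: predicts a global semantic vector from visual features.
        self.semantic_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, sem_dim)
        )
        # The semantic vector initializes the decoder's hidden state.
        self.sem_to_hidden = nn.Linear(sem_dim, hidden_dim)
        self.decoder = nn.GRUCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, word_embedding=None, max_len=25):
        # visual_feats: (B, T, feat_dim) sequence from the backbone.
        enc, _ = self.encoder(visual_feats)
        semantics = self.semantic_head(enc.mean(dim=1))        # (B, sem_dim)

        # Supervision in the encoder module: pull the predicted semantics
        # toward a pretrained word embedding of the transcription.
        sem_loss = None
        if word_embedding is not None:
            sem_loss = 1.0 - F.cosine_similarity(semantics, word_embedding).mean()

        # Initialization in the decoder module.
        h = torch.tanh(self.sem_to_hidden(semantics))          # (B, hidden_dim)
        logits = []
        for _ in range(max_len):
            # A real model would attend over enc here; we feed pooled context.
            h = self.decoder(enc.mean(dim=1), h)
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1), sem_loss

model = SemanticsEnhancedRecognizer()
feats = torch.randn(2, 32, 512)     # fake backbone features
wemb = torch.randn(2, 300)          # fake FastText-style word embedding
logits, sem_loss = model(feats, wemb)
print(logits.shape, sem_loss.item())
```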
Related papers
- A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning [0.15346678870160887]
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
arXiv Detail & Related papers (2024-09-27T06:12:31Z)
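
As a rough illustration of how graph-convolved word embeddings might feed a multi-layer LSTM decoder, the sketch below runs one A·X·W graph-convolution step over a word graph and uses the result as the decoder's embedding table. The adjacency matrix, all sizes, and the single-step GCN are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 256, 512

# One graph-convolution step: A_hat @ X @ W over a normalized word graph.
adj = torch.eye(vocab_size)                      # stand-in normalized adjacency
node_feats = torch.randn(vocab_size, emb_dim)    # initial word features
gcn_weight = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.02)
word_embeddings = torch.relu(adj @ node_feats @ gcn_weight)  # (V, emb_dim)

# The decoder consumes the graph-enhanced embeddings instead of
# embeddings learned from scratch.
embedding = nn.Embedding.from_pretrained(word_embeddings.detach(), freeze=False)
decoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
proj = nn.Linear(hid_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 12))   # fake caption prefix
out, _ = decoder(embedding(tokens))
logits = proj(out)                               # (2, 12, vocab_size)
print(logits.shape)
```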
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
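
The decoder-only pre-training idea can be sketched as follows: frozen text-tower embeddings act as the cross-attention memory in place of image features, so only the decoder and classifier receive gradients. `FakeCLIPTextEncoder`, the learnable character queries, and all sizes are placeholders for illustration; a real run would use CLIP's actual text encoder.

```python
import torch
import torch.nn as nn

class FakeCLIPTextEncoder(nn.Module):
    """Placeholder for CLIP's text tower, producing token-level embeddings."""
    def __init__(self, vocab=500, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):
        return self.proj(self.emb(tokens))        # (B, L, dim) pseudo "visual" feats

text_encoder = FakeCLIPTextEncoder()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
classifier = nn.Linear(512, 500)

tokens = torch.randint(0, 500, (4, 16))
with torch.no_grad():                             # the text encoder stays frozen
    pseudo_visual = text_encoder(tokens)

# The STR decoder cross-attends to the pseudo visual embeddings exactly as it
# would to image features; only decoder and classifier receive gradients.
queries = torch.randn(4, 25, 512, requires_grad=True)  # learnable char queries (assumption)
out = decoder(tgt=queries, memory=pseudo_visual)
logits = classifier(out)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 500), torch.randint(0, 500, (4 * 25,))
)
loss.backward()
print(logits.shape)
```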
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
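
A minimal masked-image-modeling sketch in the spirit of the encoder pre-training: patchify a text-line image, replace a random subset of patches with a mask token, and reconstruct the masked patches. The mask ratio, patch size, and pixel-MSE target are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

patch_dim, n_patches, mask_ratio = 32 * 8 * 3, 64, 0.6   # 8x32 RGB patches

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=4
)
patch_embed = nn.Linear(patch_dim, 256)
mask_token = nn.Parameter(torch.zeros(1, 1, 256))
reconstruct = nn.Linear(256, patch_dim)

patches = torch.randn(2, n_patches, patch_dim)    # unlabeled real text image, patchified
mask = torch.rand(2, n_patches) < mask_ratio      # random patch mask

x = patch_embed(patches)
x = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
pred = reconstruct(encoder(x))

# Loss only on masked positions, as in standard masked image modeling.
loss = ((pred - patches) ** 2)[mask].mean()
loss.backward()
print(loss.item())
```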
- Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition [10.496558786568672]
We propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck.
In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map.
In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space.
arXiv Detail & Related papers (2021-06-13T10:36:56Z)
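
One plausible reading of the encoder-side fusion is sketched below: local visual features, a global context vector, and learned positional encodings are projected to a common dimension and fused into a compact feature map. The concat-plus-MLP fusion and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

B, T, local_dim, global_dim, fused_dim = 2, 48, 512, 256, 256

to_common = nn.Linear(local_dim, fused_dim)
global_proj = nn.Linear(global_dim, fused_dim)
pos_table = nn.Parameter(torch.randn(1, T, fused_dim) * 0.02)  # learned positions
fuse = nn.Sequential(nn.Linear(3 * fused_dim, fused_dim), nn.ReLU())

local_feats = torch.randn(B, T, local_dim)       # e.g. CNN column features
global_ctx = torch.randn(B, global_dim)          # e.g. pooled holistic feature

# Align the three information sources to a common size, then fuse.
aligned_local = to_common(local_feats)                          # (B, T, fused)
aligned_global = global_proj(global_ctx)[:, None].expand(-1, T, -1)
aligned_pos = pos_table.expand(B, -1, -1)

comprehensive = fuse(torch.cat([aligned_local, aligned_global, aligned_pos], dim=-1))
print(comprehensive.shape)                       # compact fused map (B, T, 256)
```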
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
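
A minimal sketch of a modality transition module: an MLP maps pooled visual features into a semantic (text-embedding) space, and a modality loss pulls the result toward the caption's embedding. The MLP shape and the cosine form of the loss are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    """Maps visual features into a semantic space before the language model."""
    def __init__(self, vis_dim=2048, sem_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, 1024), nn.ReLU(), nn.Linear(1024, sem_dim)
        )

    def forward(self, vis_feats):
        return self.mlp(vis_feats)                # semantic-space representation

mtm = ModalityTransition()
vis = torch.randn(4, 2048)                        # pooled CNN features (assumption)
target_sem = torch.randn(4, 512)                  # caption embedding (assumption)

sem = mtm(vis)
modality_loss = 1.0 - F.cosine_similarity(sem, target_sem).mean()
modality_loss.backward()                          # optimizes the transition network
print(sem.shape, modality_loss.item())
```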
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture, we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
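
The re-balancing idea can be sketched with a weighted segmentation loss: weights grow with inverse class frequency so under-represented classes contribute more. The normalized inverse-frequency weight function below is an assumption, not necessarily the paper's formula.

```python
import torch
import torch.nn.functional as F

num_classes = 5
logits = torch.randn(2, num_classes, 64, 64, requires_grad=True)  # decoder output
labels = torch.randint(0, num_classes, (2, 64, 64))

# Weight each class by normalized inverse frequency (illustrative choice).
freq = torch.bincount(labels.flatten(), minlength=num_classes).float()
weights = 1.0 / (freq + 1.0)
weights = weights / weights.sum() * num_classes   # keep the average weight near 1

loss = F.cross_entropy(logits, labels, weight=weights)
loss.backward()
print(weights, loss.item())
```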
- RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition [31.62436356768889]
We show that a character-level sequence decoder utilizes not only context information but also positional information.
We propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition.
Our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between context and positional clues.
arXiv Detail & Related papers (2020-07-15T08:37:40Z)
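
The dynamic fusion can be sketched as a learned gate that mixes a context-attention glimpse with a position-only glimpse at every decoding step. Both branches are stand-ins here; only the gating pattern reflects the described mechanism.

```python
import torch
import torch.nn as nn

B, dim, steps = 2, 256, 25

gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
query_pos = nn.Embedding(steps, dim)              # position enhancement branch

context_glimpse = torch.randn(B, steps, dim)      # from context attention (stand-in)
position_glimpse = query_pos.weight[None].expand(B, -1, -1)  # position-only clues

# Dynamic ratio between context and positional information at every step.
ratio = gate(torch.cat([context_glimpse, position_glimpse], dim=-1))
fused = ratio * context_glimpse + (1.0 - ratio) * position_glimpse
print(fused.shape)                                # (B, steps, dim)
```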
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, the representations from the last encoder layer serve as a global view, while those from other encoder layers are supplemented for a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
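
A simple realization of the idea is sketched below: decoder layer i cross-attends to a memory built from the last encoder layer (the global view) concatenated with encoder layer i's output (an auxiliary view). Concatenation along the sequence axis is one assumption; the paper's exact combination may differ.

```python
import torch
import torch.nn as nn

d_model, n_layers = 256, 3
enc_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)
)
dec_layers = nn.ModuleList(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)
)

src = torch.randn(2, 20, d_model)
tgt = torch.randn(2, 15, d_model)

# Keep every encoder layer's output, not just the last one.
enc_outs, x = [], src
for layer in enc_layers:
    x = layer(x)
    enc_outs.append(x)

y = tgt
for i, layer in enumerate(dec_layers):
    memory = torch.cat([enc_outs[-1], enc_outs[i]], dim=1)  # global + layer-i view
    y = layer(y, memory)
print(y.shape)
```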