SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition
- URL: http://arxiv.org/abs/2005.10977v1
- Date: Fri, 22 May 2020 03:02:46 GMT
- Title: SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition
- Authors: Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, Weiping Wang
- Abstract summary: We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
- Score: 17.191496890376197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition is a hot research topic in computer vision. Recently,
many recognition methods based on the encoder-decoder framework have been
proposed, and they can handle scene text with perspective distortion and curved
shapes. Nevertheless, they still face many challenges, such as image blur, uneven
illumination, and incomplete characters. We argue that most encoder-decoder
methods are based on local visual features without explicit global semantic
information. In this work, we propose a semantics enhanced encoder-decoder
framework to robustly recognize low-quality scene texts. The semantic
information is used both in the encoder module for supervision and in the
decoder module for initialization. In particular, the state-of-the-art ASTER
method is integrated into the proposed framework as an exemplar. Extensive
experiments demonstrate that the proposed framework is more robust for
low-quality text images, and achieves state-of-the-art results on several
benchmark datasets.
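As one hedged reading of this design, the sketch below predicts a global semantic vector from encoder features, supervises it against a pre-trained word embedding with a cosine loss, and uses it to initialize the decoder state. Module names and sizes are illustrative assumptions, not the authors' code, and a random tensor stands in for the word embedding.

```python
# Minimal sketch of the SEED idea (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsEnhancedRecognizer(nn.Module):
    def __init__(self, feat_dim=512, sem_dim=300, hidden_dim=256, num_classes=97):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, feat_dim // 2, bidirectional=True, batch_first=True)
        self.sem_head = nn.Linear(feat_dim, sem_dim)      # predicts global semantics
        self.init_proj = nn.Linear(sem_dim, hidden_dim)   # semantics -> decoder h0
        self.decoder = nn.GRU(num_classes, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_feats, prev_onehots):
        enc, _ = self.encoder(visual_feats)               # (B, T, feat_dim)
        semantics = self.sem_head(enc.mean(dim=1))        # (B, sem_dim)
        h0 = torch.tanh(self.init_proj(semantics)).unsqueeze(0)
        out, _ = self.decoder(prev_onehots, h0)           # teacher forcing
        return self.classifier(out), semantics

model = SemanticsEnhancedRecognizer()
feats = torch.randn(2, 25, 512)          # stand-in CNN feature sequence
prev = torch.zeros(2, 10, 97)            # stand-in previous-symbol one-hots
logits, semantics = model(feats, prev)
word_emb = torch.randn(2, 300)           # stand-in pre-trained word embedding
sem_loss = 1 - F.cosine_similarity(semantics, word_emb).mean()  # encoder-side supervision
```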
Related papers
- CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition [22.13675752628]
We propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition.
CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch.
Experiments demonstrate that the proposed CMFN algorithm achieves performance comparable to state-of-the-art algorithms.
arXiv Detail & Related papers (2024-01-18T15:05:57Z)
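CMFN's iterative semantic branch could look roughly like the following sketch, which repeatedly re-embeds and refines a prediction hypothesis; the structure and sizes are assumptions, not the released code.

```python
# Hedged sketch of iterative semantic refinement in the spirit of CMFN.
import torch
import torch.nn as nn

class IterativeSemanticBranch(nn.Module):
    def __init__(self, dim=256, num_classes=37, iters=3):
        super().__init__()
        self.embed = nn.Linear(num_classes, dim)
        self.refine = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.classify = nn.Linear(dim, num_classes)
        self.iters = iters

    def forward(self, visual_logits):
        logits = visual_logits
        for _ in range(self.iters):                          # each pass re-reads
            x = self.refine(self.embed(logits.softmax(-1)))  # the current hypothesis
            logits = self.classify(x)
        return logits

branch = IterativeSemanticBranch()
refined = branch(torch.randn(2, 26, 37))   # (batch, positions, classes)
```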
- DTrOCR: Decoder-only Transformer for Optical Character Recognition [0.0]
We propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR).
This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus.
Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
arXiv Detail & Related papers (2023-08-30T12:37:03Z)
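A minimal sketch of the decoder-only idea, assuming image patch embeddings are fed as a prefix to a causal Transformer that then predicts text tokens; all names and sizes are illustrative, not the published model.

```python
# Illustrative decoder-only OCR sketch (not the released DTrOCR model).
import torch
import torch.nn as nn

class DecoderOnlyOCR(nn.Module):
    def __init__(self, dim=256, vocab=100, patches=64, max_text=32):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)          # patch features -> model width
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(patches + max_text, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, patch_feats, text_ids):
        x = torch.cat([self.patch_proj(patch_feats), self.tok_emb(text_ids)], dim=1)
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        x = self.blocks(x, mask=mask)                   # causal self-attention only
        return self.head(x[:, patch_feats.size(1):])    # logits over text positions

model = DecoderOnlyOCR()
logits = model(torch.randn(2, 64, 768), torch.randint(0, 100, (2, 16)))
```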
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, so that the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
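The projection step might look like the sketch below, which re-expresses an image embedding as a softmax-weighted mixture of stored text embeddings, so the decoder only ever sees vectors from the text space; the memory contents and dimensions are stand-ins.

```python
# Rough sketch of projecting a visual embedding into CLIP's text space.
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb, text_memory, temperature=0.07):
    """Weight stored text embeddings by similarity to the image embedding."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_memory, dim=-1).T
    weights = F.softmax(sims / temperature, dim=-1)   # (B, M)
    return weights @ text_memory                      # (B, D), inside text space

image_emb = torch.randn(2, 512)        # stand-in CLIP image embedding
text_memory = torch.randn(1000, 512)   # stand-in stored CLIP text embeddings
conditioned = project_to_text_space(image_emb, text_memory)
```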
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach, MaskOCR, to unify vision and language pre-training within the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
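A minimal masked-image-modeling sketch in this spirit: random patches are masked and the encoder regresses the missing pixels. The mask ratio, sizes, and reconstruction target are assumptions rather than the paper's exact recipe.

```python
# Masked-patch pre-training sketch (illustrative, not MaskOCR's code).
import torch
import torch.nn as nn

class MaskedPatchPretrainer(nn.Module):
    def __init__(self, patch_dim=192, dim=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.reconstruct = nn.Linear(dim, patch_dim)

    def forward(self, patches, mask_ratio=0.6):
        x = self.embed(patches)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        recon = self.reconstruct(self.encoder(x))
        return ((recon - patches)[mask] ** 2).mean()   # loss on masked patches only

loss = MaskedPatchPretrainer()(torch.randn(2, 32, 192))
```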
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
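A generic contrastive vision-language objective of the kind such joint pre-training typically uses is sketched below; the in-batch pairing and temperature are standard choices, not necessarily the paper's exact loss.

```python
# Standard symmetric contrastive loss over paired image/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(img))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```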
- Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition [10.496558786568672]
We propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck.
In the encoder module, local visual features, a global context feature, and position information are aligned and fused to generate a compact, comprehensive feature map.
In the decoder module, two methods are utilized to enhance the correlation between the scene and text feature spaces.
arXiv Detail & Related papers (2021-06-13T10:36:56Z)
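The encoder-side fusion could be sketched as below, with local features, a global vector, and positional encodings projected to one width and summed; shapes and names are illustrative assumptions, not the RCEED implementation.

```python
# Hedged sketch of aligning and fusing local, global, and position features.
import torch
import torch.nn as nn

class FusedEncoderFeatures(nn.Module):
    def __init__(self, local_dim=512, global_dim=256, dim=256, max_len=64):
        super().__init__()
        self.local_proj = nn.Linear(local_dim, dim)
        self.global_proj = nn.Linear(global_dim, dim)
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, local_feats, global_feat):
        B, T, _ = local_feats.shape
        pos = self.pos(torch.arange(T, device=local_feats.device))
        return (self.local_proj(local_feats)
                + self.global_proj(global_feat).unsqueeze(1)  # broadcast over T
                + pos)                                        # (B, T, dim)

feats = FusedEncoderFeatures()(torch.randn(2, 48, 512), torch.randn(2, 256))
```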
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
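A hedged sketch of such a modality transition module: an MLP maps pooled visual features toward sentence-level semantic targets under a cosine "modality loss". The targets and sizes here are stand-ins, not the paper's setup.

```python
# Sketch of a Modality Transition Module with a cosine modality loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, sem_dim))

    def forward(self, visual_feat):
        return self.mlp(visual_feat)       # semantic-space representation

mtm = ModalityTransition()
visual = torch.randn(4, 2048)              # pooled CNN features (stand-in)
target = torch.randn(4, 512)               # sentence embedding target (stand-in)
semantic = mtm(visual)
modality_loss = 1 - F.cosine_similarity(semantic, target).mean()
```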
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
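One way to picture this is the sketch below, where several shallow convolutional heads are averaged and inverse-frequency class weights re-balance the loss toward under-represented classes; the head structure and weighting scheme are assumptions.

```python
# Shallow-ensemble decoder with class re-balancing (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowEnsembleDecoder(nn.Module):
    def __init__(self, in_ch=256, num_classes=21, heads=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(64, num_classes, 1))
            for _ in range(heads))

    def forward(self, feats):
        return torch.stack([h(feats) for h in self.heads]).mean(0)

decoder = ShallowEnsembleDecoder()
logits = decoder(torch.randn(2, 256, 32, 32))     # (2, 21, 32, 32)
class_freq = torch.rand(21) + 0.01                # stand-in pixel frequencies
weights = 1.0 / class_freq                        # rarer class -> larger weight
labels = torch.randint(0, 21, (2, 32, 32))
loss = F.cross_entropy(logits, labels, weight=weights / weights.sum())
```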
- RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition [31.62436356768889]
We show that a character-level sequence decoder utilizes not only context information but also positional information.
We propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition.
Our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between contextual and positional clues.
arXiv Detail & Related papers (2020-07-15T08:37:40Z)
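The dynamic fusion could be sketched as a learned per-step gate mixing context-driven and position-driven glimpses, as below; sizes and names are illustrative, not the released RobustScanner code.

```python
# Gated fusion of context and position glimpses (illustrative sketch).
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim=256, num_classes=37):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, context_glimpse, position_glimpse):
        g = torch.sigmoid(self.gate(torch.cat([context_glimpse, position_glimpse], -1)))
        mixed = g * context_glimpse + (1 - g) * position_glimpse  # per-step ratio
        return self.classifier(mixed)

fusion = DynamicFusion()
logits = fusion(torch.randn(2, 25, 256), torch.randn(2, 25, 256))
```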
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where each decoder layer attends to the representations from the last encoder layer, which serve as a global view, supplemented by those from other encoder layers for a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
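A simplified sketch of this idea: each decoder layer cross-attends both to the last encoder layer (global view) and to one earlier layer, then sums the two views. This collapses the paper's scheme to a single extra view, and all names are invented.

```python
# Multi-view cross-attention over two encoder layers (illustrative).
import torch
import torch.nn as nn

class MultiViewCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, last_enc_layer, other_enc_layer):
        g, _ = self.global_attn(queries, last_enc_layer, last_enc_layer)
        l, _ = self.local_attn(queries, other_enc_layer, other_enc_layer)
        return queries + g + l     # combine global and stereoscopic views

mv = MultiViewCrossAttention()
out = mv(torch.randn(2, 10, 256), torch.randn(2, 20, 256), torch.randn(2, 20, 256))
```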