TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance
- URL: http://arxiv.org/abs/2111.08314v1
- Date: Tue, 16 Nov 2021 09:10:39 GMT
- Title: TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance
- Authors: Yue Tao, Zhiwei Jia, Runze Ma, Shugong Xu
- Abstract summary: Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding features into text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
- Score: 15.72669617789124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) is an important bridge between images and text,
attracting abundant research attention. While convolutional neural networks
(CNNs) have achieved remarkable progress in this task, most existing works
need an extra context modeling module to help the CNN capture global
dependencies, compensating for its inductive bias and strengthening the
relationships between text features. Recently, the transformer has been
proposed as a promising network for global context modeling via self-attention,
but one of its main shortcomings when applied to recognition is efficiency.
We propose a 1-D split to address this complexity and replace the CNN with a
transformer encoder, reducing the need for a context modeling module.
Furthermore, recent methods use a frozen initial embedding to guide the decoder
in decoding features into text, leading to a loss of accuracy. We instead
propose a learnable initial embedding, learned from the transformer encoder,
that adapts to different input images. Building on these ideas, we introduce
a novel architecture for text recognition, named TRansformer-based text
recognizer with Initial embedding Guidance (TRIG), composed of three stages
(transformation, feature extraction, and prediction). Extensive experiments
show that our approach achieves state-of-the-art results on text recognition
benchmarks.
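The abstract describes a three-stage pipeline: a transformation stage, a feature extraction stage built on a transformer encoder over a 1-D split of the image, and a prediction stage whose decoder is guided by a learnable, image-adaptive initial embedding rather than a frozen one. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the module names, the convolutional stem used for the transformation stage, the attention pooling that produces the initial embedding, and all hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the three-stage design in the abstract
# (transformation -> feature extraction -> prediction). All details
# below are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TRIGSketch(nn.Module):
    def __init__(self, img_h=32, img_w=128, d_model=256, num_classes=97, max_len=25):
        super().__init__()
        # Stage 1: transformation. The paper's transformation stage is typically
        # a rectification network; a simple conv stem stands in here.
        self.transform = nn.Conv2d(3, d_model, kernel_size=(img_h, 4), stride=(img_h, 4))
        # Stage 2: feature extraction with a transformer encoder over a 1-D
        # sequence of vertical slices ("1-D split"), replacing a CNN backbone.
        self.pos_emb = nn.Parameter(torch.zeros(1, img_w // 4, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Learnable query that pools the encoder output into an image-adaptive
        # initial embedding (instead of a frozen constant).
        self.init_query = nn.Parameter(torch.zeros(1, 1, d_model))
        self.init_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Stage 3: prediction. A transformer decoder whose first input token is
        # the initial embedding; character queries attend to encoder features.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.char_queries = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, images):                      # images: (B, 3, 32, 128)
        b = images.size(0)
        x = self.transform(images)                  # (B, d_model, 1, W/4)
        x = x.flatten(2).transpose(1, 2)            # 1-D split: (B, W/4, d_model)
        memory = self.encoder(x + self.pos_emb)     # global context via self-attention
        # Image-adaptive initial embedding pooled from the encoder output.
        init_emb, _ = self.init_attn(self.init_query.expand(b, -1, -1), memory, memory)
        # Guide the decoder: prepend the initial embedding to the character queries.
        tgt = torch.cat([init_emb, self.char_queries.expand(b, -1, -1)], dim=1)
        out = self.decoder(tgt, memory)[:, 1:]      # drop the guidance token
        return self.classifier(out)                 # (B, max_len, num_classes)


if __name__ == "__main__":
    logits = TRIGSketch()(torch.randn(2, 3, 32, 128))
    print(logits.shape)                             # torch.Size([2, 25, 97])
```

In this sketch the initial embedding is recomputed from the encoder features for every input image, which is the key contrast with a frozen embedding that stays constant regardless of the image.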
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [88.61312640540902]
We introduce Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter)
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z)
- A Transformer-based Approach for Arabic Offline Handwritten Text Recognition [0.0]
We introduce two alternative architectures for recognizing offline Arabic handwritten text.
Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex.
Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches.
arXiv Detail & Related papers (2023-07-27T17:51:52Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)