TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models
- URL: http://arxiv.org/abs/2109.10282v2
- Date: Wed, 22 Sep 2021 16:44:16 GMT
- Title: TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models
- Authors: Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha
Zhang, Zhoujun Li, Furu Wei
- Abstract summary: We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
- Score: 47.48019831416665
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text recognition is a long-standing research problem for document
digitization. Existing approaches to text recognition are usually built on a
CNN for image understanding and an RNN for character-level text generation. In
addition, an external language model is usually needed as a post-processing
step to improve overall accuracy. In this paper, we propose an end-to-end
text recognition approach with pre-trained image Transformer and text
Transformer models, namely TrOCR, which leverages the Transformer architecture
for both image understanding and wordpiece-level text generation. The TrOCR
model is simple but effective, and can be pre-trained with large-scale
synthetic data and fine-tuned with human-labeled datasets. Experiments show
that the TrOCR model outperforms the current state-of-the-art models on both
printed and handwritten text recognition tasks. The code and models will be
publicly available at https://aka.ms/TrOCR.
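The abstract above describes an image Transformer consuming a text-line image. As in ViT-style encoders, this presumes the image is first cut into a sequence of fixed-size patches, each flattened into a vector, before the Transformer sees it. The toy sketch below (pure Python, grayscale pixel values, no real model; `image_to_patches` is a hypothetical helper for illustration, not part of the TrOCR release) shows that patch-sequence step:

```python
def image_to_patches(image, patch_size):
    """Split an H x W grid of pixel values into flattened patch vectors.

    image: list of rows (each a list of pixel values); H and W must be
    divisible by patch_size. Patches are returned in row-major order,
    matching the left-to-right, top-to-bottom sequence a ViT-style
    encoder would consume.
    """
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# Toy 4x8 "image" with pixel value r*8+c, split into 2x2 patches,
# yields a sequence of (4/2)*(8/2) = 8 patch vectors.
img = [[r * 8 + c for c in range(8)] for r in range(4)]
seq = image_to_patches(img, 2)
```

In the actual model, each flattened patch would additionally be linearly projected and combined with a position embedding; the decoder then generates wordpiece tokens conditioned on this sequence.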
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- DTrOCR: Decoder-only Transformer for Optical Character Recognition [0.0]
We propose a simpler and more effective method for text recognition, the Decoder-only Transformer for Optical Character Recognition (DTrOCR).
This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus.
Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
arXiv Detail & Related papers (2023-08-30T12:37:03Z)
- Transferring General Multimodal Pretrained Models to Text Recognition [46.33867696799362]
We recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task.
We construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.
arXiv Detail & Related papers (2022-12-19T08:30:42Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding features into text, which leads to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs)
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.