DTrOCR: Decoder-only Transformer for Optical Character Recognition
- URL: http://arxiv.org/abs/2308.15996v1
- Date: Wed, 30 Aug 2023 12:37:03 GMT
- Title: DTrOCR: Decoder-only Transformer for Optical Character Recognition
- Authors: Masato Fujitake
- Abstract summary: We propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR).
This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus.
Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typical text recognition methods rely on an encoder-decoder structure, in
which the encoder extracts features from an image, and the decoder produces
recognized text from these features. In this study, we propose a simpler and
more effective method for text recognition, known as the Decoder-only
Transformer for Optical Character Recognition (DTrOCR). This method uses a
decoder-only Transformer to take advantage of a generative language model that
is pre-trained on a large corpus. We examined whether a generative language
model that has been successful in natural language processing can also be
effective for text recognition in computer vision. Our experiments demonstrated
that DTrOCR outperforms current state-of-the-art methods by a large margin in
the recognition of printed, handwritten, and scene text in both English and
Chinese.
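The core idea above — feed image patch embeddings and text token embeddings to a single causal Transformer decoder, then generate the text autoregressively — can be sketched as follows. This is a minimal toy illustration with untrained random weights, a single attention layer, and made-up dimensions; it is not the paper's architecture or its pre-trained language model, only the data flow.

```python
# Toy sketch of a decoder-only OCR forward pass: patch embeddings form the
# prefix, and text tokens are generated one at a time behind a causal mask.
# All weights are random placeholders (assumption), so the output tokens are
# arbitrary; the point is the single-stack, prefix-then-generate structure.
import numpy as np

rng = np.random.default_rng(0)
D = 16       # embedding width (toy value)
VOCAB = 8    # toy character vocabulary size

def patch_embed(image, patch=4):
    """Split an HxW image into patch x patch tiles and project each to D dims."""
    H, W = image.shape
    tiles = [image[i:i + patch, j:j + patch].ravel()
             for i in range(0, H, patch) for j in range(0, W, patch)]
    W_proj = rng.standard_normal((patch * patch, D)) * 0.02
    return np.stack(tiles) @ W_proj          # (num_patches, D)

def causal_self_attention(x):
    """Single-head attention where position t only attends to positions <= t."""
    T = x.shape[0]
    scores = x @ x.T / np.sqrt(D)            # untrained: identity q/k/v projections
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)    # causal mask: no peeking ahead
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

def greedy_decode(image, max_len=5):
    """Prepend patch embeddings as a prefix, then emit token ids greedily."""
    tok_embed = rng.standard_normal((VOCAB, D)) * 0.02
    lm_head = rng.standard_normal((D, VOCAB)) * 0.02
    prefix = patch_embed(image)
    tokens = []
    for _ in range(max_len):
        x = np.concatenate([prefix, tok_embed[tokens]]) if tokens else prefix
        h = causal_self_attention(x)
        logits = h[-1] @ lm_head             # predict next token from last position
        tokens.append(int(np.argmax(logits)))
    return tokens

print(greedy_decode(rng.standard_normal((8, 8))))   # five toy token ids
```

In a real system the decoder would be a pre-trained generative language model and the patch projection would be learned, but the shape of the computation is the same: one decoder stack, no separate image encoder.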
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words [10.2138250640885]
We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords in text prompts.
We adopt decoder-only architecture and use our in-house LLM, PLaMo-100B, pre-trained from scratch using datasets dominated by Japanese and English texts as the decoder.
arXiv Detail & Related papers (2024-08-15T08:50:58Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
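The pre-training objective described here — feed text-encoder outputs to the decoder as if they were visual features, and train the decoder to read the text back out — can be illustrated with a toy loop. The `clip_text_encode` function below is a stand-in for the real CLIP text encoder (an assumption, not its actual API), and the linear decoder head is likewise a placeholder.

```python
# Hedged sketch of the DPTR idea: text embeddings act as pseudo visual
# embeddings, and the decoder is pre-trained with cross-entropy to recover
# the characters. Encoder, decoder, and dimensions are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 12, 26                 # toy embedding width and a-z vocabulary

def clip_text_encode(text):
    """Stand-in encoder: a fixed random embedding per character (not real CLIP)."""
    table = np.random.default_rng(42).standard_normal((VOCAB, D))
    ids = [ord(c) - ord('a') for c in text]
    return table[ids], ids        # pseudo "visual" features + target ids

def pretrain_step(text, W, lr=0.5):
    """One cross-entropy step: the decoder head learns to read the pseudo features."""
    feats, ids = clip_text_encode(text)
    logits = feats @ W
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(len(ids)), ids] -= 1.0       # d(cross-entropy)/d(logits)
    W -= lr * feats.T @ grad / len(ids)
    return float(-np.log(probs[np.arange(len(ids)), ids]).mean())

W = np.zeros((D, VOCAB))
losses = [pretrain_step("pretrainwithtext", W) for _ in range(50)]
print(round(losses[0], 3), "->", round(losses[-1], 3))   # loss falls as it learns
```

The real method pre-trains a full Transformer decoder and must then bridge the gap to genuine visual features at fine-tuning time; this sketch only shows why text-only features are enough to supervise the decoding objective.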
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
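Replacing one-hot targets with corpus-derived distributions can be sketched concretely: instead of supervising "the next character is exactly `h`", the target becomes a smoothed distribution over likely next characters estimated from text. The corpus, bigram statistics, and smoothing scheme below are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: build soft next-character targets from corpus bigram counts,
# in the spirit of replacing one-hot encoding with linguistic priors.
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the hat"        # toy corpus (assumption)
chars = sorted(set(corpus))
idx = {c: i for i, c in enumerate(chars)}
bigrams = Counter(zip(corpus, corpus[1:]))       # (prev, next) counts

def soft_target(prev_char, alpha=0.1):
    """Empirical next-char distribution after prev_char, mixed with uniform mass."""
    dist = np.full(len(chars), alpha / len(chars))
    total = sum(n for (p, _), n in bigrams.items() if p == prev_char)
    for (p, c), n in bigrams.items():
        if p == prev_char:
            dist[idx[c]] += (1 - alpha) * n / total
    return dist

t = soft_target("t")   # after 't', both 'h' ("the") and ' ' ("cat ") get mass
print({chars[i]: round(p, 2) for i, p in enumerate(t) if p > 0.05})
```

Training against such distributions (e.g. with soft cross-entropy) lets the recognizer absorb language statistics without a separate language-model pass at inference.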
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
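The masked image modeling recipe mentioned here — hide a subset of patches and train the model to reconstruct them from the visible context — can be demonstrated with a deliberately tiny linear stand-in. The low-rank "image" and the linear predictor are toy assumptions; the real method pre-trains a Vision Transformer encoder on unlabeled text images.

```python
# Toy masked-image-modeling loop: patches vary smoothly with position, half are
# masked each step, the model fits the visible half and is scored on the masked
# half. A linear map stands in for the real feature encoder (assumption).
import numpy as np

rng = np.random.default_rng(0)
P, D = 16, 8                                        # patches per image, patch dim (toy)
pos = np.linspace(0, 1, P)[:, None] ** np.arange(3) # simple position features (P, 3)
patches = pos @ rng.standard_normal((3, D))         # low-rank "image" content

def mim_step(W, lr=0.3, ratio=0.5):
    """Mask a random half, fit on the visible half, score error on the masked half."""
    hidden = rng.choice(P, size=int(P * ratio), replace=False)
    visible = np.setdiff1d(np.arange(P), hidden)
    err_v = pos[visible] @ W - patches[visible]     # gradient from visible patches
    W -= lr * pos[visible].T @ err_v / len(visible)
    err_h = pos[hidden] @ W - patches[hidden]       # reconstruction error on masked
    return float((err_h ** 2).mean())

W = np.zeros((3, D))
losses = [mim_step(W) for _ in range(200)]
print(round(losses[0], 3), "->", round(losses[-1], 3))  # masked-patch error falls
```

The held-out reconstruction error drops because predicting hidden patches forces the model to exploit structure shared with the visible ones — the same pressure that makes masked pre-training produce useful features at scale.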
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2020-05-22T03:02:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.