DTrOCR: Decoder-only Transformer for Optical Character Recognition
- URL: http://arxiv.org/abs/2308.15996v1
- Date: Wed, 30 Aug 2023 12:37:03 GMT
- Title: DTrOCR: Decoder-only Transformer for Optical Character Recognition
- Authors: Masato Fujitake
- Abstract summary: We propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR)
This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus.
Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typical text recognition methods rely on an encoder-decoder structure, in
which the encoder extracts features from an image, and the decoder produces
recognized text from these features. In this study, we propose a simpler and
more effective method for text recognition, known as the Decoder-only
Transformer for Optical Character Recognition (DTrOCR). This method uses a
decoder-only Transformer to take advantage of a generative language model that
is pre-trained on a large corpus. We examined whether a generative language
model that has been successful in natural language processing can also be
effective for text recognition in computer vision. Our experiments demonstrated
that DTrOCR outperforms current state-of-the-art methods by a large margin in
the recognition of printed, handwritten, and scene text in both English and
Chinese.
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - Online Gesture Recognition using Transformer and Natural Language
Processing [0.0]
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
arXiv Detail & Related papers (2023-05-05T10:17:22Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z) - Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z) - SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2020-05-22T03:02:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.