Full Page Handwriting Recognition via Image to Sequence Extraction
- URL: http://arxiv.org/abs/2103.06450v1
- Date: Thu, 11 Mar 2021 04:37:29 GMT
- Title: Full Page Handwriting Recognition via Image to Sequence Extraction
- Authors: Sumeet S. Singh, Sergey Karayev
- Abstract summary: The model achieves a new state-of-art in full page recognition on the IAM dataset.
It is deployed in production as part of a commercial web application.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Neural Network based Handwritten Text Recognition (HTR) model
architecture that can be trained to recognize full pages of handwritten or
printed text without image segmentation. Being based on an Image to Sequence
architecture, it can be trained to extract text present in an image and
sequence it correctly without imposing any constraints on language, shape of
characters or orientation and layout of text and non-text. The model can also
be trained to generate auxiliary markup related to formatting, layout and
content. We use character level token vocabulary, thereby supporting proper
nouns and terminology of any subject. The model achieves a new state-of-art in
full page recognition on the IAM dataset and when evaluated on scans of real
world handwritten free form test answers - a dataset beset with curved and
slanted lines, drawings, tables, math, chemistry and other symbols - it
performs better than all commercially available HTR APIs. It is deployed in
production as part of a commercial web application.
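The abstract describes the architecture only at a high level. Below is a minimal, illustrative sketch (not the authors' implementation) of the image-to-sequence idea: a convolutional encoder flattens the full-page image into a feature sequence, and an autoregressive Transformer decoder emits character-level tokens. All layer sizes, the vocabulary size, and the class name are assumptions.

```python
# Minimal sketch (not the paper's code): full page image -> character sequence.
import torch
import torch.nn as nn

class Im2SeqHTR(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # CNN encoder: downsample the page into a grid of feature vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Character-level token embedding (markup/control tokens would share this vocab).
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, page, tgt_tokens):
        # page: (B, 1, H, W) grayscale page; tgt_tokens: (B, T) teacher-forced characters.
        # Positional encodings are omitted for brevity.
        feats = self.encoder(page)                    # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)     # (B, H'*W', C) feature sequence
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.decoder(self.tok_emb(tgt_tokens), memory, tgt_mask=causal)
        return self.out(dec)                          # (B, T, vocab_size) logits

# Toy usage: a 96-character vocabulary, two synthetic pages, 40 teacher-forced tokens.
model = Im2SeqHTR(vocab_size=96)
logits = model(torch.randn(2, 1, 256, 192), torch.randint(0, 96, (2, 40)))
print(logits.shape)  # torch.Size([2, 40, 96])
```

At inference the decoder would be run autoregressively from a start token until an end-of-page token, which is how a segmentation-free model can emit text, layout markup, and reading order in a single output sequence.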
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
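A rough sketch of the decoder pre-training idea described above: text embeddings from a frozen CLIP text encoder stand in for visual features, and a small character-level decoder is trained to reconstruct the text from them. The checkpoint name, character set, and decoder size are illustrative assumptions, not the DPTR configuration.

```python
# Sketch: pre-train a text decoder on CLIP text embeddings used as pseudo visual features.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

charset = list("abcdefghijklmnopqrstuvwxyz ")
char2id = {c: i + 1 for i, c in enumerate(charset)}      # 0 = padding / BOS

d_model = clip_text.config.hidden_size                   # 512 for this checkpoint
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
embed = nn.Embedding(len(charset) + 1, d_model)
head = nn.Linear(d_model, len(charset) + 1)

texts = ["hello world", "scene text"]
with torch.no_grad():
    enc = tokenizer(texts, padding=True, return_tensors="pt")
    memory = clip_text(**enc).last_hidden_state          # (B, L, 512) pseudo visual embeddings

# Teacher forcing: character targets shifted right by one (id 0 doubles as BOS).
tgt_ids = nn.utils.rnn.pad_sequence(
    [torch.tensor([char2id[c] for c in t]) for t in texts], batch_first=True)
inp_ids = nn.functional.pad(tgt_ids, (1, 0))[:, :-1]
causal = nn.Transformer.generate_square_subsequent_mask(inp_ids.size(1))

logits = head(decoder(embed(inp_ids), memory, tgt_mask=causal))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1), ignore_index=0)
print(loss.item())
```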
- DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents [4.298545628576284]
We introduce DANIEL (Document Attention Network for Information Extraction and Labelling), a fully end-to-end architecture for handwritten document understanding.
DANIEL performs layout recognition, handwriting recognition, and named entity recognition on full-page documents.
It can simultaneously learn across multiple languages, layouts, and tasks.
arXiv Detail & Related papers (2024-07-12T09:09:56Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method that generates styled handwritten word images rendering a given text content verbatim.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) on free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)
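One illustrative way to realize the deformable-convolution idea above (not the authors' code) is torchvision's DeformConv2d, with sampling offsets predicted from the input feature map so the kernel can follow curved or slanted strokes:

```python
# Sketch: swap a standard convolution for a deformable one with input-dependent offsets.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Two offsets (dy, dx) per kernel position per output location.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.act = nn.ReLU()

    def forward(self, x):
        offsets = self.offset_pred(x)        # sampling locations adapt to the input
        return self.act(self.deform_conv(x, offsets))

block = DeformableBlock(64, 128)
y = block(torch.randn(1, 64, 32, 256))       # e.g. a text-line feature map
print(y.shape)                               # torch.Size([1, 128, 32, 256])
```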
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image, but are not used by current models.
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- Modelling the semantics of text in complex document layouts using graph transformer networks [0.0]
We propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span.
We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information.
arXiv Detail & Related papers (2022-02-18T11:49:06Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
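The dual-encoder objective described in the last entry can be sketched as a symmetric contrastive loss over normalized image and text embeddings; the encoders producing those embeddings (an image tower and a text tower) are stand-ins here, and the temperature value is an assumption:

```python
# Sketch of a symmetric contrastive (InfoNCE-style) loss for image-text alignment.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature         # (B, B) similarities
    targets = torch.arange(img.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-in embeddings from hypothetical image/text towers.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```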
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences arising from its use.