LOCR: Location-Guided Transformer for Optical Character Recognition
- URL: http://arxiv.org/abs/2403.02127v1
- Date: Mon, 4 Mar 2024 15:34:12 GMT
- Title: LOCR: Location-Guided Transformer for Optical Character Recognition
- Authors: Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen
Zhong
- Abstract summary: We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
- Score: 55.195165959662795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Academic documents are packed with texts, equations, tables, and figures,
requiring comprehensive understanding for accurate Optical Character
Recognition (OCR). While end-to-end OCR methods offer improved accuracy over
layout-based approaches, they often grapple with significant repetition issues,
especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this
issue, we propose LOCR, a model that integrates location guiding into the
transformer architecture during autoregression. We train the model on a dataset
comprising over 77M text-location pairs from 125K academic document pages,
including bounding boxes for words, tables and mathematical symbols. LOCR
adeptly handles various formatting elements and generates content in Markdown
language. It outperforms all existing methods in our test set constructed from
arXiv, as measured by edit distance, BLEU, METEOR and F-measure. LOCR also
reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset,
from 13.2% to 1.3% in OOD quantum physics documents and from 8.1% to 1.8% in
OOD marketing documents. Additionally, LOCR features an interactive OCR mode,
facilitating the generation of complex documents through a few location prompts
from a human.
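Edit distance, the first metric the abstract lists, is typically reported in normalized form (Levenshtein distance divided by reference length). A minimal sketch of that computation, not LOCR's actual evaluation code:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(ref: str, hyp: str) -> float:
    """Edit distance scaled by reference length; lower is better."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

The same quantity, computed over characters, is the character error rate (CER) that several of the related papers below report.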
Related papers
- GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System [3.9527064697847005]
We present an end-to-end paragraph recognition system that incorporates internal line segmentation and a convolutional-layer-based encoder.
This study reported character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-2016, and a word error rate of 5.73% on READ-2016.
arXiv Detail & Related papers (2024-04-22T10:19:16Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Toward Real Text Manipulation Detection: New Dataset and New Solution
High costs associated with professional text manipulation limit the availability of real-world datasets.
We present the Real Text Manipulation dataset, encompassing 14,250 text images.
Our contributions aim to propel advancements in real-world text tampering detection.
arXiv Detail & Related papers (2023-12-12T02:10:16Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- KOSMOS-2.5: A Multimodal Literate Model [136.96172068766285]
We present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.
KOSMOS-2.5 excels in two distinct yet complementary transcription tasks.
We fine-tune KOSMOS-2.5 for document understanding tasks, resulting in a document understanding generalist named KOSMOS-2.5-CHAT.
arXiv Detail & Related papers (2023-09-20T15:50:08Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images [0.07673339435080445]
We propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks.
In our experiments, the model fine-tuned with our strategy achieved a 64.4 F1-score and a 22.8% character error rate.
arXiv Detail & Related papers (2022-12-11T15:45:26Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
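The entry above compares Connectionist Temporal Classification (CTC) decoders against Transformer decoders. The standard CTC inference rule (greedy best-path decoding) can be sketched as follows, assuming integer symbol ids with blank = 0; this is an illustration of the general technique, not code from any of the listed papers:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC best-path decoding: collapse consecutive repeats,
    then drop blank symbols."""
    decoded = []
    prev = None
    for symbol in frame_ids:
        # A repeated symbol only counts once unless separated by a blank.
        if symbol != prev and symbol != blank:
            decoded.append(symbol)
        prev = symbol
    return decoded
```

For example, the per-frame output `[1, 1, 0, 1, 2, 2, 0, 3]` collapses to `[1, 1, 2, 3]`: the blank between the two 1s preserves the genuine repetition, while the adjacent 2s merge into one.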
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.