LOCR: Location-Guided Transformer for Optical Character Recognition
- URL: http://arxiv.org/abs/2403.02127v1
- Date: Mon, 4 Mar 2024 15:34:12 GMT
- Title: LOCR: Location-Guided Transformer for Optical Character Recognition
- Authors: Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen Zhong
- Abstract summary: We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
- Score: 55.195165959662795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Academic documents are packed with texts, equations, tables, and figures,
requiring comprehensive understanding for accurate Optical Character
Recognition (OCR). While end-to-end OCR methods offer improved accuracy over
layout-based approaches, they often grapple with significant repetition issues,
especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this
issue, we propose LOCR, a model that integrates location guiding into the
transformer architecture during autoregression. We train the model on a dataset
comprising over 77M text-location pairs from 125K academic document pages,
including bounding boxes for words, tables and mathematical symbols. LOCR
adeptly handles various formatting elements and generates content in Markdown
language. It outperforms all existing methods in our test set constructed from
arXiv, as measured by edit distance, BLEU, METEOR and F-measure. LOCR also
reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset,
from 13.2% to 1.3% in OOD quantum physics documents and from 8.1% to 1.8% in
OOD marketing documents. Additionally, LOCR features an interactive OCR mode,
facilitating the generation of complex documents through a few location prompts
from a human.
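Edit distance, the first metric named in the abstract, can be illustrated with a minimal sketch. The function names and the choice to normalize by the length of the longer string are assumptions made for illustration; the abstract does not state which normalization the authors use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return edit_distance(prediction, reference) / longest
```

Under this normalization, a lower value means the OCR output is closer to the reference text, with 0.0 indicating an exact match.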
Related papers
- HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision.
Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis.
This paper introduces HAND, a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks.
arXiv Detail & Related papers (2024-12-25T20:36:29Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction.
It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time.
We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System [3.9527064697847005]
We present an end-to-end paragraph recognition system that incorporates internal line segmentation and a convolutional-layer-based encoder.
This study reported character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-16, and word error rates of 5.73% on READ-2016 datasets.
arXiv Detail & Related papers (2024-04-22T10:19:16Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi-fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Toward Real Text Manipulation Detection: New Dataset and New Solution [58.557504531896704]
High costs associated with professional text manipulation limit the availability of real-world datasets.
We present the Real Text Manipulation dataset, encompassing 14,250 text images.
Our contributions aim to propel advancements in real-world text tampering detection.
arXiv Detail & Related papers (2023-12-12T02:10:16Z)
- KOSMOS-2.5: A Multimodal Literate Model [136.96172068766285]
We present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.
KOSMOS-2.5 excels in two distinct yet complementary transcription tasks.
We fine-tune KOSMOS-2.5 for document understanding tasks, resulting in a document understanding generalist named KOSMOS-2.5-CHAT.
arXiv Detail & Related papers (2023-09-20T15:50:08Z)
- Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images [0.07673339435080445]
We propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks.
In our experiments, the model finetuned with our strategy achieved 64.4 F1-score and a 22.8% character error rate.
arXiv Detail & Related papers (2022-12-11T15:45:26Z)
- Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)