EfficientOCR: An Extensible, Open-Source Package for Efficiently
Digitizing World Knowledge
- URL: http://arxiv.org/abs/2310.10050v1
- Date: Mon, 16 Oct 2023 04:20:16 GMT
- Authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
- Abstract summary: EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Billions of public domain documents remain trapped in hard copy or lack an
accurate digitization. Modern natural language processing methods cannot be
used to index, retrieve, and summarize their texts; conduct computational
textual analyses; or extract information for statistical analyses, and these
texts cannot be incorporated into language model training. Given the diversity
and sheer quantity of public domain texts, liberating them at scale requires
optical character recognition (OCR) that is accurate, extremely cheap to
deploy, and sample-efficient to customize to novel collections, languages, and
character sets. Existing OCR engines, largely designed for small-scale
commercial applications in high resource languages, often fall short of these
requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets
both the computational and sample efficiency requirements for liberating texts
at scale by abandoning the sequence-to-sequence architecture typically used for
OCR, which takes representations from a learned vision model as inputs to a
learned language model. Instead, EffOCR models OCR as a character or word-level
image retrieval problem. EffOCR is cheap and sample-efficient to train, as the
model only needs to learn characters' visual appearance, not how they are used
in sequence to form language. Models in the EffOCR model zoo can be deployed
off-the-shelf with only a few lines of code. Importantly, EffOCR also allows
for easy customization through a simple model training interface, with minimal
labeling requirements due to its sample efficiency. We
illustrate the utility of EffOCR by cheaply and accurately digitizing 20
million historical U.S. newspaper scans, evaluating zero-shot performance on
randomly selected documents from the U.S. National Archives, and accurately
digitizing Japanese documents for which all other OCR solutions failed.
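The retrieval framing described above can be sketched in a few lines: embed each character crop with a vision encoder, then match it against a pre-computed index of reference character embeddings by cosine similarity. The following is a minimal, self-contained illustration only; a random projection stands in for a learned encoder, and all names, shapes, and the toy index are assumptions for demonstration, not EffOCR's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

def embed(crop: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Map a flattened character crop to a unit-length embedding."""
    v = projection @ crop.ravel()
    return v / np.linalg.norm(v)

# Toy reference index: one exemplar crop per character. In a real system
# the encoder is contrastively trained and the index covers the full
# character set of the target collection.
chars = list("abc")
projection = rng.normal(size=(EMBED_DIM, 28 * 28))
reference_crops = {c: rng.normal(size=(28, 28)) for c in chars}
index = np.stack([embed(reference_crops[c], projection) for c in chars])

def recognize(crop: np.ndarray) -> str:
    """Return the character whose reference embedding is most similar."""
    sims = index @ embed(crop, projection)  # cosine similarities
    return chars[int(np.argmax(sims))]

# A noisy version of the 'b' exemplar should still retrieve 'b'.
noisy_b = reference_crops["b"] + 0.1 * rng.normal(size=(28, 28))
print(recognize(noisy_b))
```

Because recognition reduces to nearest-neighbor lookup over character embeddings, customizing to a new character set only requires embedding a handful of labeled crops, which is the source of the sample efficiency claimed above.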
Related papers
- Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation [0.0]
This study explores the transfer learning capabilities of the TrOCR architecture to Spanish.
We integrate an English TrOCR encoder with a language specific decoder and train the model on this specific language.
For a fixed dataset size, fine-tuning the English TrOCR on Spanish yields better recognition than pairing it with a language-specific decoder.
arXiv Detail & Related papers (2024-07-09T15:31:41Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Efficient OCR for Building a Diverse Digital History [1.8434042562191815]
This study models OCR as a character-level image retrieval problem, using a contrastively trained vision model.
Because the model only learns characters' visual features, it is more efficient than existing architectures, enabling accurate OCR in settings where existing solutions fail.
Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
arXiv Detail & Related papers (2023-04-05T20:36:04Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images [0.07673339435080445]
We propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks.
In our experiments, the model finetuned with our strategy achieved 64.4 F1-score and a 22.8% character error rate.
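The random-cropping idea behind this localization-free fine-tuning can be sketched simply: sample fixed-size chunks at uniformly random positions from the full page image and feed them to the instance-level recognizer. The chunk size and sampling scheme below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def random_chunks(page: np.ndarray, chunk_h: int, chunk_w: int,
                  n: int, rng: np.random.Generator) -> list:
    """Sample n random chunk_h x chunk_w windows from a 2-D page image."""
    H, W = page.shape
    chunks = []
    for _ in range(n):
        top = int(rng.integers(0, H - chunk_h + 1))
        left = int(rng.integers(0, W - chunk_w + 1))
        chunks.append(page[top:top + chunk_h, left:left + chunk_w])
    return chunks

rng = np.random.default_rng(42)
page = rng.integers(0, 256, size=(1024, 768), dtype=np.uint8)  # stand-in scan
chunks = random_chunks(page, chunk_h=128, chunk_w=384, n=8, rng=rng)
print(len(chunks), chunks[0].shape)
```

Training on such chunks exposes the model to text at arbitrary page positions, which is what removes the need for a separate text-localization step.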
arXiv Detail & Related papers (2022-12-11T15:45:26Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- PP-OCR: A Practical Ultra Lightweight OCR System [8.740684949994664]
We propose a practical ultra lightweight OCR system, i.e., PP-OCR.
The overall model size of the PP-OCR is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols.
arXiv Detail & Related papers (2020-09-21T14:57:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.