Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR
documents
- URL: http://arxiv.org/abs/2108.02899v1
- Date: Fri, 6 Aug 2021 00:32:54 GMT
- Title: Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR
documents
- Authors: Amit Gupte, Alexey Romanov, Sahitya Mantravadi, Dalitso Banda, Jianjie
Liu, Raza Khan, Lakshmanan Ramu Meenal, Benjamin Han, Soundar Srinivasan
- Abstract summary: We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
- Score: 2.6201102730518606
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Document digitization is essential for the digital transformation of our
societies, yet a crucial step in the process, Optical Character Recognition
(OCR), is still not perfect. Even commercial OCR systems can produce
questionable output depending on the fidelity of the scanned documents. In this
paper, we demonstrate an effective framework for mitigating OCR errors for any
downstream NLP task, using Named Entity Recognition (NER) as an example. We
first address the data scarcity problem for model training by constructing a
document synthesis pipeline, generating realistic but degraded data with NER
labels. We measure the NER accuracy drop at various degradation levels and show
that a text restoration model, trained on the degraded data, significantly
closes the NER accuracy gaps caused by OCR errors, including on an
out-of-domain dataset. For the benefit of the community, we have made the
document synthesis pipeline available as an open-source project.
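The degraded-data generation step described above can be illustrated with a minimal text-level sketch. Note that the paper's pipeline synthesizes degraded document *images*; this toy version instead injects character-level errors directly into text, and the confusion table, error rates, and function names here are illustrative assumptions, not the paper's implementation.

```python
import random

# Illustrative OCR confusion pairs; a real error distribution would be
# estimated from actual OCR output on scanned documents.
CONFUSIONS = {"l": "1", "O": "0", "e": "c", "S": "5", "B": "8"}

def degrade(text, level=0.1, seed=0):
    """Inject OCR-like character errors at the given degradation level,
    keeping token-level NER labels recoverable by alignment."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < level:
            out.append(CONFUSIONS[ch])      # substitution error
        elif rng.random() < level / 5:
            continue                        # deletion error
        else:
            out.append(ch)
    return "".join(out)

print(degrade("Steve Jobs founded Apple in California.", level=0.3))
```

A text restoration model trained on such (degraded, clean) pairs then learns to invert these corruptions before the downstream NER model runs, which is the mechanism the abstract credits with closing the accuracy gap.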
Related papers
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z) - CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C)
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z) - DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z) - NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement [4.841365627573421]
A preprocessing step is essential to eliminate noise while preserving the text and key features of documents.
We propose NAF-DPM, a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents.
arXiv Detail & Related papers (2024-04-08T16:52:21Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has a similar impact to data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z) - Unsupervised Structure-Texture Separation Network for Oracle Character
Recognition [70.29024469395608]
Oracle bone script is the earliest-known Chinese writing system of the Shang dynasty and is precious to archeology and philology.
We propose a structure-texture separation network (STSN), which is an end-to-end learning framework for joint disentanglement, transformation, adaptation and recognition.
arXiv Detail & Related papers (2022-05-13T10:27:02Z) - OCR-IDL: OCR Annotations for Industry Document Library Dataset [8.905920197601171]
We make public the OCR annotations for IDL documents, produced using a commercial OCR engine.
The contributed dataset (OCR-IDL) has an estimated monetary value of over 20K US$.
arXiv Detail & Related papers (2022-02-25T21:30:48Z) - Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without an underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - DocScanner: Robust Document Image Rectification with Progressive
Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.