Rerunning OCR: A Machine Learning Approach to Quality Assessment and
Enhancement Prediction
- URL: http://arxiv.org/abs/2110.01661v3
- Date: Thu, 7 Oct 2021 15:25:30 GMT
- Authors: Pit Schneider
- Abstract summary: Iterating with new and improved OCR solutions enforces decisions to be taken when it comes to targeting the right reprocessing candidates.
This article captures the efforts of the National Library of Luxembourg to support those exact decisions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Iterating with new and improved OCR solutions enforces decisions to be taken
when it comes to targeting the right reprocessing candidates. This especially
applies when the underlying data collection is of considerable size and rather
diverse in terms of fonts, languages, periods of publication and consequently
OCR quality. This article captures the efforts of the National Library of
Luxembourg to support those exact decisions. They are crucial in order to
guarantee low computational overhead and reduced quality degradation risks,
combined with a more quantifiable OCR improvement. In particular, this work
explains the methodology of the library with respect to text block level
quality assessment. As an extension of this technique, another contribution
comes in the form of a regression model that takes the enhancement potential of
a new OCR engine into account. They both mark promising approaches, especially
for cultural institutions dealing with historic data of lower quality.
Related papers
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Toward Zero-shot Character Recognition: A Gold Standard Dataset with Radical-level Annotations [5.761679637905164]
In this paper, we construct an ancient Chinese character image dataset that contains both radical-level and character-level annotations.
To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm to augment the training samples and apply an image denoising method to improve the image quality.
arXiv Detail & Related papers (2023-08-01T16:41:30Z)
- Bayesian Inverse Contextual Reasoning for Heterogeneous Semantics-Native Communication [47.9462619619438]
When agents do not share the same communication context, the effectiveness of contextual reasoning is compromised.
This article proposes a novel framework for solving the inverse problem of CR in SNC using two Bayesian inference methods.
arXiv Detail & Related papers (2023-06-10T10:10:55Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Better Retrieval May Not Lead to Better Question Answering [59.1892787017522]
A popular approach to improve the system's performance is to improve the quality of the retrieved context from the IR stage.
We show that for StrategyQA, a challenging open-domain QA dataset that requires multi-hop reasoning, this common approach is surprisingly ineffective.
arXiv Detail & Related papers (2022-05-07T16:59:38Z)
- OCR Improves Machine Translation for Low-Resource Languages [10.010595434359647]
We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise.
We evaluate state-of-the-art OCR systems on our benchmark and analyse most common errors.
We then perform an ablation study to investigate how OCR errors impact Machine Translation performance.
arXiv Detail & Related papers (2022-02-27T02:36:45Z)
- Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.