Post-OCR Document Correction with large Ensembles of Character Sequence
Models
- URL: http://arxiv.org/abs/2109.06264v2
- Date: Wed, 15 Sep 2021 09:54:33 GMT
- Title: Post-OCR Document Correction with large Ensembles of Character Sequence
Models
- Authors: Juan Ramirez-Orta and Eduardo Xamena and Ana Maguitman and Evangelos
Milios and Axel J. Soto
- Abstract summary: We propose a novel method to correct documents already processed with Optical Character Recognition (OCR) systems.
The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model.
We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them.
- Score: 0.3359875577705537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel method based on character
sequence-to-sequence models to correct documents already processed with Optical
Character Recognition (OCR) systems. The main contribution of this paper is a
set of strategies to accurately process strings much longer than the ones used
to train the sequence model while being sample- and resource-efficient,
supported by thorough experimentation. The strategy with the best performance
involves splitting the input document in character n-grams and combining their
individual corrections into the final output using a voting scheme that is
equivalent to an ensemble of a large number of sequence models. We further
investigate how to weigh the contributions from each one of the members of this
ensemble. We test our method on nine languages of the ICDAR 2019 competition on
post-OCR text correction and achieve a new state-of-the-art performance in five
of them. Our code for post-OCR correction is shared at
https://github.com/jarobyte91/post_ocr_correction.
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with
Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Seq2seq is All You Need for Coreference Resolution [26.551602768015986]
We finetune a pretrained seq2seq transformer to map an input document to a tagged sequence encoding the coreference annotation.
Our model outperforms or closely matches the best coreference systems in the literature on an array of datasets.
arXiv Detail & Related papers (2023-10-20T19:17:22Z) - EfficientOCR: An Extensible, Open-Source Package for Efficiently
Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z) - CLaC at SemEval-2023 Task 2: Comparing Span-Prediction and
Sequence-Labeling approaches for NER [0.554780083433538]
This paper summarizes the CLaC submission for the MultiCoNER 2 task.
We compare two popular approaches for NER, namely Sequence Labeling and Span Prediction.
We find that our best Span Prediction system performs slightly better than our best Sequence Labeling system on test data.
arXiv Detail & Related papers (2023-05-05T20:49:40Z) - Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned
Receipt Images [0.07673339435080445]
We propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks.
In our experiments, the model finetuned with our strategy achieved 64.4 F1-score and a 22.8% character error rate.
arXiv Detail & Related papers (2022-12-11T15:45:26Z) - DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix which is deterministic and easier for model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - COCO-LM: Correcting and Contrasting Text Sequences for Language Model
Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.