Lexically Aware Semi-Supervised Learning for OCR Post-Correction
- URL: http://arxiv.org/abs/2111.02622v1
- Date: Thu, 4 Nov 2021 04:39:02 GMT
- Title: Lexically Aware Semi-Supervised Learning for OCR Post-Correction
- Authors: Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham
Neubig
- Abstract summary: Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
- Score: 90.54336622024299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much of the existing linguistic data in many languages of the world is locked
away in non-digitized books and documents. Optical character recognition (OCR)
can be used to produce digitized text, and previous work has demonstrated the
utility of neural post-correction methods that improve the results of
general-purpose OCR systems on recognition of less-well-resourced languages.
However, these methods rely on manually curated post-correction data, which are
relatively scarce compared to the non-annotated raw images that need to be
digitized.
In this paper, we present a semi-supervised learning method that makes it
possible to utilize these raw images to improve performance, specifically
through the use of self-training, a technique where a model is iteratively
trained on its own outputs. In addition, to enforce consistency in the
recognized vocabulary, we introduce a lexically-aware decoding method that
augments the neural post-correction model with a count-based language model
constructed from the recognized texts, implemented using weighted finite-state
automata (WFSA) for efficient and effective decoding.
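As a rough illustration of the self-training component, the Python sketch below retrains a post-correction model on its own pseudo-labels over unlabeled OCR output; `PostCorrector`, `train`, and `correct` are hypothetical stand-ins for the paper's neural model, not its actual API.

```python
# Minimal self-training loop: the post-correction model is iteratively
# retrained on its own outputs for unlabeled OCR text. All names here
# are illustrative stand-ins, not the authors' released interface.
from dataclasses import dataclass, field

@dataclass
class PostCorrector:
    """Stand-in for a neural OCR post-correction model."""
    data: list = field(default_factory=list)

    def train(self, pairs):
        # The real system would run gradient updates; we just record pairs.
        self.data.extend(pairs)

    def correct(self, noisy: str) -> str:
        # Placeholder inference: return the input unchanged.
        return noisy

def self_train(labeled, unlabeled, rounds: int = 3) -> PostCorrector:
    model = PostCorrector()
    model.train(labeled)                       # supervised warm start
    for _ in range(rounds):
        # Pseudo-label the raw OCR outputs with the current model ...
        pseudo = [(x, model.correct(x)) for x in unlabeled]
        # ... then retrain from scratch on gold plus pseudo-labeled pairs.
        model = PostCorrector()
        model.train(labeled + pseudo)
    return model
```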
Results on four endangered languages demonstrate the utility of the proposed
method, with relative error reductions of 15-29%, where we find the combination
of self-training and lexically-aware decoding essential for achieving
consistent improvements. Data and code are available at
https://shrutirij.github.io/ocr-el/.
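For the lexically-aware decoding step, the sketch below builds a count-based lexicon from already-recognized text and uses it to re-rank candidate corrections against the neural model's scores. The paper compiles its language model into a weighted finite-state automaton; the smoothed word-count table here is a deliberately simplified stand-in, and `build_lexicon`, `rescore`, and `alpha` are illustrative names, not from the released code.

```python
# Toy lexically-aware rescoring: a count-based lexicon built from
# recognized text re-ranks candidate corrections. A Counter over words
# is a simplified stand-in for the paper's WFSA machinery.
import math
from collections import Counter

def build_lexicon(recognized_texts):
    counts = Counter(w for t in recognized_texts for w in t.split())
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing: unseen words are penalized, not forbidden.
    return lambda w: math.log((counts[w] + 1) / (total + vocab + 1))

def rescore(candidates, neural_scores, lex_logprob, alpha=0.5):
    """Interpolate neural scores with lexicon log-probabilities."""
    def score(cand, neural):
        lm = sum(lex_logprob(w) for w in cand.split())
        return (1 - alpha) * neural + alpha * lm
    return max(zip(candidates, neural_scores),
               key=lambda cn: score(*cn))[0]

lex = build_lexicon(["the cat sat", "the dog sat"])
print(rescore(["the cat", "thc cat"], [-1.2, -1.1], lex))  # -> "the cat"
```

Interpolating the two scores (via `alpha`) is what keeps hypotheses consistent with the vocabulary observed across the whole recognized corpus.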
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels [0.0]
We introduce the task of comparing a handwriting image to text.
Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network.
Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
arXiv Detail & Related papers (2023-09-18T21:13:42Z)
- Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts [0.934612743192798]
This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction using significantly less manually created data.
An analysis of historical OCRed newspapers was done to learn common language and corpus-specific OCR errors.
arXiv Detail & Related papers (2023-07-30T12:59:06Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition [16.987008461171065]
We explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition.
Our method adds intermediate layers, called adapters, for each task and efficiently distills knowledge from the previous model while learning the current task.
We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task.
arXiv Detail & Related papers (2023-03-16T14:27:45Z)
- Uncovering the Handwritten Text in the Margins: End-to-end Handwritten Text Detection and Recognition [0.840835093659811]
This work presents an end-to-end framework for automatic detection and recognition of handwritten marginalia.
It uses data augmentation and transfer learning to overcome training data scarcity.
The effectiveness of the proposed framework has been empirically evaluated on the data from early book collections found in the Uppsala University Library in Sweden.
arXiv Detail & Related papers (2023-03-10T14:00:53Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.