Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of
OCR-generated Hindi Text Using BERT and Levenshtein Distance
- URL: http://arxiv.org/abs/2012.07652v1
- Date: Mon, 14 Dec 2020 15:49:54 GMT
- Authors: Aditya Pal, Abhijit Mustafi
- Abstract summary: Vartani Spellcheck is a context-sensitive approach for spelling correction of Hindi text.
With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Optical Character Recognition (OCR) systems that generate text of
highly inflectional Indic languages like Hindi tend to suffer from poor
accuracy due to a wide alphabet set, compound characters and difficulty in
segmenting characters in a word. Automatic spelling error detection and
context-sensitive error correction can be used to improve accuracy by
post-processing the text generated by these OCR systems. A majority of
previously developed language models for error correction of Hindi spelling
have been context-free. In this paper, we present Vartani Spellcheck - a
context-sensitive approach for spelling correction of Hindi text using a
state-of-the-art transformer - BERT in conjunction with the Levenshtein
distance algorithm, popularly known as Edit Distance. We use a lookup
dictionary and context-based named entity recognition (NER) for detection of
possible spelling errors in the text. Our proposed technique has been tested on
a large corpus of text generated by the widely used Tesseract OCR on the Hindi
epic Ramayana. With an accuracy of 81%, the results show a significant
improvement over some of the previously established context-sensitive error
correction mechanisms for Hindi. We also explain how Vartani Spellcheck may be
used for on-the-fly autocorrect suggestion during continuous typing in a text
editor environment.
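
The detect-then-correct idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy lexicon, and the candidate list are hypothetical. It assumes a context model such as a BERT masked-language model has already proposed candidate words in order of preference, and it leaves out the NER step the abstract mentions. Edit distance is counted over Unicode code points, which is a simplification for Devanagari text.

```python
from typing import Iterable, List, Set


def levenshtein(a: str, b: str) -> int:
    """Levenshtein (edit) distance, counted over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def flag_suspects(tokens: List[str], lexicon: Set[str]) -> List[int]:
    """Return positions of tokens missing from the lookup dictionary."""
    return [i for i, tok in enumerate(tokens) if tok not in lexicon]


def pick_correction(suspect: str, candidates: Iterable[str],
                    max_distance: int = 2) -> str:
    """Choose the candidate closest to the suspect word.

    `candidates` is assumed to come from a context model, most likely
    first, so ties in distance go to the earlier candidate. If nothing
    is within `max_distance` (an arbitrary cutoff), keep the original.
    """
    best, best_distance = suspect, max_distance + 1
    for cand in candidates:
        d = levenshtein(suspect, cand)
        if d < best_distance:
            best, best_distance = cand, d
    return best


# Toy data, invented for illustration only.
lexicon = {"राम", "वन", "गए"}
tokens = ["राम", "वन", "गयए"]
model_candidates = ["आए", "गए", "चले"]  # hypothetical model output

for pos in flag_suspects(tokens, lexicon):
    tokens[pos] = pick_correction(tokens[pos], model_candidates)
```

Restricting the choice to candidates the context model already considers plausible, and then preferring the one nearest in edit distance, is one simple way to combine the two signals. A weighted score over model probability and distance would be another.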
Related papers
- Automatic Real-word Error Correction in Persian Text [0.0]
This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text.
We employ semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy.
Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase.
arXiv Detail & Related papers (2024-07-20T07:50:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency [8.888638284299736]
We create a lattice of plausible respellings of the reference transcription using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model.
Our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task.
arXiv Detail & Related papers (2023-06-07T15:39:02Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework for Spelling Error Correction of Bangla and Resource Scarce Indic Languages [1.7205106391379026]
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework based on denoising transformers by addressing previous issues.
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Misspelling Correction with Pre-trained Contextual Language Model [0.0]
We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections.
The results of our experiments demonstrated that when combined properly, contextual word embeddings of BERT and edit distance are capable of effectively correcting spelling errors.
arXiv Detail & Related papers (2021-01-08T20:11:01Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.