Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of
OCR-generated Hindi Text Using BERT and Levenshtein Distance
- URL: http://arxiv.org/abs/2012.07652v1
- Date: Mon, 14 Dec 2020 15:49:54 GMT
- Authors: Aditya Pal, Abhijit Mustafi
- Abstract summary: Vartani Spellcheck is a context-sensitive approach for spelling correction of Hindi text.
With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi.
- Score: 3.0422254248414276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Optical Character Recognition (OCR) systems that generate text of
highly inflectional Indic languages like Hindi tend to suffer from poor
accuracy due to a wide alphabet set, compound characters and difficulty in
segmenting characters in a word. Automatic spelling error detection and
context-sensitive error correction can be used to improve accuracy by
post-processing the text generated by these OCR systems. A majority of
previously developed language models for error correction of Hindi spelling
have been context-free. In this paper, we present Vartani Spellcheck, a
context-sensitive approach for spelling correction of Hindi text that uses a
state-of-the-art transformer, BERT, in conjunction with the Levenshtein
distance algorithm, popularly known as edit distance. We use a lookup
dictionary and context-based named entity recognition (NER) for detection of
possible spelling errors in the text. Our proposed technique has been tested on
a large corpus of text generated by the widely used Tesseract OCR on the Hindi
epic Ramayana. With an accuracy of 81%, the results show a significant
improvement over some of the previously established context-sensitive error
correction mechanisms for Hindi. We also explain how Vartani Spellcheck may be
used for on-the-fly autocorrect suggestion during continuous typing in a text
editor environment.
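To illustrate how the components named in the abstract could fit together, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It flags tokens missing from a lookup dictionary, then picks the replacement with the smallest Levenshtein distance to the flagged token. The function names, the toy dictionary, and the hard-coded candidate list (standing in for the ranked suggestions a masked language model such as BERT might return) are all hypothetical. Distances are counted over Unicode code points, so a Devanagari vowel sign counts as its own symbol; the paper may treat characters differently.

```python
from typing import List, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def flag_errors(tokens: List[str], dictionary: Set[str]) -> List[int]:
    """Return positions of tokens that are absent from the lookup dictionary."""
    return [i for i, token in enumerate(tokens) if token not in dictionary]


def pick_correction(wrong: str, candidates: List[str]) -> str:
    """Choose the candidate closest to the flagged token.

    `candidates` is assumed to be in model-rank order; min() keeps the
    first of several equally close candidates, so rank breaks ties.
    """
    return min(candidates, key=lambda c: levenshtein(wrong, c))


# Toy example: "वम" stands in for an OCR misreading of "वन".
dictionary = {"राम", "वन", "गए"}
tokens = ["राम", "वम", "गए"]
suspect_positions = flag_errors(tokens, dictionary)

# Hypothetical context-based suggestions for the flagged position.
suggestions = ["घर", "वन", "वहाँ"]
for pos in suspect_positions:
    tokens[pos] = pick_correction(tokens[pos], suggestions)
```

In this sketch the context model only proposes candidates and edit distance makes the final choice. Other ways of combining the two signals, such as a weighted score, are equally plausible readings of the abstract.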
Related papers
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Automatic Real-word Error Correction in Persian Text [0.0]
This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text.
We employ semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy.
Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase.
arXiv Detail & Related papers (2024-07-20T07:50:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency [8.888638284299736]
We create a lattice of plausible respellings of the reference transcription using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model.
Our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task.
arXiv Detail & Related papers (2023-06-07T15:39:02Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- Misspelling Correction with Pre-trained Contextual Language Model [0.0]
We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections.
The results of our experiments demonstrated that when combined properly, contextual word embeddings of BERT and edit distance are capable of effectively correcting spelling errors.
arXiv Detail & Related papers (2021-01-08T20:11:01Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern [0.0]
We present an algorithm for automatic misspelled Bengali word generation from correct word.
As part of our analysis, we have formed a list of most commonly used Bengali words.
arXiv Detail & Related papers (2020-03-07T01:52:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.