Related papers: Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance

URL: http://arxiv.org/abs/2012.07652v1
Date: Mon, 14 Dec 2020 15:49:54 GMT
Title: Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance
Authors: Aditya Pal, Abhijit Mustafi
Abstract summary: Vartani Spellcheck is a context-sensitive approach for spelling correction of Hindi text. With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi.
Score: 3.0422254248414276
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional Optical Character Recognition (OCR) systems that generate text of highly inflectional Indic languages like Hindi tend to suffer from poor accuracy due to a wide alphabet set, compound characters and difficulty in segmenting characters in a word. Automatic spelling error detection and context-sensitive error correction can be used to improve accuracy by post-processing the text generated by these OCR systems. A majority of previously developed language models for error correction of Hindi spelling have been context-free. In this paper, we present Vartani Spellcheck - a context-sensitive approach for spelling correction of Hindi text using a state-of-the-art transformer - BERT in conjunction with the Levenshtein distance algorithm, popularly known as Edit Distance. We use a lookup dictionary and context-based named entity recognition (NER) for detection of possible spelling errors in the text. Our proposed technique has been tested on a large corpus of text generated by the widely used Tesseract OCR on the Hindi epic Ramayana. With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi. We also explain how Vartani Spellcheck may be used for on-the-fly autocorrect suggestion during continuous typing in a text editor environment.
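The detection and correction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `entities` parameter, and the candidate list are all hypothetical. In particular, the candidate list stands in for context-ranked predictions that a masked language model such as BERT would supply, and the entity set stands in for the output of an NER step. Only the dictionary lookup and the Levenshtein ranking are shown.

```python
from typing import FrozenSet, List, Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def detect_errors(
    tokens: Sequence[str],
    dictionary: Set[str],
    entities: FrozenSet[str] = frozenset(),
) -> List[int]:
    """Return indices of tokens absent from the dictionary.

    `entities` is a hypothetical stand-in for named entities found by an
    NER step; such tokens are skipped rather than flagged as errors.
    """
    return [
        i for i, tok in enumerate(tokens)
        if tok not in dictionary and tok not in entities
    ]


def pick_correction(wrong: str, candidates: Sequence[str]) -> str:
    """Choose the candidate closest to `wrong` by edit distance.

    `candidates` is assumed to be ordered by contextual likelihood
    (for example, masked-token predictions). `min` keeps the earliest
    item on ties, so context rank breaks edit-distance ties.
    """
    if not candidates:
        return wrong
    return min(candidates, key=lambda c: levenshtein(wrong, c))


if __name__ == "__main__":
    tokens = ["राम", "वन", "गयो"]
    dictionary = {"राम", "वन", "गया"}
    for idx in detect_errors(tokens, dictionary):
        # Placeholder candidates; a real system would query a language model.
        print(tokens[idx], "->", pick_correction(tokens[idx], ["आया", "गया", "गये"]))
```

Measuring distance over code points is a simplification for Devanagari, where one visible character can span several code points; a fuller treatment might compare grapheme clusters instead.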

Related papers

Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [56.972851337263755]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z)
A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task. We introduce a novel approach based on error detector-corrector framework. Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs. We propose a more realistic setting in which only noisy text and its NER labels are available. We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
Automatic Real-word Error Correction in Persian Text [0.0]
This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text. We employ semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy. Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase.
arXiv Detail & Related papers (2024-07-20T07:50:52Z)
Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency [8.888638284299736]
We create a lattice of plausible respellings of the reference transcription using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model. Our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task.
arXiv Detail & Related papers (2023-06-07T15:39:02Z)
Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area. This paper presents a compelling approach for detecting typographical errors in Persian texts. The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
Misspelling Correction with Pre-trained Contextual Language Model [0.0]
We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections. The results of our experiments demonstrated that when combined properly, contextual word embeddings of BERT and edit distance are capable of effectively correcting spelling errors.
arXiv Detail & Related papers (2021-01-08T20:11:01Z)
NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English. It comprises ten different models, and benchmarks them on misspellings from multiple sources. We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern [0.0]
We present an algorithm for automatic misspelled Bengali word generation from correct word. As part of our analysis, we have formed a list of most commonly used Bengali words.
arXiv Detail & Related papers (2020-03-07T01:52:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.