DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework
for Spelling Error Correction of Bangla and Resource Scarce Indic Languages
- URL: http://arxiv.org/abs/2211.03730v1
- Date: Mon, 7 Nov 2022 17:59:05 GMT
- Title: DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework
for Spelling Error Correction of Bangla and Resource Scarce Indic Languages
- Authors: Mehedi Hasan Bijoy, Nahid Hossain, Salekul Islam, Swakkhar Shatabda
- Abstract summary: Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework based on denoising transformers that addresses these issues.
- Score: 1.7205106391379026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spelling error correction is the task of identifying and rectifying
misspelled words in texts. It is an active research topic in Natural Language
Processing because of its numerous applications in human language
understanding. Characters that are phonetically or visually similar yet
semantically distinct make the task arduous in any language. Earlier efforts on spelling
error correction in Bangla and resource-scarce Indic languages focused on
rule-based, statistical, and machine learning-based methods which we found
rather inefficient. In particular, machine learning-based approaches, which
exhibit superior performance to rule-based and statistical methods, are
ineffective as they correct each character regardless of its appropriateness.
In this work, we propose a novel detector-purificator-corrector framework based
on denoising transformers that addresses these issues. Moreover, we present a
method for large-scale corpus creation from scratch which in turn resolves the
resource limitation problem of any left-to-right scripted language. The
empirical results demonstrate the effectiveness of our approach, which
outperforms previous state-of-the-art methods by a significant margin for
Bangla spelling error correction. The models and corpus are publicly available
at https://tinyurl.com/DPCSpell.
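The detect-purify-correct decomposition can be pictured in miniature as follows. This is an illustrative sketch only: the toy lexicon, the `[MASK]` token, and the correction table are hypothetical stand-ins, whereas in DPCSpell each stage is a denoising transformer.

```python
# Toy stand-ins for the three learned stages (purely illustrative).
LEXICON = {"the", "quick", "brown", "fox"}
CORRECTIONS = {"quik": "quick", "teh": "the"}  # hypothetical corrector output

def detect(tokens):
    # Stage 1: flag tokens that look misspelled (1 = error, 0 = correct).
    return [0 if t in LEXICON else 1 for t in tokens]

def purify(tokens, flags):
    # Stage 2: re-examine the detector's output and mask only confirmed
    # errors, so the corrector never rewrites characters that are already right.
    return [t if f == 0 else "[MASK]" for t, f in zip(tokens, flags)]

def correct(tokens, masked):
    # Stage 3: fill each masked slot with a predicted correction.
    return [CORRECTIONS.get(t, t) if m == "[MASK]" else t
            for t, m in zip(tokens, masked)]

def dpc_pipeline(sentence):
    tokens = sentence.split()
    flags = detect(tokens)
    masked = purify(tokens, flags)
    return " ".join(correct(tokens, masked))

print(dpc_pipeline("teh quik brown fox"))  # -> "the quick brown fox"
```

The purificator stage is what distinguishes this design from a plain encoder-decoder corrector: it narrows the corrector's job to the masked positions, leaving correct characters untouched.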
Related papers
- Automatic Real-word Error Correction in Persian Text [0.0]
This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text.
We employ semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy.
Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase.
arXiv Detail & Related papers (2024-07-20T07:50:52Z)
- A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages [39.75847219395984]
We present a methodology for generative spelling correction (SC), which was tested on English and Russian languages.
We study the ways those errors can be emulated in correct sentences to effectively enrich generative models' pre-train procedure.
As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).
arXiv Detail & Related papers (2023-08-18T10:07:28Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- Correcting Real-Word Spelling Errors: A New Hybrid Approach [1.5469452301122175]
A new hybrid approach is proposed which relies on statistical and syntactic knowledge to detect and correct real-word errors.
The model can prove more practical than some other models, such as the WordNet-based method of Hirst and Budanitsky and the fixed-window-size method of Wilcox-O'Hearn and Hirst.
arXiv Detail & Related papers (2023-02-09T06:03:11Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Spelling Correction with Denoising Transformer [0.0]
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words.
At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans.
This procedure is used to train the production spelling correction model based on a transformer architecture.
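The idea of sampling human-like artificial typos can be sketched as below. The adjacency table and operation weights here are hypothetical examples, not the paper's learned error statistics.

```python
import random

# Tiny hypothetical keyboard-adjacency table; a real generator would cover
# the full layout and weight operations by observed human error frequencies.
ADJACENT = {"a": "qwsz", "e": "wrsd", "o": "ipkl", "t": "ryfg"}

def make_typo(word, rng):
    # Pick a position and an edit operation, with substitution of an
    # adjacent key weighted as the most common human error type.
    i = rng.randrange(len(word))
    op = rng.choices(["substitute", "delete", "transpose"],
                     weights=[0.6, 0.2, 0.2])[0]
    if op == "substitute" and word[i] in ADJACENT:
        return word[:i] + rng.choice(ADJACENT[word[i]]) + word[i + 1:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # fall back to the clean word if no edit applies

rng = random.Random(0)
noisy = [make_typo("note", rng) for _ in range(3)]
```

Pairs of clean words and their sampled typos then serve as training data for a sequence-to-sequence corrector.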
arXiv Detail & Related papers (2021-05-12T21:35:18Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
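The noising step can be pictured with a small sketch: clean sentences are corrupted by swapping in known misspellings so the model sees errors in context. The misspelling table below is illustrative only, not NeuSpell's actual lookup (which is reverse-engineered from real isolated misspellings).

```python
import random

# Hypothetical lookup of isolated misspellings per clean word.
MISSPELLINGS = {"receive": ["recieve", "receeve"], "which": ["wich", "whcih"]}

def noise_sentence(sentence, rng, p=0.5):
    # Replace each token that has known misspellings with probability p,
    # producing a (noisy, clean) training pair in context.
    out = []
    for tok in sentence.split():
        variants = MISSPELLINGS.get(tok.lower())
        if variants and rng.random() < p:
            out.append(rng.choice(variants))
        else:
            out.append(tok)
    return " ".join(out)

noisy = noise_sentence("receive which", random.Random(0), p=1.0)
```

Training on such pairs lets the corrector exploit surrounding context rather than treating each misspelled word in isolation.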
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models fitted to the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.