Correcting Arabic Soft Spelling Mistakes using BiLSTM-based Machine Learning
- URL: http://arxiv.org/abs/2108.01141v1
- Date: Mon, 2 Aug 2021 19:47:55 GMT
- Title: Correcting Arabic Soft Spelling Mistakes using BiLSTM-based Machine Learning
- Authors: Gheith A. Abandah, Ashraf Suyyagh, Mohammed Z. Khedher
- Abstract summary: Soft spelling errors are widespread among native Arabic speakers and foreign learners alike.
We develop, train, evaluate, and compare a set of BiLSTM networks to correct this class of errors.
The best model corrects 96.4% of the injected errors and achieves a low character error rate of 1.28% on a real test set of soft spelling mistakes.
- Score: 1.7205106391379026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Soft spelling errors are a class of spelling mistakes that is
widespread among native Arabic speakers and foreign learners alike. Some of
these errors are typographical in nature. They occur due to orthographic
variations of some Arabic letters and the complex rules that dictate their
correct usage. Many people forgo these rules, and because such letters sound
identical, they often confuse them. In this paper, we propose a
bidirectional long short-term memory network that corrects this class of
errors. We develop, train, evaluate, and compare a set of BiLSTM networks.
We approach the spelling correction problem at the character level and
handle Arabic texts from both classical and modern standard Arabic. We treat
the problem as a one-to-one sequence transcription problem. Since the class
of soft Arabic errors encompasses omission and addition mistakes, we propose
a simple, low-resource yet effective technique that preserves the one-to-one
sequencing and avoids a costly encoder-decoder architecture. We train the
BiLSTM models to correct the spelling mistakes using transformed input and
stochastic error injection approaches. We recommend a configuration that has
two BiLSTM layers, uses dropout regularization, and is trained using the
latter approach with an error injection rate of 40%. The best model corrects
96.4% of the injected errors and achieves a low character error rate of
1.28% on a real test set of soft spelling mistakes.
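To make the recommended configuration concrete, the following is a minimal
sketch in Python with TensorFlow/Keras of stochastic error injection and a
two-layer BiLSTM character transcriber with dropout. Only the two BiLSTM
layers, the use of dropout, and the 40% injection rate come from the
abstract; the confusion groups, embedding size, layer widths, and dropout
rate are illustrative assumptions, and the paper's specific trick for
keeping omission and addition errors one-to-one is not reproduced here (the
substitutions below are length-preserving by construction).

```python
# Hedged sketch: hyperparameters and confusion groups are assumptions,
# not values taken from the paper.
import random
import tensorflow as tf
from tensorflow.keras import layers, models

# Letter groups commonly confused in Arabic soft spelling errors:
# hamza/alef forms, ta marbuta vs. ha, and alef maqsura vs. ya.
CONFUSION_GROUPS = [list("اأإآ"), list("ةه"), list("ىي")]
CONFUSABLE = {c: g for g in CONFUSION_GROUPS for c in g}

def inject_soft_errors(text: str, rate: float = 0.4) -> str:
    """Stochastic error injection: replace each confusable character
    with another member of its group with probability `rate`."""
    out = []
    for ch in text:
        group = CONFUSABLE.get(ch)
        if group is not None and random.random() < rate:
            ch = random.choice([c for c in group if c != ch])
        out.append(ch)
    return "".join(out)

def build_model(vocab_size: int, seq_len: int,
                units: int = 256, dropout: float = 0.2) -> tf.keras.Model:
    """Two stacked BiLSTM layers with dropout, emitting one output
    character per input character (one-to-one transcription)."""
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, 64, mask_zero=True)(inp)
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    x = layers.Dropout(dropout)(x)
    out = layers.TimeDistributed(
        layers.Dense(vocab_size, activation="softmax"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```

Because the injected substitutions never change string length, each
corrupted string stays aligned character-for-character with its clean
source, so the model trains with per-character targets; true omission and
addition errors require the alignment technique the abstract alludes to.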
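The headline metric, character error rate (CER), is the Levenshtein edit
distance between the corrected output and the reference, divided by the
reference length. A self-contained sketch (the function names here are
mine, not the paper's):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```

At the reported 1.28% CER, the model leaves roughly one erroneous character
per 78 characters of reference text.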
Related papers
- Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction [0.32885740436059047]
This study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT.
ChatGPT is used as a data augmentation tool to produce pairs in which an Arabic sentence containing grammatical errors is matched with an error-free sentence extracted from Arabic books.
Our corpus contained 49 of errors, including seven types: orthography, syntax, semantics, punctuation, morphology, and split.
arXiv Detail & Related papers (2024-11-07T10:17:40Z)
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance [1.7000578646860536]
Spelling mistakes, among the most prevalent writing errors, are frequently encountered due to various factors.
This research aims to identify and rectify diverse spelling errors in text using neural networks.
arXiv Detail & Related papers (2024-07-24T16:07:11Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora [0.0]
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text.
We show that a byte-level model enables higher correction quality than a subword approach.
arXiv Detail & Related papers (2023-05-29T06:35:40Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition [116.31926128970585]
We propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.
Compared with implicit error detection with CTC loss, SoftCorrect provides explicit signal about which words are incorrect.
Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively.
arXiv Detail & Related papers (2022-12-02T09:11:32Z)
- Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- Tokenization Repair in the Presence of Spelling Errors [0.2964978357715083]
Spelling errors may be present, but correcting them is not part of the problem.
We identify three key ingredients of high-quality tokenization repair.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)