AraSpell: A Deep Learning Approach for Arabic Spelling Correction
- URL: http://arxiv.org/abs/2405.06981v1
- Date: Sat, 11 May 2024 10:36:28 GMT
- Title: AraSpell: A Deep Learning Approach for Arabic Spelling Correction
- Authors: Mahmoud Salhab, Faisal Abu-Khzam
- Abstract summary: "AraSpell" is a framework for Arabic spelling correction using different seq2seq model architectures.
It was trained on more than 6.9 Million Arabic sentences.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spelling correction is the task of identifying spelling mistakes, typos, and grammatical mistakes in a given text and correcting them according to their context and grammatical structure. This work introduces "AraSpell," a framework for Arabic spelling correction using different seq2seq model architectures such as Recurrent Neural Network (RNN) and Transformer with artificial data generation for error injection, trained on more than 6.9 Million Arabic sentences. Thorough experimental studies provide empirical evidence of the effectiveness of the proposed approach, which achieved 4.8% and 1.11% word error rate (WER) and character error rate (CER), respectively, in comparison with labeled data of 29.72% WER and 5.03% CER. Our approach achieved 2.9% CER and 10.65% WER in comparison with labeled data of 10.02% CER and 50.94% WER. Both of these results are obtained on a test set of 100K sentences.
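The abstract reports results as word error rate (WER) and character error rate (CER). As a rough illustration of how such metrics are commonly computed, and not the authors' evaluation code, the sketch below divides the Levenshtein edit distance between a corrected output and its reference by the reference length, counted in words for WER and in characters for CER. The function names `edit_distance`, `wer`, and `cer` and the example sentences are assumptions made for this illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the hypothesis drops the hamza in the third word.
reference = "ذهب الولد إلى المدرسة"
hypothesis = "ذهب الولد الى المدرسة"
print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

A single wrong character makes the whole word count as an error, so WER on the same text is typically much higher than CER, which matches the pattern in the figures quoted above.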
Related papers
- Grammatical Error Correction for Low-Resource Languages: The Case of Zarma [8.057796934109938]
Grammatical error correction (GEC) is important for improving written materials for low-resource languages like Zarma.
This study compares rule-based methods, machine translation (MT) models, and large language models (LLMs) for GEC in Zarma.
arXiv Detail & Related papers (2024-10-20T23:51:36Z)
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Automatic Real-word Error Correction in Persian Text [0.0]
This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text.
We employ semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy.
Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase.
arXiv Detail & Related papers (2024-07-20T07:50:52Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context [10.047123247001714]
We evaluate the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger).
With an automatic annotation toolkit, ERRANT, we first evaluated SeqTagger's performance on error correction with human expert correction as the benchmark.
Results indicated a precision of 63.66% and a recall of 20.19% for error correction in the full dataset.
arXiv Detail & Related papers (2024-02-28T06:43:43Z)
- GEE! Grammar Error Explanation with Large Language Models [64.16199533560017]
We propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences.
We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for 60.2% of the errors using one-shot prompting.
We develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction.
arXiv Detail & Related papers (2023-11-16T02:45:47Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- Spell Correction for Azerbaijani Language using Deep Neural Networks [0.0]
In this paper, a sequence-to-sequence model with an attention mechanism is used to develop spelling correction for Azerbaijani.
A total of 12000 wrong and correct sentence pairs were used for training, and the model was tested on 1000 real-world misspelled words.
arXiv Detail & Related papers (2021-02-05T15:02:35Z)
- Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
arXiv Detail & Related papers (2020-10-07T08:29:11Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)