NeuSpell: A Neural Spelling Correction Toolkit
- URL: http://arxiv.org/abs/2010.11085v1
- Date: Wed, 21 Oct 2020 15:53:29 GMT
- Title: NeuSpell: A Neural Spelling Correction Toolkit
- Authors: Sai Muralidhar Jayanthi, Danish Pruthi, Graham Neubig
- Abstract summary: NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
- Score: 88.79419580807519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce NeuSpell, an open-source toolkit for spelling correction in
English. Our toolkit comprises ten different models, and benchmarks them on
naturally occurring misspellings from multiple sources. We find that many
systems do not adequately leverage the context around the misspelt token. To
remedy this, (i) we train neural models using spelling errors in context,
synthetically constructed by reverse engineering isolated misspellings; and
(ii) use contextual representations. By training on our synthetic examples,
correction rates improve by 9% (absolute) compared to the case when models are
trained on randomly sampled character perturbations. Using richer contextual
representations boosts the correction rate by another 3%. Our toolkit enables
practitioners to use our proposed and existing spelling correction systems,
both via a unified command line, as well as a web interface. Among many
potential applications, we demonstrate the utility of our spell-checkers in
combating adversarial misspellings. The toolkit can be accessed at
neuspell.github.io. Code and pretrained models are available at
http://github.com/neuspell/neuspell.
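Besides the command line and web interface mentioned above, the toolkit can be driven from Python. The following is a minimal usage sketch, assuming the interface documented in the NeuSpell repository; class and method names may differ across versions, and the noisy inputs are only illustrative:

    from neuspell import BertChecker

    checker = BertChecker()      # one of the toolkit's ten checkers
    checker.from_pretrained()    # load pretrained weights

    # Correct a single noisy sentence, using the context around each misspelt token.
    print(checker.correct("I luk foward to receving your reply"))

    # Correct a batch of strings in one call.
    print(checker.correct_strings(["sucess is not fianl", "fialure is not fatal"]))

The synthetic training data described in point (i) can be pictured as replacing words in clean sentences with misspellings observed in isolation, so that the model sees errors in full sentence context. The sketch below is a hypothetical illustration of that idea, not the paper's exact procedure; the lookup table and replacement probability are invented for the example:

    import random

    # Hypothetical lookup built from corpora of isolated misspellings.
    MISSPELLINGS = {
        "receiving": ["receving", "recieving"],
        "forward": ["foward", "forwrd"],
    }

    def noise_sentence(tokens, prob=0.3, rng=random.Random(0)):
        """Swap some tokens for observed misspellings to create a (noisy, clean)
        training pair that keeps the surrounding sentence context."""
        noisy = []
        for tok in tokens:
            variants = MISSPELLINGS.get(tok.lower())
            if variants and rng.random() < prob:
                noisy.append(rng.choice(variants))
            else:
                noisy.append(tok)
        return noisy

    clean = "I look forward to receiving your reply".split()
    print(" ".join(noise_sentence(clean)), "->", " ".join(clean))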
Related papers
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages [39.75847219395984]
We present a methodology for generative spelling correction (SC), tested on English and Russian.
We study how natural spelling errors can be emulated in correct sentences to effectively enrich the pre-training procedure of generative models.
As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).
arXiv Detail & Related papers (2023-08-18T10:07:28Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Correcting Real-Word Spelling Errors: A New Hybrid Approach [1.5469452301122175]
A new hybrid approach is proposed which relies on statistical and syntactic knowledge to detect and correct real-word errors.
The model can prove more practical than some other models, such as the WordNet-based method of Hirst and Budanitsky and the fixed-window-size method of Wilcox-O'Hearn and Hirst.
arXiv Detail & Related papers (2023-02-09T06:03:11Z)
- An Error-Guided Correction Model for Chinese Spelling Error Correction [13.56600372085612]
We propose an error-guided correction model (EGCM) to improve Chinese spelling correction.
Our model outperforms state-of-the-art approaches by a remarkable margin.
arXiv Detail & Related papers (2023-01-16T09:27:45Z)
- Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
- Spelling Correction with Denoising Transformer [0.0]
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words.
At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans.
This procedure is used to train the production spelling correction model based on a transformer architecture.
arXiv Detail & Related papers (2021-05-12T21:35:18Z)
- Misspelling Correction with Pre-trained Contextual Language Model [0.0]
We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections.
The results of our experiments demonstrate that, when combined properly, contextual word embeddings from BERT and edit distance can effectively correct spelling errors (a rough illustrative sketch of such a combination appears after this list).
arXiv Detail & Related papers (2021-01-08T20:11:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
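In the spirit of the BERT-plus-edit-distance entry above, the following rough sketch scores candidate corrections for a masked position with a masked language model and re-ranks them by surface similarity to the observed misspelling (difflib's SequenceMatcher stands in for edit distance). The model name, scoring formula, and example are assumptions for illustration, not the exact method of any paper listed here:

    from difflib import SequenceMatcher
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    def rank_candidates(masked_sentence, misspelling, top_k=20):
        """Combine MLM probability with string similarity to the misspelling."""
        scored = []
        for pred in fill_mask(masked_sentence, top_k=top_k):
            candidate = pred["token_str"].strip()
            similarity = SequenceMatcher(None, candidate, misspelling).ratio()
            scored.append((pred["score"] * similarity, candidate))
        return sorted(scored, reverse=True)

    print(rank_candidates("I look forward to [MASK] your reply.", "receving")[:3])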
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.