Neural spell-checker: Beyond words with synthetic data generation
- URL: http://arxiv.org/abs/2410.23514v1
- Date: Wed, 30 Oct 2024 23:51:01 GMT
- Title: Neural spell-checker: Beyond words with synthetic data generation
- Authors: Matej Klemen, Martin Božič, Špela Arhar Holdt, Marko Robnik-Šikonja
- Abstract summary: Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts.
Recent improvements in deep learning have opened new opportunities to improve traditional spell-checkers with new functionalities.
We present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts. Recent improvements in deep learning, and in particular in large language models, have opened new opportunities to improve traditional spell-checkers with new functionalities that not only assess spelling correctness but also the suitability of a word for a given context. In our work, we present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets. The first spell-checker is a traditional, fast, word-based approach, based on a morphological lexicon with a significantly larger word list compared to existing spell-checkers. The second approach uses a language model trained on a large corpus with synthetically inserted errors. We present the training data construction strategies, which turn out to be a crucial component of neural spell-checkers. Further, the proposed neural model significantly outperforms all existing spell-checkers for Slovene in both precision and recall.
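The neural approach hinges on a corpus with synthetically inserted errors. As a rough illustration only (the authors' Slovene-specific construction strategies, such as error statistics drawn from learner corpora, are not reproduced here), character-level corruption for training data might be sketched as:

```python
import random

# Generic character-level edit operations used to synthesize spelling errors.
# The alphabet and uniform edit choice are simplifying assumptions.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def corrupt_word(word, rng):
    """Apply one random character-level edit to a word."""
    if len(word) < 2:
        return word
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    # transpose: swap with the following character
    i = min(i, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt_sentence(tokens, error_rate, seed=0):
    """Corrupt each token with probability error_rate.

    Returns (noisy_tokens, labels) where label 1 marks a token that
    was actually changed -- the supervision signal for error detection."""
    rng = random.Random(seed)
    noisy, labels = [], []
    for tok in tokens:
        if rng.random() < error_rate:
            bad = corrupt_word(tok, rng)
            noisy.append(bad)
            labels.append(int(bad != tok))
        else:
            noisy.append(tok)
            labels.append(0)
    return noisy, labels
```

A language model can then be fine-tuned on such (noisy, labels) pairs as a token-classification task; the abstract stresses that the corruption strategy itself is the crucial design choice.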
Related papers
- Morphological evaluation of subwords vocabulary used by BETO language model [0.1638581561083717]
Subword tokenization algorithms are more efficient and can independently build the necessary vocabulary of words and subwords without human intervention.
In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language.
By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality.
This evaluation also helps clarify which algorithm the tokenizer actually uses, namely Wordpiece, given inconsistencies in the authors' claims.
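The overlap-based evaluation described above can be approximated as a set intersection between the tokenizer vocabulary and a morpheme inventory. This is a simplified proxy, not the paper's exact metric; the subword-marker stripping and the precision/recall framing are assumptions:

```python
def morphological_overlap(vocab, morphemes):
    """Fraction of vocabulary entries that are exact morphemes (precision)
    and fraction of morphemes present in the vocabulary (recall)."""
    # Strip common subword markers: WordPiece '##' and SentencePiece '\u2581'.
    cleaned = {tok.lstrip("#").lstrip("\u2581") for tok in vocab}
    morphs = set(morphemes)
    hits = cleaned & morphs
    vocab_precision = len(hits) / len(cleaned) if cleaned else 0.0
    morph_recall = len(hits) / len(morphs) if morphs else 0.0
    return vocab_precision, morph_recall
```

A "very low morphological quality" finding would correspond to low precision here: most subwords in the vocabulary do not coincide with any morpheme of the language.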
arXiv Detail & Related papers (2024-10-03T08:07:14Z) - SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z) - Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof [9.79241237464453]
This paper presents a spell checker and correction tool specifically designed for Wolof, an under-represented spoken language in Africa.
The proposed spell checker leverages a combination of a trie data structure, dynamic programming, and the weighted Levenshtein distance to generate suggestions for misspelled words.
Despite the limited data available for Wolof, the spell checker's performance showed a predictive accuracy of 98.31% and a suggestion accuracy of 93.33%.
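The weighted Levenshtein distance at the core of the approach above can be sketched with a standard dynamic-programming recurrence. The uniform default weights are illustrative; the actual tool combines this with a trie over the lexicon and language-specific substitution weights, which are not shown:

```python
def weighted_levenshtein(a, b, w_ins=1.0, w_del=1.0, sub_cost=None):
    """Edit distance with per-operation weights.

    sub_cost(x, y) can encode confusion-pair weights, e.g. cheap
    substitutions for characters on adjacent keyboard keys."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    m, n = len(a), len(b)
    # prev holds the previous DP row; O(n) memory instead of O(m*n).
    prev = [j * w_ins for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * w_del] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + w_del,                             # delete a[i-1]
                cur[j - 1] + w_ins,                          # insert b[j-1]
                prev[j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute
            )
        prev = cur
    return prev[n]
```

In a suggestion pipeline, walking a trie of valid words while pruning branches whose partial distance already exceeds a threshold keeps candidate generation fast even for large lexicons.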
arXiv Detail & Related papers (2023-05-22T04:03:20Z) - Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z) - Correcting Real-Word Spelling Errors: A New Hybrid Approach [1.5469452301122175]
A new hybrid approach is proposed which relies on statistical and syntactic knowledge to detect and correct real-word errors.
The model proves more practical than some other models, such as the WordNet-based method of Hirst and Budanitsky and the fixed-window-size method of Wilcox-O'Hearn and Hirst.
arXiv Detail & Related papers (2023-02-09T06:03:11Z) - A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z) - Augmenting semantic lexicons using word embeddings and transfer learning [1.101002667958165]
We propose two models for predicting sentiment scores to augment semantic lexicons at a relatively low cost using word embeddings and transfer learning.
Our evaluation shows both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.
arXiv Detail & Related papers (2021-09-18T20:59:52Z) - NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed afterwards) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.