A Methodology for Generative Spelling Correction via Natural Spelling
Errors Emulation across Multiple Domains and Languages
- URL: http://arxiv.org/abs/2308.09435v2
- Date: Wed, 13 Sep 2023 15:22:29 GMT
- Title: A Methodology for Generative Spelling Correction via Natural Spelling
Errors Emulation across Multiple Domains and Languages
- Authors: Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina
Kolomeytseva, Aleksandr Abramov, Alena Fenogenova
- Abstract summary: We present a methodology for generative spelling correction (SC), which was tested on English and Russian.
We study how natural spelling errors can be emulated in correct sentences to effectively enrich the pre-training procedure of generative models.
As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).
- Score: 39.75847219395984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models demonstrate impressive capabilities in text
generation and generalization. However, they often struggle with solving text
editing tasks, particularly when it comes to correcting spelling errors and
mistypings. In this paper, we present a methodology for generative spelling
correction (SC), which was tested on English and Russian and can potentially
be extended to any language with minor changes. Our research mainly focuses
on exploring natural spelling errors and mistypings in texts and on studying
how those errors can be emulated in correct sentences to effectively enrich
the pre-training procedure of generative models. We investigate the impact of
such emulations and the models' abilities across different text domains. In
this work, we investigate two spelling corruption techniques: 1) the first
mimics human behavior when making a mistake by leveraging error statistics
from a particular dataset, and 2) the second injects the most common spelling
errors, keyboard miss-clicks, and some heuristics into the texts. We
conducted experiments with various corruption strategies, model architectures,
and sizes at the pre-training and fine-tuning stages, and evaluated the
models on single-domain and multi-domain test sets. As a practical outcome of
our work, we introduce SAGE (Spell checking via Augmentation and Generative
distribution Emulation), a library for automatic generative SC that includes
a family of pre-trained generative models and built-in augmentation
algorithms.
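To make the second corruption technique concrete, the sketch below emulates keyboard miss-clicks and other simple typo patterns in clean sentences, producing (noisy, clean) pairs that could enrich pre-training. It is a minimal illustration, not SAGE's actual implementation: the adjacency map, corruption probability, and function names are assumptions made for this example.

```python
import random

# Hypothetical, simplified QWERTY-adjacency map; a real augmenter would
# cover the full keyboard layout of each supported language.
ADJACENT_KEYS = {
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "s": "awedxz", "t": "rfgy", "n": "bhjm", "r": "edft",
}


def corrupt_word(word: str, p_typo: float = 0.1) -> str:
    """With probability p_typo, apply one simple corruption to the word:
    swap two adjacent characters, replace a character with a keyboard
    neighbour, drop a character, or duplicate one."""
    if len(word) < 2 or random.random() > p_typo:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["swap", "neighbour", "drop", "double"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "neighbour":
        if word[i] in ADJACENT_KEYS:
            return word[:i] + random.choice(ADJACENT_KEYS[word[i]]) + word[i + 1:]
        return word  # no neighbour listed for this character; leave it intact
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # "double": repeat character i


def corrupt_sentence(sentence: str, p_typo: float = 0.1) -> str:
    """Emulate natural typos in a clean sentence; the (corrupted, clean)
    pairs can then be used to enrich pre-training of a generative
    spelling-correction model."""
    return " ".join(corrupt_word(w, p_typo) for w in sentence.split())


if __name__ == "__main__":
    random.seed(13)
    clean = "modern language models often struggle with spelling correction"
    print(corrupt_sentence(clean, p_typo=0.3))
```

The first, statistics-based strategy would instead sample substitutions from error frequencies mined from parallel noisy/corrected corpora; per the abstract, SAGE bundles such augmentation algorithms alongside pre-trained correction models.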
Related papers
- EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities.
We propose two data augmentation methods to address these limitations.
Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
arXiv Detail & Related papers (2024-09-08T14:29:10Z) - A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance [1.7000578646860536]
Spelling mistakes, among the most prevalent writing errors, are frequently encountered due to various factors.
This research aims to identify and rectify diverse spelling errors in text using neural networks.
arXiv Detail & Related papers (2024-07-24T16:07:11Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Bridging the Gap Between Training and Inference of Bayesian Controllable
Language Models [58.990214815032495]
Large-scale pre-trained language models have achieved great success on natural language generation tasks.
BCLMs (Bayesian Controllable Language Models) have been shown to be efficient in controllable language generation.
We propose a "Gemini Discriminator" for controllable language generation which alleviates the mismatch problem with a small computational cost.
arXiv Detail & Related papers (2022-06-11T12:52:32Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - Spelling Correction with Denoising Transformer [0.0]
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words.
At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans.
This procedure is used to train the production spelling correction model based on a transformer architecture.
arXiv Detail & Related papers (2021-05-12T21:35:18Z) - Neural Text Generation with Artificial Negative Examples [7.187858820534111]
We propose to suppress arbitrary types of errors by training the text generation model in a reinforcement learning framework.
We use a trainable reward function that is capable of discriminating between references and sentences containing the targeted type of errors.
The experimental results show that our method can suppress the generation errors and achieve significant improvements on two machine translation and two image captioning tasks.
arXiv Detail & Related papers (2020-12-28T07:25:10Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)