Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation
- URL: http://arxiv.org/abs/2005.01158v1
- Date: Sun, 3 May 2020 18:08:17 GMT
- Title: Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation
- Authors: Kshitij Shah, Gerard de Melo
- Abstract summary: We first draw on a small set of annotated data to compute spelling error statistics.
These are then invoked to introduce errors into substantially larger corpora.
We use it to create a set of English language error detection and correction datasets.
- Score: 38.10429793534442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the artificial generation of typographical errors
based on real-world statistics. We first draw on a small set of annotated data
to compute spelling error statistics. These are then invoked to introduce
errors into substantially larger corpora. The generation methodology allows us
to generate particularly challenging errors that require context-aware error
detection. We use it to create a set of English language error detection and
correction datasets. Finally, we examine the effectiveness of machine learning
models for detecting and correcting errors based on this data. The datasets are
available at http://typo.nlproc.org
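The two-stage recipe the abstract describes (mine character-level edit statistics from a small annotated sample of typos, then replay them over a larger clean corpus) can be sketched roughly as follows. The pair list, function names, and error rate below are illustrative assumptions, not the authors' released code:

```python
# Minimal sketch of statistics-driven typo injection, assuming a small
# annotated list of (correct, typo) word pairs; illustrative only.
import random
from collections import Counter
from difflib import SequenceMatcher

def edit_stats(pairs):
    """Count character-level edit operations observed in (correct, typo) pairs."""
    ops = Counter()
    for correct, typo in pairs:
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, correct, typo).get_opcodes():
            if tag != "equal":
                ops[(tag, correct[i1:i2], typo[j1:j2])] += 1
    return ops

def inject_errors(sentence, ops, rate=0.1):
    """Corrupt each word with probability `rate`, sampling edits by frequency."""
    edits = list(ops.items())
    weights = [count for _, count in edits]
    out = []
    for word in sentence.split():
        if edits and random.random() < rate:
            (tag, src, dst), _ = random.choices(edits, weights=weights)[0]
            if tag in ("replace", "delete") and src in word:
                word = word.replace(src, dst, 1)  # dst is "" for deletions
            elif tag == "insert":
                pos = random.randrange(len(word) + 1)
                word = word[:pos] + dst + word[pos:]
        out.append(word)
    return " ".join(out)

pairs = [("from", "form"), ("receive", "recieve"), ("their", "thier")]
stats = edit_stats(pairs)
print(inject_errors("We receive mail from their office", stats, rate=0.3))
```

Sampling edits in proportion to their observed frequency is what keeps the generated errors realistic, including real-word slips such as from/form that only context-aware detection can catch.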
Related papers
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
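The two detection operating points described in that summary (one high-precision, one high-recall) can be mimicked with a single token scorer and two thresholds. Everything below (the scorer, thresholds, and correction dictionary) is a placeholder assumption, not the paper's model:

```python
# Toy detect-then-correct pipeline with two detection operating points.
CORRECTIONS = {"recieve": "receive", "thier": "their"}  # stand-in corrector

def error_score(token):
    """Placeholder detector; a real model would output a learned probability."""
    return 1.0 if token in CORRECTIONS else 0.3 if len(token) > 9 else 0.0

def detect(tokens, threshold):
    return {i for i, t in enumerate(tokens) if error_score(t) >= threshold}

def correct(sentence):
    tokens = sentence.split()
    precise = detect(tokens, threshold=0.9)  # high precision: few false alarms
    broad = detect(tokens, threshold=0.2)    # high recall: misses very little
    for i in sorted(broad):
        candidate = CORRECTIONS.get(tokens[i])
        if candidate is None:
            continue
        # Trust high-precision flags outright; a recall-only flag would go
        # through a stronger verification step before being rewritten.
        if i in precise:
            tokens[i] = candidate
    return " ".join(tokens)

print(correct("I recieve letters from thier office"))
```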
- Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [47.753284211200665]
We focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage.
This data consists of erroneous solution steps immediately followed by their corrections.
We show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy.
arXiv Detail & Related papers (2024-08-29T06:49:20Z)
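A minimal sketch of the "error-correction" pretraining text that summary describes: an erroneous solution step immediately followed by its correction. The [WRONG]/[FIX] markers and the toy arithmetic problems are assumptions for illustration, not the paper's actual data format:

```python
# Generate pretraining examples where a wrong step is immediately retracted.
import random

def make_example(a, b, error_rate=0.5):
    """Render one addition problem, optionally with a retracted wrong step."""
    lines = [f"Question: what is {a} + {b}?"]
    if random.random() < error_rate:
        wrong = a + b + random.choice([-2, -1, 1, 2])  # a plausible slip
        lines.append(f"[WRONG] {a} + {b} = {wrong}")
        lines.append(f"[FIX] {a} + {b} = {a + b}")
    else:
        lines.append(f"{a} + {b} = {a + b}")
    return "\n".join(lines)

random.seed(0)
print("\n\n".join(make_example(random.randint(1, 99), random.randint(1, 99))
                  for _ in range(3)))
```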
- Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context [10.047123247001714]
We evaluate the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger).
Using the automatic annotation toolkit ERRANT, we first evaluate SeqTagger's performance on error correction, with human expert corrections as the benchmark.
Results indicated a precision of 63.66% and a recall of 20.19% for error correction in the full dataset.
arXiv Detail & Related papers (2024-02-28T06:43:43Z)
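For reference, the precision and recall figures quoted in that evaluation follow the standard edit-level definitions. The counts in this snippet are invented for illustration, not the study's data:

```python
# Edit-level precision and recall as used in GEC evaluations.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0  # proposed edits that were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # reference edits that were found
    return precision, recall

p, r = precision_recall(tp=60, fp=40, fn=240)
print(f"precision={p:.2%} recall={r:.2%}")  # precision=60.00% recall=20.00%
```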
- Parameter-tuning-free data entry error unlearning with adaptive selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD) on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second language learners, and produces a wider range of non-native style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z)
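A rough sketch of that fragment-level idea: translate a random sentence fragment and splice the output back in, so the result carries non-native-style constructions. Here `translate` is a stub standing in for a real MT round trip (an assumption; the paper's actual pipeline may differ):

```python
# Corrupt a sentence by machine-translating one of its fragments.
import random

def translate(fragment):
    """Placeholder for an MT round trip (e.g. en -> pivot -> en)."""
    canned = {
        "has been living": "is living",
        "for three years": "since three years",
    }
    return canned.get(fragment, fragment)

def corrupt(sentence, max_len=3):
    tokens = sentence.split()
    start = random.randrange(len(tokens))
    end = min(len(tokens), start + random.randint(1, max_len))
    fragment = " ".join(tokens[start:end])
    return " ".join(tokens[:start] + translate(fragment).split() + tokens[end:])

random.seed(4)
print(corrupt("She has been living here for three years"))
```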
- Deep Neural Network: An Efficient and Optimized Machine Learning Paradigm for Reducing Genome Sequencing Error [27.84400682210533]
Most of the platforms used in the sequencing process are known to produce significant errors.
Of the two main types of genome errors, substitutions and indels, our work focuses on correcting indels.
A deep learning approach was used to correct sequencing errors in the chosen dataset.
arXiv Detail & Related papers (2020-10-06T08:16:35Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.