Correcting the Autocorrect: Context-Aware Typographical Error Correction
via Training Data Augmentation
- URL: http://arxiv.org/abs/2005.01158v1
- Date: Sun, 3 May 2020 18:08:17 GMT
- Title: Correcting the Autocorrect: Context-Aware Typographical Error Correction
via Training Data Augmentation
- Authors: Kshitij Shah, Gerard de Melo
- Abstract summary: We first draw on a small set of annotated data to compute spelling error statistics.
These are then invoked to introduce errors into substantially larger corpora.
We use it to create a set of English language error detection and correction datasets.
- Score: 38.10429793534442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the artificial generation of typographical errors
based on real-world statistics. We first draw on a small set of annotated data
to compute spelling error statistics. These are then invoked to introduce
errors into substantially larger corpora. The generation methodology allows us
to generate particularly challenging errors that require context-aware error
detection. We use it to create a set of English language error detection and
correction datasets. Finally, we examine the effectiveness of machine learning
models for detecting and correcting errors based on this data. The datasets are
available at http://typo.nlproc.org
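The abstract's pipeline — estimate error statistics from a small annotated set, then use them to inject typos into a larger clean corpus — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the operation probabilities below are invented placeholders standing in for statistics one would actually estimate from annotated data, and the character-level edit operations are a common simplification of real typo models.

```python
import random

# Hypothetical per-operation probabilities, as might be estimated from a
# small annotated corpus. These numbers are illustrative only.
ERROR_STATS = {
    "substitute": 0.40,
    "delete": 0.25,
    "insert": 0.20,
    "transpose": 0.15,
}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def inject_typo(word, stats, rng=random):
    """Apply one edit operation to `word`, with the operation sampled
    according to the estimated probabilities in `stats`."""
    if len(word) < 2:
        return word  # too short to corrupt meaningfully
    ops, weights = zip(*stats.items())
    op = rng.choices(ops, weights=weights, k=1)[0]
    i = rng.randrange(len(word) - 1)  # edit position
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # transpose: swap adjacent characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Applied word-by-word over a large clean corpus, this yields paired (corrupted, clean) training data for error detection and correction models, which is the role the generated datasets play in the paper.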
Related papers
- Assessing the Efficacy of Grammar Error Correction: A Human Evaluation
Approach in the Japanese Context [10.047123247001714]
We evaluate the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger).
With an automatic annotation toolkit, ERRANT, we first evaluated SeqTagger's performance on error correction with human expert correction as the benchmark.
Results indicated a precision of 63.66% and a recall of 20.19% for error correction in the full dataset.
arXiv Detail & Related papers (2024-02-28T06:43:43Z) - Parameter-tuning-free data entry error unlearning with adaptive
selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD) on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z) - Towards Fine-Grained Information: Identifying the Type and Location of
Translation Errors [80.22825549235556]
Existing approaches cannot simultaneously consider error position and type.
We build an FG-TED model to predict both addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z) - Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese
Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
It builds on the observation that most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z) - Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second language learners, and produces a wider range of non-native style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z) - Defuse: Harnessing Unrestricted Adversarial Examples for Debugging
Models Beyond Test Accuracy [11.265020351747916]
Defuse is a method to automatically discover and correct model errors beyond those available in test data.
We propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model.
Defuse corrects the error after fine-tuning while maintaining generalization on the test set.
arXiv Detail & Related papers (2021-02-11T18:08:42Z) - Deep Neural Network: An Efficient and Optimized Machine Learning
Paradigm for Reducing Genome Sequencing Error [27.84400682210533]
Most of the platforms used in the sequencing process are known to produce significant errors.
Of the two main types of genome errors, substitutions and indels, our work focuses on correcting indels.
A deep learning approach was used to correct the errors in sequencing the chosen dataset.
arXiv Detail & Related papers (2020-10-06T08:16:35Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z) - Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from a pre-trained language model to reduce annotation effort and benefit multilingual scenarios.
Results show strong potential of Bidirectional Representations from Transformers (BERT) in grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.