Byte-Level Grammatical Error Correction Using Synthetic and Curated
Corpora
- URL: http://arxiv.org/abs/2305.17906v1
- Date: Mon, 29 May 2023 06:35:40 GMT
- Title: Byte-Level Grammatical Error Correction Using Synthetic and Curated
Corpora
- Authors: Svanhvít Lilja Ingólfsdóttir, Pétur Orri Ragnarsson, Haukur Páll Jónsson, Haukur Barri Símonarson, Vilhjálmur Þorsteinsson, Vésteinn Snæbjarnarson
- Abstract summary: Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text.
We show that a byte-level model enables higher correction quality than a subword approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grammatical error correction (GEC) is the task of correcting typos, spelling,
punctuation and grammatical issues in text. Approaching the problem as a
sequence-to-sequence task, we compare the use of a common subword unit
vocabulary and byte-level encoding. Initial synthetic training data is created
using an error-generating pipeline, and used for finetuning two subword-level
models and one byte-level model. Models are then finetuned further on
hand-corrected error corpora, including texts written by children, university
students, dyslexic and second-language writers, and evaluated over different
error types and origins. We show that a byte-level model enables higher
correction quality than a subword approach, not only for simple spelling
errors, but also for more complex semantic, stylistic and grammatical issues.
In particular, initial training on synthetic corpora followed by finetuning on
a relatively small parallel corpus of real-world errors helps the byte-level
model correct a wide range of commonly occurring errors. Our experiments are
run for the Icelandic language but should hold for other similar languages,
particularly morphologically rich ones.
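To make the subword-versus-byte contrast concrete, here is a minimal Python sketch (not the paper's code; the tiny subword vocabulary and the greedy segmenter are hypothetical stand-ins for a trained tokenizer) showing how one typo fragments a subword segmentation while the byte-level view keeps a fixed, small alphabet:

```python
# Hypothetical subword vocabulary, for illustration only.
SUBWORD_VOCAB = {"ís", "lensk", "a", "u", "r"}

def greedy_subword_split(word: str, vocab: set) -> list:
    """Greedy longest-match segmentation; unknown spans fall back to characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # out-of-vocabulary character
            i += 1
    return pieces

def byte_encode(text: str) -> list:
    """Byte-level view: any string maps to UTF-8 bytes, so nothing is out of vocabulary."""
    return list(text.encode("utf-8"))

correct, typo = "íslenska", "íslneska"  # "Icelandic", with transposed letters
print(greedy_subword_split(correct, SUBWORD_VOCAB))  # ['ís', 'lensk', 'a']
print(greedy_subword_split(typo, SUBWORD_VOCAB))     # fragments around the typo
print(byte_encode(typo))                             # same small byte alphabet either way
```

A byte-level model therefore sees misspelled and correct words through the same stable input units, which is one plausible reason it copes better with noisy text.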
Related papers
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
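As a rough illustration of the contrast drawn here, the following Python sketch (assumed example formats, not the ReLM authors' code) builds one training example per paradigm: a per-character tag sequence versus a masked input the model rephrases in full:

```python
MASK = "[MASK]"

def tagging_example(src: str, tgt: str) -> list:
    """Character-to-character tagging: needs equal lengths, one label per character."""
    assert len(src) == len(tgt)
    return [(s, "KEEP" if s == t else f"SUB:{t}") for s, t in zip(src, tgt)]

def rephrasing_example(src: str, tgt: str) -> tuple:
    """Rephrasing: append mask slots and train the model to infill the corrected sentence."""
    return src + MASK * len(tgt), tgt

src, tgt = "我喜换跑步", "我喜欢跑步"  # "I like running"; 换 should be 欢
print(tagging_example(src, tgt))     # labels only the wrong character
print(rephrasing_example(src, tgt))  # regenerates the whole sentence
```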
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
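A small PyTorch sketch (illustrative only; the toy parse and the hard masking scheme are assumptions, not the SG-GEC implementation) of the underlying idea, restricting attention to dependency-tree neighbours:

```python
import torch

tokens = ["She", "go", "to", "school"]  # agreement error: "go" should be "goes"
heads = [1, -1, 3, 1]                   # hypothetical dependency heads (-1 = root)

n = len(tokens)
mask = torch.eye(n, dtype=torch.bool)   # every token attends to itself
for i, h in enumerate(heads):
    if h >= 0:
        mask[i, h] = mask[h, i] = True  # and to its head and children

scores = torch.randn(n, n)                         # stand-in attention scores
scores = scores.masked_fill(~mask, float("-inf"))  # block non-tree token pairs
attn = torch.softmax(scores, dim=-1)               # syntax-guided attention weights
print(attn)
```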
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
- Hierarchical Character Tagger for Short Text Spelling Error Correction [27.187562419222218]
We present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction.
We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space.
Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.
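A minimal Python sketch (the label scheme is an assumption, not HCTagger's actual label set) of why character-level edit tagging keeps the label space small: instead of predicting whole words, the model only chooses among a few edit operations per character:

```python
def apply_char_edits(text: str, labels: list) -> str:
    """Apply one edit label per character: KEEP, DEL, SUB:x (replace), APP:x (append)."""
    out = []
    for ch, lab in zip(text, labels):
        if lab == "KEEP":
            out.append(ch)
        elif lab == "DEL":
            continue                  # drop the character
        elif lab.startswith("SUB:"):
            out.append(lab[4:])       # replace with the given character
        elif lab.startswith("APP:"):
            out.append(ch + lab[4:])  # keep, then append a character
    return "".join(out)

labels = ["KEEP", "KEEP", "APP:l", "KEEP", "KEEP",
          "KEEP", "SUB:o", "SUB:r", "KEEP", "KEEP"]
print(apply_char_edits("helo wrold", labels))  # -> "hello world"
```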
arXiv Detail & Related papers (2021-09-29T08:04:34Z)
- Grammatical Error Correction as GAN-like Sequence Labeling [45.19453732703053]
We propose a GAN-like sequence labeling model, which consists of a grammatical error detector as a discriminator and a grammatical error labeler with Gumbel-Softmax sampling as a generator.
Our results on several evaluation benchmarks demonstrate that our proposed approach is effective and improves the previous state-of-the-art baseline.
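A minimal PyTorch sketch (toy shapes and linear stand-ins, not the paper's architecture) of the mechanism this setup relies on: Gumbel-Softmax lets the labeler emit discrete-looking samples while gradients from the detector's loss still reach it:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, hidden, num_labels = 8, 16, 5

labeler = torch.nn.Linear(hidden, num_labels)  # "generator": per-token edit labels
detector = torch.nn.Linear(num_labels, 1)      # "discriminator": scores label sequences

states = torch.randn(seq_len, hidden)          # stand-in encoder states
logits = labeler(states)

# hard=True gives one-hot samples in the forward pass but uses the soft
# distribution for the backward pass (straight-through estimator).
samples = F.gumbel_softmax(logits, tau=1.0, hard=True)

loss = detector(samples).mean()                # stand-in adversarial signal
loss.backward()
print(labeler.weight.grad.abs().sum() > 0)     # True: gradients reached the labeler
```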
arXiv Detail & Related papers (2021-05-29T04:39:40Z)
- Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models [15.481446439370343]
We use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation.
We build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set.
Our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set.
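A minimal Python sketch (the tags and corruption rules are invented for illustration; the paper uses ERRANT tags and learned corruption models) of the core recipe: sample an error tag with the development set's frequency and apply a matching corruption:

```python
import random
from collections import Counter

random.seed(0)

# Invented corruption rules keyed by (simplified) error tags.
CORRUPTIONS = {
    "DET": lambda s: s.replace("the ", "", 1),             # drop a determiner
    "PREP": lambda s: s.replace(" on ", " in ", 1),        # swap a preposition
    "NOUN:NUM": lambda s: s.replace("cat ", "cats ", 1),   # break number agreement
}

# Made-up tag counts standing in for those observed on a development set.
dev_tags = Counter({"DET": 50, "PREP": 30, "NOUN:NUM": 20})
tags, weights = zip(*dev_tags.items())

def corrupt(sentence: str):
    """Sample an error tag with dev-set frequency, return (tag, corrupted sentence)."""
    tag = random.choices(tags, weights=weights, k=1)[0]
    return tag, CORRUPTIONS[tag](sentence)

clean = "the cat sat on the mat"
for _ in range(3):
    print(corrupt(clean))  # synthetic errors whose tag mix tracks the dev set
```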
arXiv Detail & Related papers (2021-05-27T17:17:21Z)
- Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second-language learners and produces a wider range of non-native-style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We incorporate contextual information from a pre-trained language model to leverage annotations and benefit multilingual scenarios.
Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) in the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)