Synthetic Data Generation for Grammatical Error Correction with Tagged
Corruption Models
- URL: http://arxiv.org/abs/2105.13318v1
- Date: Thu, 27 May 2021 17:17:21 GMT
- Title: Synthetic Data Generation for Grammatical Error Correction with Tagged
Corruption Models
- Authors: Felix Stahlberg and Shankar Kumar
- Abstract summary: We use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation.
We build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set.
Our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set.
- Score: 15.481446439370343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation is widely known to boost the accuracy of neural
grammatical error correction (GEC) systems, but existing methods often lack
diversity or are too simplistic to generate the broad range of grammatical
errors made by human writers. In this work, we use error type tags from
automatic annotation tools such as ERRANT to guide synthetic data generation.
We compare several models that can produce an ungrammatical sentence given a
clean sentence and an error type tag. We use these models to build a new, large
synthetic pre-training data set with error tag frequency distributions matching
a given development set. Our synthetic data set yields large and consistent
gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We
also show that our approach is particularly effective in adapting a GEC system,
trained on mixed native and non-native English, to a native English test set,
even surpassing real training data consisting of high-quality sentence pairs.
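The tag-frequency matching described in the abstract can be sketched in a few lines: estimate the ERRANT error-tag distribution on a development set, then sample a tag per clean sentence and hand the (sentence, tag) pair to a corruption model. This is a minimal illustration, not the paper's implementation; `toy_corrupt` stands in for a trained tagged corruption model, and all function names here are hypothetical.

```python
import random
from collections import Counter

def tag_distribution(dev_tags):
    """Estimate error-tag frequencies from a development set.

    dev_tags: list of ERRANT-style error-type tags (e.g. "R:VERB:TENSE")
    extracted from the dev set with an automatic annotation tool.
    """
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def sample_tagged_corruptions(clean_sentences, dist, corrupt, seed=0):
    """Draw one tag per clean sentence from `dist`, then call a tagged
    corruption model to produce the ungrammatical source side."""
    rng = random.Random(seed)
    tags = list(dist)
    weights = [dist[t] for t in tags]
    pairs = []
    for sent in clean_sentences:
        tag = rng.choices(tags, weights=weights, k=1)[0]
        # (ungrammatical source, clean target) pair for GEC pre-training
        pairs.append((corrupt(sent, tag), sent))
    return pairs

# Toy stand-in for a trained tagged corruption model: a real model would
# rewrite the sentence to contain an error of the requested type.
def toy_corrupt(sentence, tag):
    return f"<{tag}> {sentence}"

dist = tag_distribution(["R:VERB:TENSE", "M:DET", "M:DET", "R:SPELL"])
pairs = sample_tagged_corruptions(["She walked home ."], dist, toy_corrupt)
```

Because the tags are sampled in proportion to their dev-set frequencies, the resulting synthetic corpus matches the target domain's error-type distribution, which is what makes the approach useful for domain adaptation.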
Related papers
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs [0.0]
We introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets.
We achieve state-of-the-art results on two Turkish Grammatical Error Correction test sets out of the three publicly available ones.
arXiv Detail & Related papers (2024-05-24T08:00:24Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Judge a Sentence by Its Content to Generate Grammatical Errors [0.0]
We propose a learning-based two-stage method for synthetic data generation for grammatical error correction.
We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.
arXiv Detail & Related papers (2022-08-20T14:31:34Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
- Grammatical Error Correction as GAN-like Sequence Labeling [45.19453732703053]
We propose a GAN-like sequence labeling model, which consists of a grammatical error detector as a discriminator and a grammatical error labeler with Gumbel-Softmax sampling as a generator.
Our results on several evaluation benchmarks demonstrate that our proposed approach is effective and improves the previous state-of-the-art baseline.
arXiv Detail & Related papers (2021-05-29T04:39:40Z)
- Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims to simulate mistakes made by second-language learners and produces a wider range of non-native-style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.