Judge a Sentence by Its Content to Generate Grammatical Errors
- URL: http://arxiv.org/abs/2208.09693v1
- Date: Sat, 20 Aug 2022 14:31:34 GMT
- Title: Judge a Sentence by Its Content to Generate Grammatical Errors
- Authors: Chowdhury Rafeed Rahman
- Abstract summary: We propose a learning based two stage method for synthetic data generation for grammatical error correction.
We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data sparsity is a well-known problem for grammatical error correction (GEC).
Generating synthetic training data is one widely proposed solution to this
problem, and has allowed models to achieve state-of-the-art (SOTA) performance
in recent years. However, these methods often generate unrealistic errors, or
aim to generate sentences with only one error. We propose a learning based two
stage method for synthetic data generation for GEC that relaxes this constraint
on sentences containing only one error. Errors are generated in accordance with
sentence merit. We show that a GEC model trained on our synthetically generated
corpus outperforms models trained on synthetic data from prior work.
Related papers
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian)
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree
Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - Do Grammatical Error Correction Models Realize Grammatical
Generalization? [8.569720582920416]
This study explores to what extent GEC models generalize grammatical knowledge required for correcting errors.
We found that a current standard Transformer-based GEC model fails to realize grammatical generalization even in simple settings.
arXiv Detail & Related papers (2021-06-06T04:59:29Z) - Grammatical Error Correction as GAN-like Sequence Labeling [45.19453732703053]
We propose a GAN-like sequence labeling model, which consists of a grammatical error detector as a discriminator and a grammatical error labeler with Gumbel-Softmax sampling as a generator.
Our results on several evaluation benchmarks demonstrate that our proposed approach is effective and improves the previous state-of-the-art baseline.
arXiv Detail & Related papers (2021-05-29T04:39:40Z) - Synthetic Data Generation for Grammatical Error Correction with Tagged
Corruption Models [15.481446439370343]
We use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation.
We build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set.
Our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set.
arXiv Detail & Related papers (2021-05-27T17:17:21Z) - Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second language learners, and produces a wider range of non-native style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.