FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction
- URL: http://arxiv.org/abs/2210.12364v1
- Date: Sat, 22 Oct 2022 06:29:05 GMT
- Title: FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction
- Authors: Lvxiaowei Xu, Jianwang Wu, Jiawei Peng, Jiayu Fu, Ming Cai
- Abstract summary: Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems.
We present FCGEC, a fine-grained corpus to detect, identify and correct grammatical errors.
- Score: 6.116341682577877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grammatical Error Correction (GEC) has recently been broadly applied in
automatic correction and proofreading systems. However, Chinese GEC remains
immature due to the limited category coverage and scale of high-quality data
from native speakers. In this paper, we present FCGEC, a fine-grained corpus to
detect, identify and correct grammatical errors. FCGEC is a human-annotated
corpus with multiple references, consisting of 41,340 sentences collected
mainly from multiple-choice questions in public school Chinese examinations.
Furthermore, we propose a Switch-Tagger-Generator (STG) baseline model to
correct grammatical errors in low-resource settings. Experimental results
illustrate that STG outperforms other GEC benchmark models on our FCGEC.
However, a significant gap remains between benchmark models and human
performance, which we hope future work will bridge.
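The abstract names the three STG stages but not their mechanics. The following is a minimal, hypothetical sketch of what a switch-then-tag-then-generate pipeline could look like; the function names, `TagOp` structure, and toy rule-based inputs are illustrative assumptions (in the paper, the reordering and edit tags would come from learned modules), not the authors' implementation.

```python
# Hypothetical sketch of a Switch -> Tagger -> Generator pipeline (illustrative only).
from dataclasses import dataclass
from typing import List


@dataclass
class TagOp:
    op: str               # "KEEP", "DELETE", or "INSERT"
    insert_text: str = "" # text produced by the generator for INSERT ops


def switch(chars: List[str], order: List[int]) -> List[str]:
    """Switch stage: reorder characters to fix word-order errors.
    A learned model would predict `order`; here it is supplied directly."""
    return [chars[i] for i in order]


def tag_and_generate(chars: List[str], tags: List[TagOp]) -> str:
    """Tagger stage assigns one edit operation per character; the
    Generator stage fills in text for INSERT operations. Both are
    toy stand-ins for learned components."""
    out = []
    for ch, tag in zip(chars, tags):
        if tag.op == "KEEP":
            out.append(ch)
        elif tag.op == "INSERT":
            out.append(ch)
            out.append(tag.insert_text)
        # "DELETE" drops the character entirely.
    return "".join(out)


# Toy example: swap two characters, then delete a stray one.
reordered = switch(list("ba!"), [1, 0, 2])  # -> ['a', 'b', '!']
tags = [TagOp("KEEP"), TagOp("KEEP"), TagOp("DELETE")]
print(tag_and_generate(reordered, tags))    # prints "ab"
```

The appeal of decomposing correction this way is that each stage handles one error family (ordering, deletion, insertion), which is plausibly easier to learn from limited data than free-form rewriting.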
Related papers
- Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on CSW text.
We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora.
We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints.
Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z) - RobustGEC: Robust Grammatical Error Correction Against Subtle Context
Perturbation [64.2568239429946]
We introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems.
We reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations.
arXiv Detail & Related papers (2023-10-11T08:33:23Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
Compared with existing datasets, CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z) - Czech Grammar Error Correction with a Large and Diverse Corpus [64.94696028072698]
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC).
The Grammar Error Correction Corpus for Czech (GECCC) covers four domains, with error distributions ranging from high-error-density essays written by non-native speakers to website texts.
We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research.
arXiv Detail & Related papers (2022-01-14T18:20:47Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree
Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public GEC benchmarks and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - Neural Quality Estimation with Multiple Hypotheses for Grammatical Error
Correction [98.31440090585376]
Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills.
Existing GEC models tend to produce spurious corrections or fail to detect many errors.
This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses.
arXiv Detail & Related papers (2021-05-10T15:04:25Z) - Grammatical Error Correction in Low Error Density Domains: A New
Benchmark and Analyses [17.57265480823457]
We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer grammatical errors than learner essays.
We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains.
arXiv Detail & Related papers (2020-10-15T07:52:01Z) - Adversarial Grammatical Error Correction [2.132096006921048]
We present an adversarial learning approach to Grammatical Error Correction (GEC) using the generator-discriminator framework.
We pre-train both the discriminator and the generator on parallel texts and then fine-tune them further using a policy gradient method.
Experimental results on FCE, CoNLL-14, and BEA-19 datasets show that Adversarial-GEC can achieve competitive GEC quality compared to NMT-based baselines.
arXiv Detail & Related papers (2020-10-06T00:31:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.