Grammatical Error Correction in Low Error Density Domains: A New
Benchmark and Analyses
- URL: http://arxiv.org/abs/2010.07574v1
- Date: Thu, 15 Oct 2020 07:52:01 GMT
- Title: Grammatical Error Correction in Low Error Density Domains: A New
Benchmark and Analyses
- Authors: Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei,
Anders Søgaard
- Abstract summary: We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which presents a challenge to state-of-the-art GEC systems.
We demonstrate that a factor behind this challenge is the inability of systems to rely on a strong internal language model in low error density domains.
- Score: 17.57265480823457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation of grammatical error correction (GEC) systems has primarily
focused on essays written by non-native learners of English, which however is
only part of the full spectrum of GEC applications. We aim to broaden the
target domain of GEC and release CWEB, a new benchmark for GEC consisting of
website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer
grammatical errors than learner essays, which we show presents a challenge to
state-of-the-art GEC systems. We demonstrate that a factor behind this is the
inability of systems to rely on a strong internal language model in low error
density domains. We hope this work shall facilitate the development of
open-domain GEC models that generalize to different topics and genres.
Related papers
- A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction [7.378963590826542]
We present a framework for constructing GEC corpora in low-resource languages.
Specifically, we focus on Indonesian as our research language.
We construct an evaluation corpus for Indonesian GEC using the proposed framework.
arXiv Detail & Related papers (2024-10-28T08:44:56Z)
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and out-of-domain (OOD) scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on code-switched (CSW) text.
We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora.
We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints.
Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation [64.2568239429946]
We introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems.
We reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations.
arXiv Detail & Related papers (2023-10-11T08:33:23Z)
- NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
- FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems.
We present FCGEC, a fine-grained corpus to detect, identify and correct grammatical errors.
arXiv Detail & Related papers (2022-10-22T06:29:05Z)
- Gender Bias and Universal Substitution Adversarial Attacks on Grammatical Error Correction Systems for Automated Assessment [1.4213973379473654]
GEC systems are often used on speech transcriptions of English learners as a form of assessment and feedback.
The count of edits from a candidate's input sentence to a GEC system's grammatically corrected output sentence is indicative of a candidate's language ability.
This work examines a simple universal substitution adversarial attack that non-native speakers of English could realistically employ to deceive GEC systems used for assessment.
arXiv Detail & Related papers (2022-08-19T17:44:13Z)
- Czech Grammar Error Correction with a Large and Diverse Corpus [64.94696028072698]
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC).
The Grammar Error Correction Corpus for Czech (GECCC) covers four domains, with error distributions ranging from high error density essays written by non-native speakers to website texts.
We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research.
arXiv Detail & Related papers (2022-01-14T18:20:47Z)
- Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction [98.31440090585376]
Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills.
Existing GEC models tend to produce spurious corrections or fail to detect many errors.
This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses.
arXiv Detail & Related papers (2021-05-10T15:04:25Z)