Czech Grammar Error Correction with a Large and Diverse Corpus
- URL: http://arxiv.org/abs/2201.05590v1
- Date: Fri, 14 Jan 2022 18:20:47 GMT
- Title: Czech Grammar Error Correction with a Large and Diverse Corpus
- Authors: Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen
- Abstract summary: We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC).
The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts.
We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research.
- Score: 64.94696028072698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a large and diverse Czech corpus annotated for grammatical error
correction (GEC) with the aim to contribute to the still scarce data resources
in this domain for languages other than English. The Grammar Error Correction
Corpus for Czech (GECCC) offers a variety of four domains, covering error
distributions ranging from high error density essays written by non-native
speakers, to website texts, where errors are expected to be much less common.
We compare several Czech GEC systems, including several Transformer-based ones,
setting a strong baseline to future research. Finally, we meta-evaluate common
GEC metrics against human judgements on our data. We make the new Czech GEC
corpus publicly available under the CC BY-SA 4.0 license at
http://hdl.handle.net/11234/1-4639.
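As a point of reference for the GEC metrics mentioned in the abstract, the sketch below computes an F-beta score from edit-level counts. It is a minimal illustration, not the paper's evaluation code: the function name `f_beta` and the counts are hypothetical, and the abstract does not say which metrics were meta-evaluated. The default beta of 0.5 matches the $F_{0.5}$ score cited in the related papers, which weights precision more heavily than recall.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit-level counts (hypothetical helper, not from the paper).

    tp: system edits that match a reference edit
    fp: system edits with no matching reference edit
    fn: reference edits the system missed
    beta < 1 weights precision more heavily than recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts for illustration: 6 correct edits, 2 spurious, 6 missed.
score = f_beta(tp=6, fp=2, fn=6)
print(round(score, 4))
```

With these counts precision is 0.75 and recall is 0.5, so a system that proposes few but mostly correct edits is rewarded more than under the balanced F1 score.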
Related papers
- Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on code-switched (CSW) text.
We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora.
We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints.
Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z)
- RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation [64.2568239429946]
We introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems.
We reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations.
arXiv Detail & Related papers (2023-10-11T08:33:23Z)
- NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
- CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
- FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has recently been widely applied in automatic correction and proofreading systems.
We present FCGEC, a fine-grained corpus to detect, identify and correct grammatical errors.
arXiv Detail & Related papers (2022-10-22T06:29:05Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- Diacritics Restoration using BERT with Analysis on Czech language [3.2729625923640278]
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT.
We conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization.
arXiv Detail & Related papers (2021-05-24T16:58:27Z)
- UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language [0.0]
This is the first grammatical error correction corpus for the Ukrainian language.
Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling.
This corpus can be used for developing and evaluating GEC systems in Ukrainian.
arXiv Detail & Related papers (2021-03-31T11:18:36Z)
- Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses [17.57265480823457]
We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer grammatical errors than learner essays.
We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains.
arXiv Detail & Related papers (2020-10-15T07:52:01Z)
- Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
arXiv Detail & Related papers (2020-10-07T08:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.