Related papers: Czech Grammar Error Correction with a Large and Diverse Corpus

URL: http://arxiv.org/abs/2201.05590v1
Date: Fri, 14 Jan 2022 18:20:47 GMT
Title: Czech Grammar Error Correction with a Large and Diverse Corpus
Authors: Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen
Abstract summary: We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC). The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research.
Score: 64.94696028072698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research. Finally, we meta-evaluate common GEC metrics against human judgements on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.

Related papers

Refining Czech GEC: Insights from a Multi-Experiment Approach [2.4792831876409718]
We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture. A key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors.
arXiv Detail & Related papers (2025-06-27T17:21:40Z)
Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on CSW text. We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints. Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z)
RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation [64.2568239429946]
We introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems. We reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations.
arXiv Detail & Related papers (2023-10-11T08:33:23Z)
NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers. CSCD-NS is ten times larger in scale and exhibits a distinct error distribution. We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems. We present FCGEC, a fine-grained corpus to detect, identify and correct the grammatical errors.
arXiv Detail & Related papers (2022-10-22T06:29:05Z)
A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction. Our approach creates diverse parallel GEC data without any language-specific operations. It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
Diacritics Restoration using BERT with Analysis on Czech language [3.2729625923640278]
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT. We conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization.
arXiv Detail & Related papers (2021-05-24T16:58:27Z)
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language [0.0]
This is the first grammatical error correction corpus for the Ukrainian language. Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling. This corpus can be used for developing and evaluating GEC systems in Ukrainian.
arXiv Detail & Related papers (2021-03-31T11:18:36Z)
Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses [17.57265480823457]
We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains.
arXiv Detail & Related papers (2020-10-15T07:52:01Z)
Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency for Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC). ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans. Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
arXiv Detail & Related papers (2020-10-07T08:29:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.