UA-GEC: Grammatical Error Correction and Fluency Corpus for the
Ukrainian Language
- URL: http://arxiv.org/abs/2103.16997v1
- Date: Wed, 31 Mar 2021 11:18:36 GMT
- Authors: Oleksiy Syvokon and Olena Nahorna
- Abstract summary: This is the first grammatical error correction corpus for the Ukrainian language.
Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling.
This corpus can be used for developing and evaluating GEC systems in Ukrainian.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a corpus professionally annotated for grammatical error correction
(GEC) and fluency edits in the Ukrainian language. To the best of our
knowledge, this is the first GEC corpus for the Ukrainian language. We
collected texts with errors (20,715 sentences) from a diverse pool of
contributors, including both native and non-native speakers. The data cover a
wide variety of writing domains, from text chats and essays to formal writing.
Professional proofreaders corrected and annotated the corpus for errors
relating to fluency, grammar, punctuation, and spelling. This corpus can be
used for developing and evaluating GEC systems in Ukrainian. More generally, it
can be used for researching multilingual and low-resource NLP, morphologically
rich languages, document-level GEC, and fluency correction. The corpus is
publicly available at https://github.com/grammarly/ua-gec
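
To make the idea of an annotated correction concrete, here is a minimal sketch of how token-level edits could be derived from a parallel pair of an erroneous sentence and its corrected version. It is an illustration only: the corpus itself was annotated by professional proofreaders, and this sketch does not reproduce the UA-GEC annotation format, its error categories, or its tooling. The function name `extract_edits` and the English example sentences are hypothetical and were invented for this example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def extract_edits(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return (operation, original_text, corrected_text) for each changed span.

    Tokens are obtained by whitespace splitting, which is a simplification;
    a real GEC pipeline would use a proper tokenizer.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    # autojunk=False disables the popularity heuristic, which only matters
    # for long sequences but keeps behaviour predictable.
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)

    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged span, nothing to record
        original = " ".join(src_tokens[i1:i2])
        corrected = " ".join(tgt_tokens[j1:j2])
        edits.append((op, original, corrected))
    return edits


if __name__ == "__main__":
    # Invented example pair (not taken from the corpus).
    for edit in extract_edits("She went school", "She went to school"):
        print(edit)
```

Each returned tuple names the operation (`replace`, `insert`, or `delete`) and the affected text on both sides. Unlike the proofreaders' annotations, these edits carry no error category such as fluency, grammar, punctuation, or spelling.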
Related papers
- A Language Model for Grammatical Error Correction in L2 Russian [0.3149883354098941]
Grammatical error correction is one of the fundamental tasks in Natural Language Processing.
For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing.
We propose a pipeline involving a language model intended for correcting errors in L2 Russian writing.
(2023-07-04T09:50:13Z)
- CSED: A Chinese Semantic Error Diagnosis Corpus [52.92010408053424]
We study the complicated problem of Chinese Semantic Error Diagnosis (CSED), which lacks relevant datasets.
The study of semantic errors is important because they are very common and may lead to syntactic irregularities or even problems of comprehension.
This paper proposes syntax-aware models to specifically adapt to the CSED task.
(2023-05-09T05:33:31Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
(2022-11-04T12:56:12Z)
- FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has been broadly applied in automatic correction and proofreading system recently.
We present FCGEC, a fine-grained corpus to detect, identify and correct the grammatical errors.
(2022-10-22T06:29:05Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
(2022-01-26T02:10:32Z)
- Czech Grammar Error Correction with a Large and Diverse Corpus [64.94696028072698]
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC).
The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts.
We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research.
(2022-01-14T18:20:47Z)
- YACLC: A Chinese Learner Corpus with Multidimensional Annotation [45.304130762057945]
We construct a large-scale, multidimensional annotated Chinese learner corpus.
By analyzing the original sentences and annotations in the corpus, we found that YACLC has a considerable size and very high annotation quality.
(2021-12-30T13:07:08Z)
- VLGrammar: Grounded Grammar Induction of Vision and Language [86.88273769411428]
We study grounded grammar induction of vision and language in a joint learning framework.
We present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously.
(2021-03-24T04:05:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.