FlaCGEC: A Chinese Grammatical Error Correction Dataset with
Fine-grained Linguistic Annotation
- URL: http://arxiv.org/abs/2311.04906v1
- Date: Tue, 26 Sep 2023 10:22:43 GMT
- Title: FlaCGEC: A Chinese Grammatical Error Correction Dataset with
Fine-grained Linguistic Annotation
- Authors: Hanyue Du, Yike Zhao, Qingyuan Tian, Jiani Wang, Lei Wang, Yunshi Lan,
Xuesong Lu
- Abstract summary: FlaCGEC is a new CGEC dataset featuring fine-grained linguistic annotation.
We collect a raw corpus from the linguistic schema defined by Chinese language experts, apply rule-based edits to sentences, and manually refine the generated samples.
We evaluate various cutting-edge CGEC methods on FlaCGEC; their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese Grammatical Error Correction (CGEC) has attracted growing
attention from researchers recently. Although multiple CGEC datasets have been
developed to support this research, they do not provide a deep linguistic
topology of grammar errors, which is critical for interpreting and diagnosing
CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC
dataset featuring fine-grained linguistic annotation. Specifically, we collect
a raw corpus from the linguistic schema defined by Chinese language experts,
apply rule-based edits to sentences, and manually refine the generated samples,
which results in 10k sentences with 78 instantiated grammar points and 3 types
of edits. We evaluate various cutting-edge CGEC methods on FlaCGEC; their
unremarkable results indicate that the dataset is challenging because it covers
a wide range of grammatical errors. In addition, we treat FlaCGEC as a
diagnostic dataset for testing generalization skills and conduct a thorough
evaluation of existing CGEC models.
Related papers
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- GECTurk: Grammatical Error Correction and Detection Dataset for Turkish [1.804922416527064]
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners.
Synthetic data generation is a common practice to overcome the scarcity of such data.
We present a flexible and synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules.
arXiv Detail & Related papers (2023-09-20T14:25:44Z)
- NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
- Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation [12.15509670220182]
Grammatical error correction (GEC) is a well-explored problem in English.
Research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity.
We present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-24T05:12:58Z)
- CSED: A Chinese Semantic Error Diagnosis Corpus [52.92010408053424]
We study the complicated problem of Chinese Semantic Error Diagnosis (CSED), which lacks relevant datasets.
The study of semantic errors is important because they are very common and may lead to syntactic irregularities or even problems of comprehension.
This paper proposes syntax-aware models to specifically adapt to the CSED task.
arXiv Detail & Related papers (2023-05-09T05:33:31Z)
- Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction [36.74272211767197]
We propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors.
We present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios.
arXiv Detail & Related papers (2022-10-19T10:20:39Z)
- MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction [51.3754092853434]
MuCGEC is a multi-reference evaluation dataset for Chinese Grammatical Error Correction (CGEC).
It consists of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources.
Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
arXiv Detail & Related papers (2022-04-23T05:20:38Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction [30.917993017459615]
We present a novel parallel grammatical error correction (GEC) dataset drawn from open-domain conversations.
This dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting.
To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model.
arXiv Detail & Related papers (2021-12-15T20:27:40Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)