Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical
Error Correction
- URL: http://arxiv.org/abs/2210.10442v1
- Date: Wed, 19 Oct 2022 10:20:39 GMT
- Title: Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical
Error Correction
- Authors: Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding
Zhang, Yangning Li, Ruiyang Liu, Zhongli Li, Yunbo Cao, Haitao Zheng and Ying
Shen
- Abstract summary: We propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors.
We present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios.
- Score: 36.74272211767197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task
and a common application in human daily life. Recently, many data-driven
approaches have been proposed for the development of CGEC research. However, there
are two major limitations in the CGEC field: First, the lack of high-quality
annotated training corpora prevents the performance of existing CGEC models
from being significantly improved. Second, the grammatical errors in widely
used test sets are not made by native Chinese speakers, resulting in a
significant gap between CGEC models and real-world applications. In this
paper, we propose a linguistic rules-based approach to construct large-scale
CGEC training corpora with automatically generated grammatical errors.
Additionally, we present a challenging CGEC benchmark derived entirely from
errors made by native Chinese speakers in real-world scenarios. Extensive
experiments and detailed analyses demonstrate not only that the training data
constructed by our method effectively improves the performance of CGEC models,
but also that our benchmark is an excellent resource for the further
development of the CGEC field.
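The corpus-construction idea described in the abstract can be illustrated with a minimal sketch: apply simple error-injection rules (e.g. redundancy, omission, word-order swap) to correct sentences to produce (noisy, clean) training pairs. Note this is a hypothetical illustration, not the paper's actual rule set; the paper's rules are linguistically motivated and far more refined, and the particle list and character-level tokenization below are assumptions for brevity.

```python
import random

# Hypothetical sketch of rule-based error injection for CGEC corpus
# construction. Three toy rules illustrate the pipeline:
# correct sentence -> rule -> (noisy, clean) training pair.

REDUNDANT_PARTICLES = ["了", "的", "是"]  # assumed examples of insertable tokens

def inject_redundancy(tokens, rng):
    """Insert a redundant particle at a random position."""
    i = rng.randrange(len(tokens) + 1)
    return tokens[:i] + [rng.choice(REDUNDANT_PARTICLES)] + tokens[i:]

def inject_omission(tokens, rng):
    """Drop one token to simulate a missing-word error."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def inject_swap(tokens, rng):
    """Swap two adjacent tokens to simulate a word-order error."""
    if len(tokens) < 2:
        return tokens[:]
    i = rng.randrange(len(tokens) - 1)
    out = tokens[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

RULES = [inject_redundancy, inject_omission, inject_swap]

def make_pair(sentence, seed=0):
    """Return a (noisy, clean) pair generated from one correct sentence."""
    rng = random.Random(seed)
    tokens = list(sentence)  # naive character-level tokenization
    noisy = rng.choice(RULES)(tokens, rng)
    return "".join(noisy), sentence

noisy, clean = make_pair("我昨天去了图书馆")
```

In practice, each rule would target a specific native-speaker error type, and a real system would operate on words or syntactic units rather than characters.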
Related papers
- Robust ASR Error Correction with Conservative Data Filtering [15.833428810891427]
Error correction (EC) based on large language models is an emerging technology to enhance the performance of automatic speech recognition (ASR) systems.
We propose two fundamental criteria that EC training data should satisfy.
We identify low-quality EC pairs and train the models not to make any correction in such cases.
arXiv Detail & Related papers (2024-07-18T09:05:49Z)
- Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction [62.409807640887834]
Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences.
LLMs' performance as correctors on CGEC remains unsatisfactory due to the task's challenging focus.
We rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored in CGEC.
arXiv Detail & Related papers (2024-02-18T01:40:34Z) - Grammatical Error Correction via Mixed-Grained Weighted Training [68.94921674855621]
Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural texts.
MainGEC designs token-level and sentence-level training weights based on inherent discrepancies in accuracy and potential diversity of data annotation.
arXiv Detail & Related papers (2023-11-23T08:34:37Z) - FlaCGEC: A Chinese Grammatical Error Correction Dataset with
Fine-grained Linguistic Annotation [11.421545095092815]
FlaCGEC is a new CGEC dataset featured with fine-grained linguistic annotation.
We collect a raw corpus based on the linguistic schema defined by Chinese language experts, edit sentences via rules, and refine the generated samples manually.
We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset; their unremarkable results indicate that the dataset is challenging and covers a wide range of grammatical errors.
arXiv Detail & Related papers (2023-09-26T10:22:43Z) - Focus Is What You Need For Chinese Grammatical Error Correction [17.71297141482757]
We argue that even though this is a very reasonable hypothesis, it is too harsh for the intelligence of the mainstream models in this era.
We propose a simple yet effective training strategy called OneTarget to improve the focus ability of the CGEC models.
arXiv Detail & Related papers (2022-10-23T10:44:50Z) - MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese
Grammatical Error Correction [51.3754092853434]
MuCGEC is a multi-reference evaluation dataset for Chinese Grammatical Error Correction (CGEC).
It consists of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources.
Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
arXiv Detail & Related papers (2022-04-23T05:20:38Z) - A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
This list is automatically generated from the titles and abstracts of the papers in this site.