A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction
- URL: http://arxiv.org/abs/2410.20838v1
- Date: Mon, 28 Oct 2024 08:44:56 GMT
- Title: A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction
- Authors: Nankai Lin, Meiyu Zeng, Wentao Huang, Shengyi Jiang, Lixian Xiao, Aimin Yang
- Abstract summary: We present a framework for constructing GEC corpora in low-resource languages.
Specifically, we focus on Indonesian as our research language.
We construct an evaluation corpus for Indonesian GEC using the proposed framework.
- Score: 7.378963590826542
- Abstract: Currently, the majority of research in grammatical error correction (GEC) concentrates on high-resource languages such as English and Chinese, and many low-resource languages lack accessible evaluation corpora. Efficiently constructing high-quality GEC evaluation corpora for low-resource languages therefore remains a significant challenge. To fill this gap, in this paper we present a framework for constructing GEC corpora. Specifically, we focus on Indonesian as our research language and use the proposed framework to build an evaluation corpus for Indonesian GEC, addressing the limitations of existing Indonesian evaluation corpora. Furthermore, we investigate the feasibility of using existing large language models (LLMs), such as GPT-3.5-Turbo and GPT-4, to streamline corpus annotation efforts in GEC tasks. The results demonstrate significant potential for enhancing the performance of LLMs in low-resource language settings. Our code and corpus can be obtained from https://github.com/GKLMIP/GEC-Construction-Framework.
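An evaluation corpus like the one described above is used to score GEC systems, conventionally with edit-based F0.5, which weights precision more heavily than recall. The sketch below is a minimal, hypothetical illustration of that metric (the edit tuples and scoring function are assumptions for demonstration, not the tooling released with the paper):

```python
# Minimal sketch of edit-based GEC evaluation with F0.5.
# Edits are represented as (start, end, replacement) tuples over a
# tokenized sentence; real toolkits extract these automatically.

def f_beta(system_edits, gold_edits, beta=0.5):
    """Compute precision, recall, and F_beta over two sets of edits."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)  # edits the system proposed that match the gold annotation
    precision = tp / len(system) if system else 1.0
    recall = tp / len(gold) if gold else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Hypothetical edits for one Indonesian sentence: the system finds one
# of the two gold corrections and proposes one spurious edit.
gold = [(0, 1, "Dia"), (3, 4, "pergi")]
system = [(0, 1, "Dia"), (5, 6, "ke")]
p, r, f05 = f_beta(system, gold)  # p = 0.5, r = 0.5, f05 = 0.5
```

Because beta = 0.5, false positives (unnecessary "corrections") hurt the score more than missed errors, which matches how GEC systems are usually judged.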
Related papers
- Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction [14.822205658480813]
Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks.
This study investigates the performance of LLMs in grammatical error correction (GEC) evaluation by employing prompts inspired by previous research.
arXiv Detail & Related papers (2024-03-26T09:43:15Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A BERT-based Unsupervised Grammatical Error Correction Framework [9.431453382607845]
Grammatical error correction (GEC) is a challenging natural language processing task.
For low-resource languages, current unsupervised GEC approaches based on language model scoring perform well.
This study proposes a BERT-based unsupervised GEC framework in which GEC is viewed as a multi-class classification task.
arXiv Detail & Related papers (2023-03-30T13:29:49Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- RuCoLA: Russian Corpus of Linguistic Acceptability [6.500438378175089]
We introduce the Russian Corpus of Linguistic Acceptability (RuCoLA)
RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.
We demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors.
arXiv Detail & Related papers (2022-10-23T18:29:22Z)
- Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction [36.74272211767197]
We propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors.
We present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios.
arXiv Detail & Related papers (2022-10-19T10:20:39Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- LM-Critic: Language Models for Unsupervised Grammatical Error Correction [128.9174409251852]
We show how to leverage a pretrained language model (LM) to define an LM-Critic, which judges whether a sentence is grammatical.
We apply the LM-Critic together with BIFI (Break-It-Fix-It) and a large set of unlabeled sentences to bootstrap realistic ungrammatical/grammatical pairs for training a corrector.
arXiv Detail & Related papers (2021-09-14T17:06:43Z)
- Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses [17.57265480823457]
We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer grammatical errors than learner essays.
We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains.
arXiv Detail & Related papers (2020-10-15T07:52:01Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the scarcity of annotated data in low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.