KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models
- URL: http://arxiv.org/abs/2506.11432v1
- Date: Fri, 13 Jun 2025 03:10:15 GMT
- Title: KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models
- Authors: Taeeun Kim, Semin Jeong, Youngsook Song
- Abstract summary: This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4o and HCX-3. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks.
- Score: 0.8225943723988819
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4o and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an "LLM as judge" method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found that it decreased model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.
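The paper casts GEC as translation, so a minimal fine-tuning step can be sketched with the Hugging Face transformers NLLB checkpoints. This is an assumption-laden illustration, not the authors' code: the checkpoint size, the reuse of the kor_Hang language code on both sides (the paper instead adds special tokens to distinguish original from corrected Korean), the example pair, and all hyperparameters are hypothetical.

```python
# Hedged sketch: Korean GEC as "translation" with NLLB.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"  # assumed checkpoint size
# The paper uses special language tokens for original vs. corrected Korean;
# here we simply reuse kor_Hang on both sides.
tok = AutoTokenizer.from_pretrained(MODEL, src_lang="kor_Hang", tgt_lang="kor_Hang")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# One (erroneous, corrected) pair; real training uses the two
# social-media conversation datasets mentioned in the abstract.
src = "오늘 날씨가 너무 좋아써요"  # hypothetical erroneous sentence
tgt = "오늘 날씨가 너무 좋았어요"  # its correction

batch = tok(src, text_target=tgt, return_tensors="pt", truncation=True)
loss = model(**batch).loss  # standard seq2seq cross-entropy
loss.backward()             # one fine-tuning step (optimizer omitted)

# Inference: force the decoder to start with the Korean language token.
with torch.no_grad():
    out = model.generate(
        **tok(src, return_tensors="pt"),
        forced_bos_token_id=tok.convert_tokens_to_ids("kor_Hang"),
        max_length=64,
    )
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```

The generated corrections can then be scored against references with a standard BLEU toolkit such as sacrebleu, matching the paper's automatic evaluation.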
Related papers
- Refining Czech GEC: Insights from a Multi-Experiment Approach [2.4792831876409718]
We present a grammar error correction (GEC) system that achieves state-of-the-art results for the Czech language.
Our system is based on a neural-network translation approach with the Transformer architecture.
A key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors (a toy sketch of such augmentation appears below).
arXiv Detail & Related papers (2025-06-27T17:21:40Z)
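The defining feature of the Czech system above is on-the-fly error synthesis. A toy version of that idea, with made-up corruption operations and rates rather than the paper's actual pipeline:

```python
# Toy sketch of dynamic synthetic-error augmentation: corrupt a clean
# sentence on the fly so (corrupted, clean) pairs can train a GEC model.
import random

def corrupt(sentence: str, p: float = 0.15) -> str:
    """Randomly drop, duplicate, or swap tokens with probability p."""
    tokens = sentence.split()
    out = []
    for i, tok in enumerate(tokens):
        r = random.random()
        if r < p / 3:
            continue                    # deletion error
        elif r < 2 * p / 3:
            out += [tok, tok]           # duplication error
        elif r < p and i + 1 < len(tokens):
            tokens[i + 1], tok = tok, tokens[i + 1]  # swap with next token
            out.append(tok)
        else:
            out.append(tok)
    return " ".join(out)

clean = "the quick brown fox jumps over the lazy dog"
print(corrupt(clean))  # e.g. "the quick brown jumps fox over the the lazy dog"
```

Because the corruption happens inside the data loader, every epoch sees fresh error patterns, which is the point of doing it in real time.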
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become "experts" in speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model (see the task-routing sketch below).
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
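Both NeKo above and MoECE below route inputs through task- or error-specialized experts. A deliberately minimal PyTorch sketch of hard task-based routing; the sizes, the routing rule, and the residual placement are all assumptions:

```python
# Minimal sketch of task-oriented expert routing in the spirit of NeKo:
# each task (e.g. ASR, OCR, grammar correction) owns one feed-forward expert.
import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    def __init__(self, d_model: int = 256, n_tasks: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_tasks)
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Hard routing: every token in the batch goes to its task's expert.
        return x + self.experts[task_id](x)  # residual connection

layer = TaskMoE()
hidden = torch.randn(2, 10, 256)   # (batch, seq, d_model)
out = layer(hidden, task_id=1)     # route through, say, the OCR expert
print(out.shape)                   # torch.Size([2, 10, 256])
```

A learned router (e.g. a softmax over expert logits) would replace the hard task id in a production system.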
- Efficient and Interpretable Grammatical Error Correction with Mixture of Experts [33.748193858033346]
We propose a mixture-of-experts model, MoECE, for grammatical error correction.
Our model successfully achieves the performance of T5-XL with three times fewer effective parameters.
arXiv Detail & Related papers (2024-10-30T23:27:54Z)
- LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose LM-Combiner, a rewriting model that can directly revise the over-corrections in GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging (illustrated in the sketch below).
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
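A toy illustration of the ReLM recipe above, assuming a BERT-style masked LM where each Chinese character is one token; the mask-slot layout and the example sentences are hypothetical:

```python
# ReLM-style training example: append mask slots after the source sentence
# and train a masked LM to infill the whole corrected sentence.
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")

src = "我喜欢吃平果"  # hypothetical sentence with a spelling error
tgt = "我喜欢吃苹果"  # its correction

# Input: [CLS] src [SEP] [MASK] * len(tgt) [SEP]
text = src + tok.sep_token + tok.mask_token * len(tgt)
enc = tok(text, return_tensors="pt")

# Labels: ignored (-100) everywhere except the mask slots, which hold
# the ids of the corrected characters.
labels = enc.input_ids.clone().fill_(-100)
tgt_ids = tok(tgt, add_special_tokens=False).input_ids
mask_positions = (enc.input_ids == tok.mask_token_id).nonzero()[:, 1]
labels[0, mask_positions] = torch.tensor(tgt_ids)
# (enc and labels can now be fed to an AutoModelForMaskedLM.)
```

At inference the model infills the slots, and the infilled characters are read off as the corrected sentence.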
- Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens in the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model (see the sketch below).
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
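The 20% masking recipe is simple enough to sketch as a data-preparation step. This toy version assumes character-aligned source/target pairs; the rate and mask token follow the summary above:

```python
# Hedged sketch: during CSC fine-tuning, randomly mask 20% of the
# *non-error* input characters so the model keeps acting as a language model.
import random

MASK = "[MASK]"

def mask_non_errors(src: list[str], tgt: list[str], rate: float = 0.2):
    """src/tgt are aligned character lists; an error is any position where
    they differ. Only positions where they agree may be masked."""
    masked = []
    for s, t in zip(src, tgt):
        if s == t and random.random() < rate:  # non-error position
            masked.append(MASK)
        else:
            masked.append(s)                   # keep error characters intact
    return masked

src = list("我喜欢吃平果")  # "平" is the spelling error
tgt = list("我喜欢吃苹果")
print(mask_non_errors(src, tgt))  # e.g. ['我', '[MASK]', '欢', '吃', '平', '果']
```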
- Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction? [10.302225525539003]
We explore several ensemble strategies based on strong PLMs with four sophisticated single models.
Performance does not improve, and even degrades, after applying the PLM-based ensemble (one such ensemble strategy is sketched below).
arXiv Detail & Related papers (2023-05-24T14:18:52Z)
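One concrete strategy of the kind explored above is to let a pre-trained causal LM rerank candidate corrections by perplexity; note the paper's finding is that such ensembles can underperform. A hedged sketch with GPT-2 as a stand-in PLM and hypothetical candidates:

```python
# PLM-as-reranker ensemble: several GEC systems each propose a correction,
# and a causal LM picks the candidate with the lowest average token NLL.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in PLM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def nll(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()  # mean token NLL

candidates = [                         # hypothetical single-model outputs
    "He go to school yesterday.",      # system A
    "He went to school yesterday.",    # system B
]
print(min(candidates, key=nll))        # LM-preferred candidate
```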
- CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
CSCD-NS is ten times larger in scale than prior datasets and exhibits a distinct error distribution.
We propose a novel data-generation method that simulates the input process through an input method editor (sketched below).
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
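A toy rendering of input-method-based error simulation: group characters by pinyin (via the third-party pypinyin package) and occasionally mis-select a homophone, mimicking an IME typo. The vocabulary, rate, and grouping rule are illustrative assumptions, not the paper's method:

```python
# IME-style corruption: swap characters within a shared-pinyin group.
import random
from pypinyin import lazy_pinyin  # pip install pypinyin

VOCAB = "是时市事十识世室视试"  # tiny illustrative character vocabulary

# Group vocabulary characters by their (toneless) pinyin reading.
groups: dict[str, list[str]] = {}
for ch in VOCAB:
    groups.setdefault(lazy_pinyin(ch)[0], []).append(ch)

def ime_corrupt(sentence: str, p: float = 0.2) -> str:
    out = []
    for ch in sentence:
        same = groups.get(lazy_pinyin(ch)[0], [ch])
        out.append(random.choice(same) if random.random() < p else ch)
    return "".join(out)

print(ime_corrupt("市是时事"))  # e.g. "市是十事"
```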
- Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation [26.48270086631483]
We provide datasets that cover a wide range of Korean grammatical errors.
We then define 14 error types for Korean and provide KAGAS, which can automatically annotate error types from parallel corpora (a toy edit-extraction sketch follows this entry).
We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types.
arXiv Detail & Related papers (2022-10-25T23:41:52Z)
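KAGAS itself assigns one of 14 Korean-specific error types; as a starting point, the raw edits it classifies can be extracted from a parallel pair with difflib. This sketch stops at untyped edits and uses hypothetical sentences:

```python
# Extract character-level edits from an (erroneous, corrected) pair.
from difflib import SequenceMatcher

def extract_edits(src: str, tgt: str):
    """Yield (operation, source_span, target_span) for each edit."""
    sm = SequenceMatcher(a=src, b=tgt)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            yield op, src[i1:i2], tgt[j1:j2]

src = "오늘 날씨가 너무 좋아써요"
tgt = "오늘 날씨가 너무 좋았어요"
for edit in extract_edits(src, tgt):
    print(edit)  # e.g. ('replace', '아써', '았어')
```

A real annotator would then map each edit to an error type (spelling, spacing, punctuation, and so on) using language-specific rules.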
- FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has been broadly applied in automatic correction and proofreading systems recently.
We present FCGEC, a fine-grained corpus to detect, identify, and correct grammatical errors.
arXiv Detail & Related papers (2022-10-22T06:29:05Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is the task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts a graph attention mechanism to exploit the syntactic knowledge of dependency trees (a minimal sketch follows this entry).
We evaluate our model on public GEC benchmarks, and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
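A minimal sketch of the syntax-guided idea in SG-GEC: plain scaled dot-product attention restricted to dependency-tree neighbours. The masking rule, undirected arcs, and sizes are assumptions for illustration:

```python
# Dependency-masked attention: each token attends only to its
# dependency-tree neighbours and itself.
import torch
import torch.nn.functional as F

def dependency_attention(h: torch.Tensor, edges: list[tuple[int, int]]):
    """h: (seq, d) token states; edges: dependency arcs as (head, dep)."""
    n, d = h.shape
    mask = torch.eye(n, dtype=torch.bool)         # self-loops
    for head, dep in edges:
        mask[head, dep] = mask[dep, head] = True  # treat arcs as undirected
    scores = (h @ h.T) / d ** 0.5                 # attention logits
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ h          # syntax-masked attention

h = torch.randn(5, 16)                    # 5 tokens, hidden size 16
edges = [(1, 0), (1, 2), (3, 1), (3, 4)]  # toy dependency tree
print(dependency_attention(h, edges).shape)  # torch.Size([5, 16])
```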