Towards standardizing Korean Grammatical Error Correction: Datasets and
Annotation
- URL: http://arxiv.org/abs/2210.14389v3
- Date: Wed, 24 May 2023 10:46:52 GMT
- Title: Towards standardizing Korean Grammatical Error Correction: Datasets and
Annotation
- Authors: Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park,
Gyutae Kim, Minjoon Seo and Alice Oh
- Abstract summary: We provide datasets that cover a wide range of Korean grammatical errors.
We then define 14 error types for Korean and provide KAGAS, which can automatically annotate error types from parallel corpora.
We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types.
- Score: 26.48270086631483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on Korean grammatical error correction (GEC) is limited, compared to
other major languages such as English. We attribute this problematic
circumstance to the lack of a carefully designed evaluation benchmark for
Korean GEC. In this work, we collect three datasets from different sources
(Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean
grammatical errors. Considering the nature of Korean grammar, we then define 14
error types for Korean and provide KAGAS (Korean Automatic Grammatical error
Annotation System), which can automatically annotate error types from parallel
corpora. We use KAGAS on our datasets to make an evaluation benchmark for
Korean, and present baseline models trained from our datasets. We show that the
model trained with our datasets significantly outperforms the currently used
statistical Korean GEC system (Hanspell) on a wider range of error types,
demonstrating the diversity and usefulness of the datasets. The implementations
and datasets are open-sourced.
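As a rough illustration of what annotating errors "from parallel corpora" involves, the sketch below aligns an erroneous sentence with its correction and lists the differing spans. It is a minimal sketch, not KAGAS itself: the function name `extract_edits` is hypothetical, whitespace tokenization and Python's `difflib` alignment are assumptions made for brevity, and no error-type classification (the 14 types defined in the paper) is attempted.

```python
from difflib import SequenceMatcher


def extract_edits(source, target):
    """Return (operation, source_span, target_span) tuples for a sentence pair.

    Hypothetical helper for illustration only: tokens are split on
    whitespace and aligned with difflib, which is an assumption and not
    the alignment method used by KAGAS.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, nothing to annotate
        edits.append(
            (tag, " ".join(src_tokens[i1:i2]), " ".join(tgt_tokens[j1:j2]))
        )
    return edits


# A spacing error: the source splits one word into two tokens.
print(extract_edits("나는 학교에 갔 다", "나는 학교에 갔다"))
```

A real annotation system would then assign each extracted span an error type; that step depends on Korean morphological analysis and is outside the scope of this sketch.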
Related papers
- Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers [7.275938266030414]
Syntactic elements, such as word order and case markers, are fundamental in natural language processing.
This study explores whether Korean language models can accurately capture this flexibility.
arXiv Detail & Related papers (2024-07-12T11:33:41Z)
- CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean [18.526285276022907]
We introduce CLIcK, a benchmark dataset of Cultural and Linguistic Intelligence in Korean comprising 1,995 QA pairs.
CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture.
Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performances across the categories, as well as the diverse factors affecting their comprehension.
arXiv Detail & Related papers (2024-03-11T03:54:33Z)
- CSED: A Chinese Semantic Error Diagnosis Corpus [52.92010408053424]
We study the complicated problem of Chinese Semantic Error Diagnosis (CSED), which lacks relevant datasets.
The study of semantic errors is important because they are very common and may lead to syntactic irregularities or even problems of comprehension.
This paper proposes syntax-aware models to specifically adapt to the CSED task.
arXiv Detail & Related papers (2023-05-09T05:33:31Z)
- CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
- FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction [6.116341682577877]
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems.
We present FCGEC, a fine-grained corpus to detect, identify and correct the grammatical errors.
arXiv Detail & Related papers (2022-10-22T06:29:05Z)
- MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction [51.3754092853434]
MuCGEC is a multi-reference evaluation dataset for Chinese Grammatical Error Correction (CGEC).
It consists of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources.
Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
arXiv Detail & Related papers (2022-04-23T05:20:38Z)
- KOBEST: Korean Balanced Evaluation of Significant Tasks [3.664687661363732]
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field.
We propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks.
arXiv Detail & Related papers (2022-04-09T20:13:51Z)
- Learning How to Translate North Korean through South Korean [24.38451366384134]
South and North Korea both use the Korean language.
Existing NLP systems of the Korean language cannot handle North Korean inputs.
We create data for North Korean NMT models using a comparable corpus.
We verify that a model trained by North Korean bilingual data without human annotation can significantly boost North Korean translation accuracy.
arXiv Detail & Related papers (2022-01-27T01:21:29Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of the GEC task, and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words that have long-range dependencies or that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.