Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring
- URL: http://arxiv.org/abs/2505.00261v1
- Date: Thu, 01 May 2025 03:04:07 GMT
- Title: Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring
- Authors: Jayoung Song, KyungTae Lim, Jungyeul Park
- Abstract summary: We enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction references. We enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute.
- Score: 2.824980053889876
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite growing global interest in Korean language education, there remains a significant lack of learner corpora tailored to Korean L2 writing. To address this gap, we enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction (GEC) references, thereby enabling more nuanced and flexible evaluation of GEC systems and reflecting the variability of human language. Additionally, we enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute, capturing grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust and standardized resource for research in Korean L2 education, supporting advancements in language learning, assessment, and automated error correction.
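To make the multi-reference evaluation concrete, below is a minimal, hypothetical Python sketch (not KoLLA's official scorer): a system correction is compared against every human reference and the best match is kept, so a system is not penalized for producing a correction that differs from one annotator but matches another. The similarity metric, function name, and example sentences are illustrative assumptions.

```python
from difflib import SequenceMatcher

def best_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score a GEC system output against several gold references.

    Each reference is treated as an equally valid correction; the highest
    character-level similarity is kept, so a system is not penalized for
    choosing a different, but still acceptable, correction.
    (Illustrative metric only -- not the official KoLLA scorer.)
    """
    return max(SequenceMatcher(None, hypothesis, ref).ratio() for ref in references)

# Hypothetical example: two annotators corrected the same learner sentence
# differently; output matching either reference scores highly.
references = [
    "저는 어제 친구를 만났어요.",  # correction from annotator A (assumed)
    "나는 어제 친구를 만났어요.",  # correction from annotator B (assumed)
]
system_output = "나는 어제 친구를 만났어요."
print(round(best_reference_score(system_output, references), 3))  # 1.0
```

Standard multi-reference GEC metrics follow the same principle of crediting the reference that best matches the system output, which is why multiple references allow a more flexible evaluation than a single gold correction.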
Related papers
- Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs [7.924819546105335]
We propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language.
arXiv Detail & Related papers (2024-10-16T10:49:22Z) - RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing.
RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline.
Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z) - Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z) - Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly [53.04368883943773]
Two approaches are proposed to improve cross-lingual knowledge alignment: multilingual pretraining and multilingual instruction tuning.
We propose CLiKA to assess the cross-lingual knowledge alignment of LLMs at the Performance, Consistency, and Conductivity levels.
Results show that while both multilingual pretraining and instruction tuning are beneficial for cross-lingual knowledge alignment, the training strategy needs to be carefully designed.
arXiv Detail & Related papers (2024-04-06T15:25:06Z) - HyperCLOVA X Technical Report [119.94633129762133]
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture.
HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets.
The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English.
arXiv Detail & Related papers (2024-04-02T13:48:49Z) - HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models [0.0]
We introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth.
The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension.
arXiv Detail & Related papers (2023-09-06T04:38:16Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - YACLC: A Chinese Learner Corpus with Multidimensional Annotation [45.304130762057945]
We construct a large-scale, multidimensional annotated Chinese learner corpus.
By analyzing the original sentences and annotations in the corpus, we found that YACLC has a considerable size and very high annotation quality.
arXiv Detail & Related papers (2021-12-30T13:07:08Z) - LXPER Index 2.0: Improving Text Readability Assessment Model for L2 English Students in Korea [1.7006003864727408]
This paper investigates a text readability assessment model for L2 English learners in Korea.
We train our model with CoKEC-text and significantly improve the accuracy of readability assessment for texts in the Korean ELT curriculum.
arXiv Detail & Related papers (2020-10-26T07:03:14Z) - Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. A minimal, hypothetical sketch of approach (ii) is given after this list.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
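As an illustration of approach (ii) from the last entry above (removing language-specific means and variances from multilingual embeddings), here is a minimal numpy sketch under assumed array shapes and toy data; it is not the paper's code, only the general normalization idea.

```python
import numpy as np

def remove_language_bias(embeddings_by_lang: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Center and scale each language's embeddings independently.

    Subtracting the per-language mean and dividing by the per-language
    standard deviation removes language-identity signal that otherwise
    dominates cross-lingual similarity (sketch of approach (ii) only).
    """
    normalized = {}
    for lang, emb in embeddings_by_lang.items():
        mean = emb.mean(axis=0, keepdims=True)
        std = emb.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
        normalized[lang] = (emb - mean) / std
    return normalized

# Toy data: random "embeddings" with different offsets per language (assumed).
rng = np.random.default_rng(0)
embeddings = {
    "en": rng.normal(loc=0.5, scale=1.0, size=(100, 32)),
    "ko": rng.normal(loc=-0.3, scale=2.0, size=(100, 32)),
}
for lang, emb in remove_language_bias(embeddings).items():
    print(lang, round(float(emb.mean()), 3), round(float(emb.std()), 3))  # ~0.0, ~1.0
```

Centering each language at its own mean removes the language-identity offset that otherwise tends to dominate nearest-neighbour retrieval across languages, which is the intuition behind this family of post-processing methods.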