Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean
- URL: http://arxiv.org/abs/2506.01237v1
- Date: Mon, 02 Jun 2025 01:27:46 GMT
- Title: Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean
- Authors: SungHo Kim, Nayeon Kim, Taehee Jeon, SangKeun Lee
- Abstract summary: KoGEM is designed to assess the linguistic competence of LLMs and humans in Korean. It consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories.
- Score: 8.072947878765941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the $\underline{Ko}rean \underline{G}rammar \underline{E}valuation Bench\underline{M}ark (KoGEM)$, designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover previously hidden facets of that competence, paving the way toward more comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.
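As a rough illustration of the zero-shot setup described in the abstract, the sketch below scores a model on multiple-choice items and aggregates accuracy per subcategory. It is a minimal sketch under assumed conventions: the JSON field names (question, choices, answer, subcategory) and the query_model helper are hypothetical placeholders, not KoGEM's actual data format or the authors' code; see the repository linked above for the real implementation.

```python
# A minimal sketch of a zero-shot multiple-choice evaluation loop.
# Assumptions (not KoGEM's actual format or the authors' code): the JSON
# field names "question", "choices", "answer", "subcategory", and the
# query_model() helper, which stands in for any LLM API call.
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns the model's raw text output."""
    raise NotImplementedError

def evaluate(path: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        # Zero-shot: the prompt holds only the question and its options,
        # with no in-context examples.
        options = "\n".join(
            f"{i + 1}. {c}" for i, c in enumerate(item["choices"])
        )
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with the number of the correct option only."
        )
        prediction = query_model(prompt).strip()
        total[item["subcategory"]] += 1
        if prediction == str(item["answer"]):
            correct[item["subcategory"]] += 1
    # Per-subcategory accuracy, mirroring the paper's category-level analysis.
    return {cat: correct[cat] / total[cat] for cat in total}
```

Running such a loop over all 1.5k pairs would yield the kind of per-subcategory breakdown the paper uses to separate definitional from experiential knowledge.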
Related papers
- EXECUTE: A Multilingual Benchmark for LLM Token Understanding [54.70665106141121]
Tests across multiple languages reveal that challenges in other languages are not always on the character level as in English. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
arXiv Detail & Related papers (2025-05-23T11:56:48Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - LLMzSzŁ: a comprehensive LLM benchmark for Polish [1.147194267316659]
This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. Altogether, it consists of almost 19k closed-ended questions.
arXiv Detail & Related papers (2025-01-04T12:04:46Z) - Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching [14.841981996951395]
Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to the dominance of English in training corpora. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances. Our results demonstrate that, compared to English text, CS can faithfully activate knowledge inside LLMs, especially in language-specific domains.
arXiv Detail & Related papers (2024-10-24T05:14:03Z) - Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean.
We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions [49.97641297850361]
LINGOLLM is a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training.
We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages.
Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions.
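To make the in-context-description idea concrete, here is a minimal sketch of one plausible reading of the approach: gloss each word with a dictionary entry and prepend grammar notes, so the model can work from explicit linguistic descriptions rather than pre-training exposure. The function and its inputs are illustrative assumptions, not the paper's actual pipeline or prompt format.

```python
# Illustrative sketch of prompting with in-context linguistic descriptions,
# in the spirit of LINGOLLM; the dictionary and grammar inputs are assumed
# placeholders, not the paper's actual resources or prompt format.
def build_prompt(sentence: str, dictionary: dict[str, str], grammar_notes: str) -> str:
    # Gloss each word so the model can rely on explicit descriptions
    # instead of (absent) pre-training exposure to the language.
    glosses = "\n".join(
        f"{word}: {dictionary.get(word, '(no entry)')}" for word in sentence.split()
    )
    return (
        "Grammar notes:\n" + grammar_notes + "\n\n"
        "Word glosses:\n" + glosses + "\n\n"
        "Translate the sentence into English: " + sentence
    )
```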
arXiv Detail & Related papers (2024-02-28T03:44:01Z) - OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z) - How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)