From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
- URL: http://arxiv.org/abs/2507.08924v2
- Date: Fri, 18 Jul 2025 09:31:19 GMT
- Title: From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
- Authors: Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee
- Abstract summary: We introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea.
- Score: 3.7217185777150497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We make our datasets publicly available.
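Since the abstract states only that the benchmarks are released publicly, the sketch below shows how a KMMLU-style multiple-choice benchmark could be scored once loaded; the Hugging Face repo id, the column names, and the letter-valued answer field are assumptions made for illustration, not details confirmed by the paper.

```python
# Hedged sketch: score a predictor on a KMMLU-style multiple-choice benchmark.
# The repo id and column names are hypothetical; adjust them to the released dataset.
from datasets import load_dataset


def format_prompt(example):
    """Render one four-option question as a plain-text prompt."""
    options = "\n".join(f"{letter}. {example[letter]}" for letter in ("A", "B", "C", "D"))
    return f"{example['question']}\n{options}\nAnswer:"


def evaluate(predict, split="test"):
    """Accuracy of `predict` (prompt -> 'A'/'B'/'C'/'D') over the chosen split."""
    data = load_dataset("HAERAE-HUB/KMMLU-Redux", split=split)  # hypothetical repo id
    correct = sum(
        predict(format_prompt(ex)).strip().upper().startswith(ex["answer"])  # assumes letter answers
        for ex in data
    )
    return correct / len(data)


if __name__ == "__main__":
    # Trivial baseline that always answers "A"; swap in an actual LLM call to benchmark a model.
    print(f"accuracy: {evaluate(lambda prompt: 'A'):.3f}")
```

Any callable that maps a formatted prompt to an answer letter can be passed as `predict`, so the same loop works for API-based and local models alike.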
Related papers
- Thunder-LLM: Efficiently Adapting LLMs to Korean with Minimal Resources [5.341994281991984]
This paper presents methods to adapt an existing English-based LLM to Korean in a low-budget scenario. We describe the entire end-to-end process: collecting Korean datasets, preprocessing the data, training the model, creating downstream benchmarks, and conducting evaluations. Our new bilingual models, Thunder-LLM and Thunder-LLM-Ins, achieve superior Korean performance compared to state-of-the-art models while utilizing minimal data and computational resources.
arXiv Detail & Related papers (2025-06-18T17:33:51Z)
- Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
- MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on the English MMLU-Pro benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. To support efficient evaluation, we also provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
- MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark [20.642661835794975]
We introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings. The benchmark encompasses 21 distinct domains, comprising 1,050 question-answer pairs with 50 questions per domain. We provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages.
arXiv Detail & Related papers (2025-01-28T03:56:17Z)
- Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs [7.924819546105335]
We propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language.
arXiv Detail & Related papers (2024-10-16T10:49:22Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- GECKO: Generative Language Model for English, Code and Korean [0.02046223849354785]
We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages.
GECKO is pretrained on a balanced, high-quality corpus of Korean and English, employing the LLaMA architecture.
arXiv Detail & Related papers (2024-05-24T15:30:41Z)
- KMMLU: Measuring Massive Multitask Language Understanding in Korean [32.06346608507584]
We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM.
While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams.
arXiv Detail & Related papers (2024-02-18T11:41:07Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)