KMMLU: Measuring Massive Multitask Language Understanding in Korean
- URL: http://arxiv.org/abs/2402.11548v2
- Date: Thu, 6 Jun 2024 12:23:58 GMT
- Title: KMMLU: Measuring Massive Multitask Language Understanding in Korean
- Authors: Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman,
- Abstract summary: We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM.
While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams.
- Score: 32.06346608507584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe the best public model to score 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.
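Since the dataset is on the Hugging Face Hub and the benchmark is integrated into EleutherAI's Language Model Evaluation Harness, a minimal sketch of loading one subject and scoring a trivial baseline is shown below. The repo id HAERAE-HUB/KMMLU, the Accounting config, and the answer encoding are assumptions based on the public release; check the dataset card before relying on them.

```python
# Minimal sketch: load one KMMLU subject from the Hugging Face Hub and
# score a trivial always-"A" baseline. The repo id, config name, and
# answer encoding are assumptions -- verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("HAERAE-HUB/KMMLU", "Accounting", split="test")

correct = 0
for ex in ds:
    # Each question offers four options; "answer" is assumed to hold the
    # 1-indexed gold option (1 = A, ..., 4 = D).
    pred = 1  # stand-in for a real model's predicted option
    correct += int(pred == int(ex["answer"]))

print(f"always-A accuracy: {correct / len(ds):.3f}")
```

With the Evaluation Harness installed, a command along the lines of `lm_eval --model hf --model_args pretrained=<model> --tasks kmmlu --num_fewshot 5` should run the full benchmark, assuming kmmlu is the registered task name.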
Related papers
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation [63.83457341009046]
JMMMU (Japanese MMMU) is the first large-scale Japanese benchmark designed to evaluate large multimodal models (LMMs) on expert-level tasks grounded in the Japanese cultural context.
Using the culture-agnostic (CA) subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is attributable purely to language variation.
By combining both subsets, we identify that some LMMs perform well on the CA subset but not on the culture-specific (CS) subset, exposing models whose grasp of the Japanese language is not matched by depth of cultural understanding.
arXiv Detail & Related papers (2024-10-22T17:59:56Z)
- Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs [7.924819546105335]
We propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard.
The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities.
Four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language.
arXiv Detail & Related papers (2024-10-16T10:49:22Z)
- RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing.
RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline.
Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z)
- GECKO: Generative Language Model for English, Code and Korean [0.02046223849354785]
We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages.
GECKO is pretrained on a balanced, high-quality corpus of Korean and English, using the LLaMA architecture.
arXiv Detail & Related papers (2024-05-24T15:30:41Z)
- HyperCLOVA X Technical Report [119.94633129762133]
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture.
HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets.
The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English.
arXiv Detail & Related papers (2024-04-02T13:48:49Z)
- Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models [9.359647125218359]
This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models.
Our method can significantly boost non-English proficiency within just 2 billion training tokens (see the sketch below).
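As a rough illustration of the general vocabulary-expansion technique (not EEVE's actual recipe), the sketch below adds Korean tokens to a tokenizer and resizes the model's embedding matrix using the Hugging Face transformers API; the base model and the new tokens are placeholders.

```python
# Generic vocabulary-expansion sketch with Hugging Face transformers --
# an illustration of the technique, not EEVE's actual pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # placeholder base model, chosen only so the snippet runs
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add Korean tokens so frequent Korean strings need fewer subwords.
new_tokens = ["안녕하세요", "대한민국", "언어모델"]  # illustrative only
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the enlarged vocabulary; the new
# rows are randomly initialized and only become useful after further
# training on Korean text (the part EEVE's method actually optimizes).
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```

The mechanical steps above are cheap; the contribution of work like EEVE lies in training the new embeddings efficiently, which this snippet does not cover.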
arXiv Detail & Related papers (2024-02-22T17:12:39Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval covers 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark covering a wide range of subjects, including the natural sciences, social sciences, engineering, and the humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding [6.414554168135807]
KoreALBERT is a monolingual ALBERT model specifically for Korean language understanding.
Our pretrained KoreALBERT outperforms its BERT counterpart on 6 different NLU tasks.
arXiv Detail & Related papers (2021-01-27T12:48:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.