Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
- URL: http://arxiv.org/abs/2410.12445v1
- Date: Wed, 16 Oct 2024 10:49:22 GMT
- Title: Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
- Authors: Hyeonwoo Kim, Dahyun Kim, Jihoo Kim, Sukyung Lee, Yungi Kim, Chanjun Park
- Abstract summary: We propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard.
The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities.
Four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language.
- Score: 7.924819546105335
- Abstract: The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of English benchmarks, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.
Related papers
- Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention [71.12193680015622]
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing.
LLMs exhibit significant performance gaps among different languages.
We propose Inference-Time Cross-Lingual Intervention (INCLINE) to overcome these limitations without incurring significant costs.
arXiv Detail & Related papers (2024-10-16T11:23:03Z)
- RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing.
RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline.
Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z)
- Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z)
- Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark [11.389789978431446]
This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean.
arXiv Detail & Related papers (2024-05-31T02:05:45Z)
- Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models [9.359647125218359]
This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models.
Our method can significantly boost non-English proficiency within just 2 billion tokens.
arXiv Detail & Related papers (2024-02-22T17:12:39Z)
- KMMLU: Measuring Massive Multitask Language Understanding in Korean [32.06346608507584]
We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM.
While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams.
arXiv Detail & Related papers (2024-02-18T11:41:07Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Improving Translation Faithfulness of Large Language Models via Augmenting Instructions [89.76691340615848]
We propose SWIE (Segment-Weighted Instruction Embedding) and an instruction-following dataset OVERMISS.
SWIE improves the model's instruction understanding by adding a global instruction representation to the subsequent input and response representations.
OVERMISS improves model faithfulness by comparing over-translation and miss-translation results with the correct translation.
arXiv Detail & Related papers (2023-08-24T09:32:29Z)
- CLEVA: Chinese Language Models EVAluation Platform [92.42981537317817]
We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs.
Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard.
To alleviate contamination, CLEVA curates a significant proportion of new data and employs a sampling strategy that guarantees a unique subset for each leaderboard round.
Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
arXiv Detail & Related papers (2023-08-09T09:11:31Z)
- KOBEST: Korean Balanced Evaluation of Significant Tasks [3.664687661363732]
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field.
We propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks.
arXiv Detail & Related papers (2022-04-09T20:13:51Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.