KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
- URL: http://arxiv.org/abs/2510.15558v1
- Date: Fri, 17 Oct 2025 11:45:15 GMT
- Title: KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
- Authors: Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim
- Abstract summary: We introduce the Korean Instruction-following Task Evaluation (KITE), a benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks.
- Score: 36.90941464587649
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.
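The abstract describes an evaluation pipeline that combines automated metrics with human assessments. As a minimal sketch of what a rule-based automated check for open-ended instruction following can look like, the snippet below scores a response by the fraction of verifiable constraints it satisfies. The constraint names and checking logic are purely illustrative assumptions, not KITE's actual implementation.

```python
# Illustrative rule-based instruction-following scorer.
# The constraints below are hypothetical examples, not KITE's metrics.

def check_keyword_inclusion(response: str, keywords: list[str]) -> bool:
    """Pass if every required keyword appears in the response."""
    return all(kw in response for kw in keywords)

def check_max_sentences(response: str, limit: int) -> bool:
    """Pass if the response has at most `limit` sentences (naive split
    on sentence-ending punctuation)."""
    normalized = response.replace("?", ".").replace("!", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return len(sentences) <= limit

def score_response(response: str, constraints: list) -> float:
    """Fraction of constraints satisfied, in [0.0, 1.0]."""
    results = [check(response) for check in constraints]
    return sum(results) / len(results)

# Example: require the word "benchmark" and a two-sentence limit.
constraints = [
    lambda r: check_keyword_inclusion(r, ["benchmark"]),
    lambda r: check_max_sentences(r, 2),
]
print(score_response("KITE is a new benchmark.", constraints))  # 1.0
```

Checks like these are cheap and reproducible, which is why open-ended benchmarks typically pair them with human judgments for qualities (fluency, honorific register) that rules cannot capture.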
Related papers
- Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models [0.0]
We introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends.
arXiv Detail & Related papers (2025-03-29T04:17:58Z)
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z)
- HyperCLOVA X Technical Report [119.94633129762133]
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture.
HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets.
The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English.
arXiv Detail & Related papers (2024-04-02T13:48:49Z)
- Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean.
We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z)
- HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models [0.0]
We introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth.
The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension.
arXiv Detail & Related papers (2023-09-06T04:38:16Z)
- KOBEST: Korean Balanced Evaluation of Significant Tasks [3.664687661363732]
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field.
We propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks.
arXiv Detail & Related papers (2022-04-09T20:13:51Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.